REVIEW
Generalized Deformable Models, Statistical Physics, and Matching Problems
Alan L. Yuille
Division of Applied Science, Harvard University, Cambridge, MA 02138 USA
Neural Computation 2, 1-24 (1990) © 1990 Massachusetts Institute of Technology

We describe how to formulate matching and combinatorial problems of vision and neural network theory by generalizing elastic and deformable template models to include binary matching elements. Techniques from statistical physics, which can be interpreted as computing marginal probability distributions, are then used to analyze these models and are shown to (1) relate them to existing theories and (2) give insight into the relations between, and relative effectiveness of, existing theories. In particular we exploit the power of statistical techniques to put global constraints on the set of allowable states of the binary matching elements. The binary elements can then be removed analytically before minimization. This is demonstrated to be preferable to existing methods of imposing such constraints by adding bias terms in the energy functions. We give applications to winner-take-all networks, correspondence for stereo and long-range motion, the traveling salesman problem, deformable template matching, learning, content addressable memories, and models of brain development. The biological plausibility of these networks is briefly discussed.

1 Introduction

A number of problems in vision and neural network theory that involve matching features between image frames, detecting features, learning, and solving combinatorial tasks (such as the traveling salesman problem) can be formulated in terms of minimizing cost functions usually involving either binary decision fields (for example, Hopfield and Tank 1985) or continuous fields, for example the elastic net algorithms (Durbin and Willshaw 1987). Elastic net models have been used for problems in vision, speech, and neural networks (Burr 1981a,b; Durbin and Willshaw 1987; Durbin et al. 1989; Kass et al. 1987; Terzopoulos et al. 1987). They are also related to work of Pentland (1987) and Fischler and Elschlager (1973). Deformable templates (Yuille et al. 1989b; Lipson et al. 1989) are simple parameterized shapes that interact with an image to obtain the best fit of their parameters. They have more structure than elastic nets and fewer degrees of
freedom (numbers of parameters). Elastic nets can be, loosely, thought of as the limit of deformable templates as the number of parameters goes to infinity.

In this paper we will generalize the ideas of elastic nets and deformable templates to include binary matching fields. The resulting models, generalized deformable models, will be applied to a range of problems in vision and neural net theory. These models will contain both binary and continuous valued fields. We will describe techniques, typically equivalent to computing marginal probability distributions, that can be used to convert these models into theories involving one type of field only. This will relate our model to existing theories and also give insight into the relationships between, and relative effectiveness of, existing theories.

These techniques are best described using a statistical framework. Any problem formulated in terms of minimizing an energy function can be given a probabilistic interpretation by use of the Gibbs distribution (see, for example, Parisi 1988). This reformulation has two main advantages: (1) it relates energy functions to probabilistic models (see, for example, Kirkpatrick et al. 1983; Geman and Geman 1984), thereby allowing powerful statistical techniques to be used and giving the theory a good philosophical basis founded on Bayes' theorem (1783), and (2) it connects to ideas and techniques used in statistical physics (Parisi 1988). We will make use of techniques from these two sources extensively in this paper.

The connection to statistical physics has long been realized. The Hopfield models (Hopfield 1984; Hopfield and Tank 1985) and some of the associated vision models (Yuille 1989) can be shown to correspond to finding solutions to the mean field equations of a system at finite temperature. A number of algorithms proposed for these models will then correspond (Marroquin 1987) to deterministic forms of Monte Carlo algorithms.
The free parameter $\lambda$ in the Hopfield models corresponds to the inverse of the temperature and varying it gives rise to a form of deterministic annealing.

An important advance, for computer vision, was made by Geiger and Girosi (1989), who were studying models of image segmentation with an energy function $E(f, l)$ (Geman and Geman 1984) that depended on a field $f$, representing the smoothed image, and a binary field $l$ representing edges in the image. Geiger and Girosi (1989) observed that if the partition function of the system, defined by

$$Z = \sum_{f,l} e^{-\beta E(f,l)},$$

was known then the mean fields $\bar{f}$ and $\bar{l}$ could be calculated and used as minimum variance estimators (Gelb 1974) for the fields (equivalent to the maximum a posteriori estimators in the limit as $\beta \to \infty$). They showed that the contribution to the partition function from the $l$ field could be
explicitly evaluated leaving an effective energy $E_{\mathrm{eff}}(f)$ depending on the $f$ field only. Moreover, this effective energy was shown to be closely related (and exactly equivalent for some parameter values) to that used by Blake (1983, 1989) in his weak constraint approach to image segmentation. Eliminating the $l$ field in this way can be thought of (Wyatt, private communication) as computing the marginal probability distribution $p_m(f) = \sum_l p(f, l)$, where $p(f, l)$ is defined by the Gibbs distribution $p(f, l) = e^{-\beta E(f,l)}/Z$ with $Z$ a normalization constant (the partition function). This yields $p_m(f) = e^{-\beta E_{\mathrm{eff}}(f)}/Z_{\mathrm{eff}}$, where $Z_{\mathrm{eff}}$ is a normalization constant.

It was then realized that using the partition function gave another important advantage since global constraints over the set of possible fields could be enforced during the evaluation of the partition function. This was emphasized by Geiger and Yuille (1989) in their analysis of the winner-take-all problem, discussed in detail in the next section. The constraint that there is only one winner can be imposed either by (1) adding a term in the energy function to bias toward final states with a single winner, or (2) evaluating the partition function only over configurations with a single winner. For this problem method (2) is definitely preferable since it leads directly to a formula for the minimum variance estimators. On the other hand, (1) only leads to an algorithm that might converge to the minimum variance estimators, but that might also get trapped in a local minimum. This suggested that it was preferable to impose global constraints while evaluating the partition function since the alternative led to unnecessary local minima and increased computation. This conclusion was reinforced by applications to the correspondence problem in stereo (Yuille et al. 1989a) and long-range motion (see Section 3).
Both these problems can be posed as matching problems with some prior expectations on the outcome.

Further support for the importance of imposing global constraints in this manner came from work on the traveling salesman problem (TSP). It has long been suspected that the greater empirical success (Wilson and Pawley 1988; Peterson 1990) of the elastic net algorithm (Durbin and Willshaw 1987; Durbin et al. 1989) compared to the original Hopfield and Tank (1985) algorithm was due to the way the elastic net imposed its constraints. It was then shown (Simic 1990) that the elastic net can be obtained naturally (using techniques from statistical mechanics) from the Hopfield and Tank model by imposing constraints in this global way. In Section 4 we will provide an alternative proof of the connection between the Hopfield and Tank and elastic net energy functions by showing how they can both be obtained as special cases of a more general energy function based on generalized deformable models with binary matching fields (I was halfway through the proof of this when I received a copy of Simic's excellent preprint). Our proof seems more direct than Simic's and is based on our results for stereo (Yuille et al. 1989a).
In related work Peterson and Soderberg (1989) also imposed global constraints on the Hopfield and Tank model by mapping it into a Potts glass. They also independently discovered the relation between the elastic net and Hopfield and Tank (Peterson, private communication). In benchmark studies described at the Keystone Workshop 1989 (Peterson 1990) both the elastic network algorithm and the Peterson and Soderberg model gave good performance on problems with up to 200 cities (simulations with larger numbers of cities were not attempted). This contrasts favorably with the poor performance of the Hopfield and Tank model for more than 30 cities. It strongly suggests that the global constraints on the tours should be enforced while evaluating the partition function (a "hard" constraint in Simic's terminology) rather than by adding additional terms to bias the energy (a "soft" constraint).

Generalized deformable models using binary matching elements are well suited to imposing global constraints effectively. We show in Section 5 that our approach can also be applied to (1) matching using deformable templates, (2) learning, (3) content addressable memories, and (4) models of brain development.

The Boltzmann machine (Hinton and Sejnowski 1986) is another powerful stochastic method for learning that can also be applied to optimization in vision (Kienker et al. 1986). The application of mean field theory to Boltzmann machines (Peterson and Anderson 1987; Hinton 1989) has led to speed ups in learning times.

2 Using Mean Field Theory to Solve the Winner Take All
We introduce mean field (MF) theory by using it to solve the problem of winner take all (WTA). This can be posed as follows: given a set $\{T_i\}$ of $N$ inputs to a system, how does one choose the maximum input and suppress the others? For simplicity we assume all of the $T_i$ to be positive. We introduce the binary variables $V_i$ as decision functions: $V_w = 1$, $V_i = 0$ for $i \neq w$ selects the winner $w$. We will calculate the partition function $Z$ in two separate ways for comparison. The first uses a technique called the mean field approximation and gives an approximate answer; it is related to previous methods for solving WTA using Hopfield networks. The second method is novel and exact. It involves calculating the partition function for a subset of the possible $V_i$, a subset chosen to ensure that only one $V_i$ is nonzero.
2.1 WTA with the Mean Field Approximation. Define the energy function

$$E[\{V_i\}] = \frac{\nu}{2} \sum_i \sum_{j \neq i} V_i V_j - \sum_i T_i V_i, \tag{2.1}$$

where $\nu$ is a parameter to be specified. The solution of the WTA will have all the $V_i$ equal to zero except for the one corresponding to the maximum $T_i$. This constraint is imposed implicitly by the first term on the right-hand side of 2.1 (note that the constraint is encouraged rather than explicitly enforced). It can be seen that the minimum of the energy function corresponds to the correct solution only if $\nu$ is larger than the maximum possible input $T_i$ (otherwise final states with more than one nonzero $V_i$ may be energetically favorable for some inputs).

Now we formulate the problem statistically. The energy function above defines the probability of a solution for $\{V_i\}$ to be $P(\{V_i\}) = (1/Z) e^{-\beta E[\{V_i\}]}$, where $\beta$ is the inverse of the temperature parameter. The partition function is

$$Z = \sum_{\{V_i = 0,1\}} e^{-\beta E[\{V_i\}]}. \tag{2.2}$$

Observe that the mean values $\bar{V}_i$ of the $V_i$ can be found (substituting 2.1 into 2.2) from the identity

$$\bar{V}_i = \frac{1}{\beta} \frac{\partial \ln Z}{\partial T_i}. \tag{2.3}$$

We now introduce the mean field approximation by using it to compute the partition function $Z$:

$$Z_{\mathrm{approx}} = \prod_i \sum_{V_i = 0,1} e^{-\beta V_i \left( \nu \sum_{j \neq i} \bar{V}_j - T_i \right)} = \prod_i \left\{ 1 + e^{-\beta \left( \nu \sum_{j \neq i} \bar{V}_j - T_i \right)} \right\}. \tag{2.4}$$

When calculating the contribution to $Z$ from a specific element $V_i$ the mean field approximation replaces the values of the other elements $V_j$ by their mean values $\bar{V}_j$. This assumes that only low order correlations between elements are important (Parisi 1988). From $Z_{\mathrm{approx}}$ we compute the mean value $\bar{V}_i = (1/\beta)(\partial \ln Z_{\mathrm{approx}} / \partial T_i)$, holding the other means fixed, and obtain some consistency conditions, the mean field equations, on the $\bar{V}_i$:

$$\bar{V}_i = \frac{1}{1 + e^{\beta \left( \nu \sum_{j \neq i} \bar{V}_j - T_i \right)}}. \tag{2.5}$$

This equation may have several solutions, particularly as $\beta \to \infty$. We can see this by analyzing the case with $N = 2$. In the limit as $\beta \to \infty$ solutions of 2.5 will correspond to minima of the energy function given by 2.1. The energy will take values $-T_1$ and $-T_2$ at the points $(V_1, V_2) = (1, 0)$ and $(V_1, V_2) = (0, 1)$. On the diagonal line $V_1 = V_2 = V$ separating these points the energy is given by $E(V) = \nu V^2 - V(T_1 + T_2)$. Using the condition that $\nu > T_{\max}$, we can see that for possible choices of $T_1, T_2$ (with $T_1 = \nu - e_1$, $T_2 = \nu - e_2$ for small $e_1$ and $e_2$) the energy on the diagonal is larger than $-T_1$ and $-T_2$, hence the energy function has at least two minima.
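The two-minima argument for $N = 2$ can be checked numerically. The Python sketch below is only an illustration: it assumes the constraint term counts each pair of elements once (so that the diagonal energy is $\nu V^2 - V(T_1 + T_2)$, as stated above), and the function names and the values $\nu = 1$, $T = (0.9, 0.8)$, $\beta = 50$ are hypothetical choices, not taken from the paper.

```python
import math


def energy(v, t, nu):
    """WTA energy: nu * (sum of V_i V_j over pairs) - sum_i T_i V_i (assumed form)."""
    n = len(v)
    pairs = sum(v[i] * v[j] for i in range(n) for j in range(i + 1, n))
    return nu * pairs - sum(ti * vi for ti, vi in zip(t, v))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_field_residual(v, t, nu, beta):
    """Largest violation of the mean field equations at the point v."""
    total = sum(v)
    return max(
        abs(vi - sigmoid(beta * (ti - nu * (total - vi))))
        for vi, ti in zip(v, t)
    )


def diagonal_minimum(t, nu, samples=1001):
    """Lowest energy on the diagonal V_1 = V_2 = V, sampled on [0, 1]."""
    return min(energy([k / (samples - 1)] * 2, t, nu) for k in range(samples))
```

With these illustrative values both corners $(1, 0)$ and $(0, 1)$ nearly satisfy the mean field equations, and the lowest energy on the diagonal lies above both corner energies, so a descent started near the wrong corner has no reason to leave it.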
There are several ways to solve (2.5). One consists of defining a set of differential equations for which the fixed states are a solution of (2.5). An attractive method (described in Amit 1989), applied to other problems in vision (Marroquin 1987) and shown to be a deterministic approximation to stochastic algorithms (Metropolis et al. 1953) for obtaining thermal equilibrium, is

$$\frac{d\bar{V}_i}{dt} = -\bar{V}_i + \frac{1}{1 + e^{\beta \left( \nu \sum_{j \neq i} \bar{V}_j - T_i \right)}}. \tag{2.6}$$

If the initial conditions for the $\bar{V}_i$ satisfy an ordering constraint $\bar{V}_w \geq \bar{V}_i$ for all $i$ (which can be satisfied by initially setting the $\bar{V}_i$ to be the same), then, by adapting an argument from Yuille and Grzywacz (1989b), the system will converge to the correct solution. However, if the ordering condition is violated at any stage, due to noise fluctuations, then it may give the wrong answer.
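The role of the ordering constraint can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than the author's implementation: it integrates a relaxation equation of the form $d\bar{V}_i/dt = -\bar{V}_i + 1/(1 + e^{\beta(\nu \sum_{j \neq i} \bar{V}_j - T_i)})$ with a plain Euler scheme, and the names `relax`, `dt`, `steps` and all numerical values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relax(v, t, nu, beta, dt=0.1, steps=2000):
    """Euler integration of the assumed mean field relaxation dynamics."""
    v = list(v)
    for _ in range(steps):
        total = sum(v)
        # Right-hand side of the mean field equations at the current state.
        target = [sigmoid(beta * (ti - nu * (total - vi))) for vi, ti in zip(v, t)]
        v = [vi + dt * (gi - vi) for vi, gi in zip(v, target)]
    return v


if __name__ == "__main__":
    inputs = [0.9, 0.8]
    print(relax([0.0, 0.0], inputs, nu=1.0, beta=50.0))  # equal start
    print(relax([0.0, 1.0], inputs, nu=1.0, beta=50.0))  # ordering violated
```

Started from equal values the larger input wins, whereas a start that violates the ordering stays at the wrong winner, which is the failure mode described above.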
2.2 WTA without the Mean Field Approximation. We now impose the constraint that the $V_i$ sum to 1 explicitly during the computation of the partition function. The first term on the right-hand side of 2.1 is now unnecessary and we use an energy function

$$E_{\mathrm{WTA}}[\{V_i\}] = -\sum_i T_i V_i. \tag{2.7}$$

We compute $Z$ by summing over all possible (binary) $V_i$ under the constraint that they sum to 1. This gives

$$Z = \sum_{\{V_i = 0,1\} :\, \sum_i V_i = 1} e^{-\beta E_{\mathrm{WTA}}[\{V_i\}]} = \sum_i e^{\beta T_i}. \tag{2.8}$$
In this case no approximation is necessary and we obtain

$$\bar{V}_i = \frac{e^{\beta T_i}}{\sum_j e^{\beta T_j}}. \tag{2.9}$$

Thus, as $\beta \to \infty$ the $V_i$ corresponding to the largest $T_i$ will be switched on and the others will be off. This method gives the correct solution and needs minimal computation. This result can be obtained directly from the Gibbs distribution for the energy function 2.7, restricted to configurations with a single winner. The probabilities can be calculated

$$P(V_i = 1,\ V_j = 0 \ \text{for}\ j \neq i) = \frac{e^{\beta T_i}}{Z},$$

where $Z$ is the normalization factor and it follows directly that the means are given by 2.9. This second approach to the WTA is clearly superior to the first. In the remainder of the paper we will extend the approach to more complex problems.
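The exact means are a normalized exponential of the inputs, so they can be evaluated in one pass. The following Python sketch shows this under illustrative assumptions: the function name `wta_means` is hypothetical, and the largest input is subtracted before exponentiating purely for numerical stability (it cancels in the ratio).

```python
import math


def wta_means(t, beta):
    """Exact mean values: exp(beta * T_i) / sum_j exp(beta * T_j)."""
    t_max = max(t)
    # Subtracting t_max leaves the ratio unchanged and avoids overflow.
    weights = [math.exp(beta * (ti - t_max)) for ti in t]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    inputs = [0.2, 0.9, 0.8]
    for beta in (0.0, 10.0, 100.0):
        print(beta, wta_means(inputs, beta))
```

At $\beta = 0$ all elements share the activity equally, and as $\beta$ grows the mean of the element with the largest input approaches 1 without any iteration or risk of a spurious fixed point.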
3 Long-Range Motion Correspondence and Stereo
Both the vision problems of stereo and long-range motion can be formulated in terms of a correspondence problem between features in a set of images. We will chiefly concentrate on long-range motion since the application of these statistical ideas to stereo and some connections to psychophysical experiments is discussed elsewhere (Yuille et al. 1989a). However, many of the results for long-range motion can be directly adapted to stereo. In his seminal work Ullman (1979) formulated long-range motion as a correspondence problem between features in adjacent image time frames and proposed a minimal mapping theory, which could account for a number of psychophysical experiments. The phenomenon of motion inertia (Ramachandran and Anstis 1983; Anstis and Ramachandran 1987) shows that the past history of the motion is also relevant. However, recent theoretical studies (Grzywacz et al. 1989) and comparisons between theoretical predictions and experiments (Grzywacz 1987, 1990) suggest that the past history can often, though not always, be neglected. We will make this assumption during this section. Ullman's minimal mapping theory (1979) proposes minimizing a cost function
$$E[\{V_{ij}\}] = \sum_{i,j} V_{ij} d_{ij}, \tag{3.1}$$

where the $\{V_{ij}\}$ are binary matching elements ($V_{ij} = 1$ if the $i$th feature in the first frame matches the $j$th feature in the second frame, $V_{ij} = 0$ otherwise) and $d_{ij}$ is a measure of the distance between the $i$th point $x_i$ in the first frame and the $j$th point $y_j$ in the second frame ($d_{ij} = |x_i - y_j|$, for example). The cost function is minimized with respect to the $\{V_{ij}\}$ while satisfying the cover principle: we must minimize the total number of matches while ensuring that each feature is matched. If there are an equal number of features in the two frames then the cover principle ensures that each feature has exactly one match.

An alternative theory was proposed by Yuille and Grzywacz (1988, 1989a), the motion coherence theory for long-range motion, which formulated the problem in terms of interpolating a velocity field $v(x)$ between the data points. Supposing there are $N$ features $\{x_i\}$ in the first frame and $N$ features $\{y_j\}$ in the second they suggest minimizing

$$E[\{V_{ij}\}, v(x)] = \sum_{i,j} V_{ij} \left( y_j - x_i - v(x_i) \right)^2 + \lambda \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\, 2^n} \int |D^n v|^2 \, dx \tag{3.2}$$

with respect to $\{V_{ij}\}$ and $v(x)$, where $D^{2n} v = \nabla^{2n} v$ and $D^{2n+1} v = \nabla(\nabla^{2n} v)$. The second term on the right-hand side of 3.2 is similar to the standard
smoothness terms used in vision (Horn 1986; Bertero et al. 1987), but the use of it in conjunction with binary matching fields was novel. Yuille and Grzywacz (1989a) showed that by minimizing $E[\{V_{ij}\}, v(x)]$ with respect to the velocity field we can obtain a linear set of equations for $v(x)$ in terms of the $\{V_{ij}\}$ and the data. Substituting this back into the energy function gives

$$E[\{V_{ij}\}] = \lambda \sum_{a,b} V_{ia} \mathbf{d}_{ia} \cdot (\lambda \delta_{ij} + G_{ij})^{-1} V_{jb} \mathbf{d}_{jb}, \tag{3.3}$$

where we have used the summation convention on the repeated indices $i$ and $j$, and with $\mathbf{d}_{ia} = x_i - y_a$. The function $G_{ij} = G(x_i - x_j : \sigma)$ is a Gaussian with standard deviation $\sigma$ (it is the Green function for the smoothness operator). It ensures a smooth velocity field and the range of the smoothing depends on $\sigma$. If we take the limit as $\sigma \to 0$ we see that the motion coherence theory becomes equivalent to the minimal mapping theory with the squared distance norm (this is because $(\lambda \delta_{ij} + G_{ij})^{-1} \propto \delta_{ij}$). For nonzero values of $\sigma$ the motion coherence theory gives a smoother motion than minimal mapping and seems more consistent with experiments on motion capture (Grzywacz et al. 1989).

Minimal mapping and motion coherence are both formulated in terms of cost functions, which must be minimized over a subset of the possible configurations of the binary variables. It is natural to wonder whether we can repeat our success with the winner take all and compute the partition function for these theories. For minimal mapping we seek to compute

$$Z = \sum_{\{V_{ij}\}} e^{-\beta \sum_{i,j} V_{ij} d_{ij}}, \tag{3.4}$$
where the sum is taken over all possible matchings that satisfy the cover principle. The difficulty is that there is no natural way to enforce the cover principle and we are reduced to evaluating the right-hand side term by term, which is combinatorially explosive for large numbers of points [a method for imposing constraints of this type by using dummy fields was proposed by Simic (1990) for the traveling salesman problem; however, this will merely transform minimal mapping theory into a version of the motion coherence theory]. If we impose instead the weaker requirement that each feature in the first frame is matched to a unique feature in the second (i.e., for each $i$ there exists $j$ such that $V_{ij} = 1$ and $V_{ik} = 0$ for $k \neq j$) then we will still obtain poor matching since, without the cover principle, minimal mapping theory reduces to "each feature for itself" and is unlikely to give good coherent results (for example, all features in the first frame might be matched to a single feature in the second frame).
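Both difficulties can be seen on a toy example. The Python sketch below is illustrative only: it assumes equal numbers of features (so that the cover principle reduces to one-to-one matchings), takes a precomputed distance table `d[i][j]`, and uses hypothetical function names throughout.

```python
import itertools
import math


def z_cover(d, beta):
    """Sum over one-to-one matchings, term by term: N! terms."""
    n = len(d)
    return sum(
        math.exp(-beta * sum(d[i][p[i]] for i in range(n)))
        for p in itertools.permutations(range(n))
    )


def z_unique_match(d, beta):
    """One match per first-frame feature: the sum factorizes over features."""
    z = 1.0
    for row in d:
        z *= sum(math.exp(-beta * dij) for dij in row)
    return z


def z_unique_match_brute(d, beta):
    """Same quantity as z_unique_match, enumerated term by term (N**N terms)."""
    n = len(d)
    return sum(
        math.exp(-beta * sum(d[i][c[i]] for i in range(n)))
        for c in itertools.product(range(n), repeat=n)
    )


def nearest_matches(d):
    """Low-temperature limit without the cover principle: each feature for itself."""
    return [min(range(len(row)), key=row.__getitem__) for row in d]


if __name__ == "__main__":
    x = [0.0, 1.0]            # first-frame features (1-D, hypothetical)
    y = [0.4, 5.0]            # second-frame features
    d = [[abs(xi - yj) for yj in y] for xi in x]
    print(z_cover(d, 1.0), z_unique_match(d, 1.0), nearest_matches(d))
```

The cover-principle sum has no factorized form and must be enumerated, while the weaker constraint factorizes but lets both features in this example choose the same partner.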
The situation is rather different for the motion coherence theory. Although it is equally hard to impose the cover principle, the requirement that each feature in the first frame has a unique match will still tend to give a coherent motion because of the overall smoothness requirement on the velocity field (for nonzero $\sigma$). We now compute the partition function using this requirement:

$$Z = \sum_{\{V_{ij}\},\, v(x)} e^{-\beta \left\{ \sum_{i,j} V_{ij} (y_j - x_i - v(x_i))^2 + E_s[v] \right\}}, \tag{3.5}$$

where $E_s[v]$ is the smoothness term (the second term on the right-hand side of 3.2) and the "sum" over $v(x)$ is really an integral over all possible velocity fields. Computing the sum over all binary elements satisfying the unique match requirement from frame 1 to frame 2 gives

$$Z = \sum_{v(x)} e^{-\beta E_s[v]} \prod_i \left\{ \sum_j e^{-\beta (y_j - x_i - v(x_i))^2} \right\}. \tag{3.6}$$

This can be rewritten as

$$Z = \sum_{v(x)} e^{-\beta E_{\mathrm{eff}}[v]}, \tag{3.7}$$

where the effective energy is

$$E_{\mathrm{eff}}[v] = -\frac{1}{\beta} \sum_i \log \left\{ \sum_j e^{-\beta (y_j - x_i - v(x_i))^2} \right\} + E_s[v]. \tag{3.8}$$

It is straightforward to check that the marginal distribution of $v$ is given by

$$p_m(v) = \frac{1}{Z_{\mathrm{eff}}} e^{-\beta E_{\mathrm{eff}}[v]}, \tag{3.9}$$

where $Z_{\mathrm{eff}}$ is a normalization constant. Thus minimization of $E_{\mathrm{eff}}[v]$ with respect to $v(x)$ is equivalent to taking the maximum a posteriori estimate of $v$ given the marginal probability distribution. Observe that as $\beta \to \infty$ the exponent $\beta E_{\mathrm{eff}}[v]$ diverges unless each feature $x_i$ in the first frame is matched to at least one feature $y_j$ in the second frame and is assigned a velocity field $v(x_i) \approx (y_j - x_i)$ (if this was not the case then we would get a contribution of $-\log 0 = +\infty$ from the $i$th feature in the first frame). Thus, the constraint we imposed during our summation (going from 3.5 to 3.8) is expressed directly in the resulting effective energy.
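The matching part of the effective energy is a soft minimum over candidate matches, which a short numerical sketch makes concrete. The Python code below is an illustration under stated assumptions: features and velocities are one-dimensional, the smoothness term $E_s[v]$ is omitted, and the names `matching_term` and `hard_matching_term` are hypothetical.

```python
import math


def matching_term(x, y, v, beta):
    """-(1/beta) * sum_i log sum_j exp(-beta * (y_j - x_i - v_i)**2), 1-D features."""
    total = 0.0
    for xi, vi in zip(x, v):
        sq = [(yj - xi - vi) ** 2 for yj in y]
        m = min(sq)
        # Log-sum-exp with the smallest squared residual factored out.
        total += m - math.log(sum(math.exp(-beta * (s - m)) for s in sq)) / beta
    return total


def hard_matching_term(x, y, v):
    """Large-beta limit: each x_i pays only its smallest squared residual."""
    return sum(min((yj - xi - vi) ** 2 for yj in y) for xi, vi in zip(x, v))


if __name__ == "__main__":
    x, y, v = [0.0, 1.0], [0.5, 2.0], [0.0, 0.0]
    for beta in (1.0, 10.0, 50.0):
        print(beta, matching_term(x, y, v, beta), hard_matching_term(x, y, v))
```

For small $\beta$ every candidate match contributes, and as $\beta$ increases the term approaches the sum of the best squared residuals, so each first-frame feature effectively commits to one match.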
This suggests a method to force unique correspondence between the two frames. Why not make the effective energy symmetric by adding a term

$$E_{\mathrm{sym}}[v] = -\frac{1}{\beta} \sum_j \log \left\{ \sum_i e^{-\beta (y_j - x_i - v(x_i))^2} \right\}? \tag{3.10}$$

By a similar argument to the one above, minimizing this cost function will ensure that every point in the second frame is matched to a point in the first frame. Minimizing the energy $E_{\mathrm{eff}}[v(x)] + E_{\mathrm{sym}}[v(x)]$ will therefore ensure that all points are matched (as $\beta \to \infty$). We have not explicitly ensured that the matching is symmetric (i.e., if $x_i$ in the first frame is matched to $y_j$ in the second frame then $y_j$ is matched to $x_i$); however, we argue that this will effectively be imposed by the smoothness term (nonsymmetric matches will give rise to noncoherent velocity fields). We can obtain the additional term $E_{\mathrm{sym}}$ by modifying the cost function of the motion coherence theory to be

$$E[\{V_{ij}\}, \{U_{ij}\}, v(x)] = \sum_{i,j} V_{ij} (y_j - x_i - v(x_i))^2 + \sum_{i,j} U_{ij} (y_j - x_i - v(x_i))^2 + \lambda \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\, 2^n} \int |D^n v|^2 \, dx, \tag{3.11}$$

where the $\{V_{ij}\}$ and $\{U_{ij}\}$ are binary matching elements from the first to the second frame and from the second to the first, respectively. As we sum over the possible configurations of the $\{V_{ij}\}$ and $\{U_{ij}\}$ we restrict the configurations to having unique matches for all points in the first frame and the second frame, respectively. This gives us an effective energy for the motion coherence theory, which is somewhat analogous to the elastic net algorithm for the traveling salesman problem (Durbin and Willshaw 1987; Durbin et al. 1989), and which can be minimized by similar deterministic annealing techniques (steepest descent with $\beta$ gradually increased). Current simulations demonstrate the effectiveness of this approach and, in particular, the advantages of matching from both frames.

Notice in the above discussion the play off between the set of configurations we sum over and the a priori constraints (the smoothness term $E_s[v]$). We wish to impose as many constraints as possible during the computation of the partition function, but not at the expense of having to explicitly compute all possible terms (which would be a form of exhaustive search). Instead we impose as many constraints on the configurations as we can that make the resulting effective energy simple and
rely on the a priori terms to impose the remaining constraints by biases in the energy. This play off between the set of configurations we sum over and the energy biases caused by the a priori terms occurs in many applications of this approach. Notice that minimal mapping theory does not contain any a priori terms and hence cannot make use of this play off. We should point out, however, that there are ways to minimize the minimal mapping energy using an analogue algorithm based on a formulation in terms of linear programming (Ullman 1979).

We can also obtain an effective energy for the velocity field if line processes (Geman and Geman 1984) are used to break the smoothness constraint. The line process field and the matching elements can be integrated out directly. This calculation has been performed for stereo (Yuille et al. 1989a). Other examples are discussed in Clark and Yuille (1990).

4 The Traveling Salesman Problem

The traveling salesman problem (TSP) consists of finding the shortest possible tour through a set of cities $\{x_i\}$, $i = 1, \ldots, N$, visiting each city once only. We shall refer to tours passing through each city exactly once as admissible tours and everything else as inadmissible tours. We propose to formulate the TSP as a matching problem with a set of hypothetical cities $\{y_j\}$, $j = 1, \ldots, N$, which have a prior distribution on them biasing toward minimum length. We can write this in the form

$$E[\{V_{ij}\}, \{y_j\}] = \sum_{i,j} V_{ij} |x_i - y_j|^2 + \nu \sum_j |y_j - y_{j+1}|^2, \tag{4.1}$$

where the $V_{ij}$ are binary matching elements as before: $V_{ij} = 1$ if $x_i$ is matched to $y_j$ and is 0 otherwise. The second term on the right-hand side of 4.1 minimizes the square of the distances; other choices will be discussed later. The idea is to define a Gibbs distribution

$$P[\{V_{ij}\}, \{y_j\}] = \frac{1}{Z} e^{-\beta E[\{V_{ij}\}, \{y_j\}]} \tag{4.2}$$

for the states of the system. Our "solutions" to the TSP will correspond to the means of the fields $\{V_{ij}\}$ and $\{y_j\}$ as $\beta \to \infty$. We must, however, put constraints that the matrix $V_{ij}$ has exactly one "1" in each row and each column to guarantee an admissible tour. Observe that if the $\{V_{ij}\}$ are fixed the probability distributions for the $\{y_j\}$ are products of Gaussians.

The relation between Hopfield and Tank (1985) and the elastic net algorithm (Durbin and Willshaw 1987) follows directly from 4.1. To obtain Hopfield and Tank we eliminate the $\{y_j\}$ field from 4.1 to obtain an energy depending on the $\{V_{ij}\}$ only. The elastic net is obtained by averaging out the $\{V_{ij}\}$. The relation between the two algorithms has
been previously shown by Simic (1990) using different, though closely related, techniques from statistical mechanics. The elastic net algorithm seems to perform better than the Hopfield and Tank algorithm on large numbers of cities. The Hopfield and Tank algorithm has problems for more than 30 cities (Wilson and Pawley 1988), while the elastic net still yields good solutions for at least 200 cities (see benchmark studies at the Keystone Workshop 1989 presented in Peterson 1990).

4.1 Obtaining the Elastic Net Algorithm. We start by writing the partition function

$$Z = \sum_{\{V_{ij}\}, \{y_j\}} e^{-\beta E[\{V_{ij}\}, \{y_j\}]}, \tag{4.3}$$

where the sum is taken over all the possible states of the $\{V_{ij}\}$ and the $\{y_j\}$ (strictly speaking the $\{y_j\}$ are integrated over, not summed over). We now try evaluating $Z$ over all possible states of the $\{V_{ij}\}$. As noted previously, we are guaranteed a unique tour if and only if the matrix $V_{ij}$ contains exactly one "1" in each row and each column. Enforcing this constraint as we compute the partition function leads, however, to a very messy expression for the partition function and involves a prohibitive amount of computation. Instead we choose to impose a weaker constraint: that each city $x_i$ is matched to exactly one hypothetical city $y_j$ (i.e., for each $i$, $V_{ij} = 1$ for exactly one $j$). Intuitively this will generate an admissible tour since only states with each $x_i$ matched to one $y_j$ will be probable. In fact one can prove that this will happen for large $\beta$ (Durbin et al. 1989), given certain conditions on the constant $\nu$. Thus, our computation of the partition function will involve some inadmissible tours, but these will be energetically unfavorable. Writing the partition function as

$$Z = \sum_{\{V_{ij}\}, \{y_j\}} e^{-\beta \sum_{i,j} V_{ij} |x_i - y_j|^2} \, e^{-\beta \nu \sum_j |y_j - y_{j+1}|^2}, \tag{4.4}$$

we perform the sum over the $\{V_{ij}\}$ using the constraint that for each $i$ there exists exactly one $j$ such that $V_{ij} = 1$. This gives

$$Z = \sum_{\{y_j\}} e^{-\beta \nu \sum_j |y_j - y_{j+1}|^2} \prod_i \left\{ \sum_j e^{-\beta |x_i - y_j|^2} \right\}. \tag{4.5}$$

This can be written as

$$Z = \sum_{\{y_j\}} e^{-\beta E_{\mathrm{eff}}[\{y_j\}]}, \tag{4.6}$$

where

$$E_{\mathrm{eff}}[\{y_j\}] = -\frac{1}{\beta} \sum_i \log \left\{ \sum_j e^{-\beta |x_i - y_j|^2} \right\} + \nu \sum_j |y_j - y_{j+1}|^2 \tag{4.7}$$
is exactly the Durbin-Willshaw-Yuille energy function for the elastic net algorithm (Durbin and Willshaw 1987). Finding the global minimum of $E_{\mathrm{eff}}[\{y_j\}]$ with respect to the $\{y_j\}$ in the limit as $\beta \to \infty$ will give the shortest possible tour (assuming the square norm of distances between cities). Durbin and Willshaw (1987) perform a version of deterministic annealing, minimizing $E_{\mathrm{eff}}[\{y_j\}]$ with respect to the $\{y_j\}$ by steepest descent for small $\beta$ and then gradually increasing $\beta$ (which corresponds to $1/K^2$ in their notation). Computer simulations show that this gives good solutions to the TSP for at least 200 cities (see Keystone Workshop simulations reported in Peterson 1990). This algorithm was analyzed in Durbin et al. (1989), who showed that there was a minimum value of $\beta$ (maximum value of $K$) below which nothing interesting happened and which could therefore be used as a starting value (it corresponds to a phase transition in statistical mechanics terminology).

The present analysis gives a natural probabilistic interpretation of the elastic net algorithm in terms of the probability distribution 4.2 (see also Simic 1990). It corresponds to the best match of a structure $\{y_j\}$ with a priori constraints (minimal squared length) to the set of cities $\{x_i\}$. This is closely related to the probabilistic interpretation in Durbin et al. (1989) where the $\{x_i\}$ were interpreted as corrupted measurements of the $\{y_j\}$.

There is nothing sacred in using a squared norm for either term in 4.1. The analysis would have been essentially similar if we had, for example, replaced it by terms like $V_{ij} |x_i - y_j|$ and $|y_j - y_{j+1}|$. This would simply give an alternative probability distribution. There do, however, seem to be severe implementation problems that arise when the modulus is used instead of the square norm (Durbin, personal communication). Until these difficulties are overcome the square norm seems preferable.

4.2 Obtaining Hopfield and Tank.
The Hopfield and Tank algorithm can be found by eliminating the $\{y_j\}$ from the energy 4.1. This could be done by integrating them out of the partition function (observe that this would be straightforward since it is equivalent to integrating products of Gaussians). Instead we propose, following Yuille and Grzywacz (1989a), to eliminate them by directly minimizing the energy with respect to the $\{y_j\}$, solving the linear equations for the $\{y_j\}$ and substituting back into the energy to obtain a new energy dependent only on the $\{V_{ij}\}$. (The two methods should give identical results for quadratic energy functions.) This process will give us an energy function very similar to Hopfield and Tank, provided we make a crucial approximation.
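For comparison with the elimination route just described, the other route (averaging out the matching elements and annealing the resulting effective energy, as in Section 4.1) can be sketched in a few lines of Python. This is a minimal sketch and not the Durbin and Willshaw implementation: it assumes an effective energy made of a soft matching term plus $\nu$ times the squared length of a closed tour, cities in the plane, and a heuristic step size that shrinks as $\beta$ grows; every function name and parameter value here is hypothetical.

```python
import math


def soft_match(x, ys, beta):
    """Weights w_j ~ exp(-beta * |x - y_j|^2) and the value -(1/beta) * log(sum)."""
    sq = [(x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 for y in ys]
    m = min(sq)
    w = [math.exp(-beta * (s - m)) for s in sq]
    z = sum(w)
    return [wj / z for wj in w], m - math.log(z) / beta


def energy(cities, ys, beta, nu):
    """Soft matching term plus nu times the squared length of the closed tour."""
    n = len(ys)
    match = sum(soft_match(x, ys, beta)[1] for x in cities)
    length = sum(
        (ys[j][0] - ys[(j + 1) % n][0]) ** 2 + (ys[j][1] - ys[(j + 1) % n][1]) ** 2
        for j in range(n)
    )
    return match + nu * length


def gradient(cities, ys, beta, nu):
    """Gradient of the energy above with respect to every tour point y_j."""
    n = len(ys)
    g = [[0.0, 0.0] for _ in range(n)]
    for x in cities:
        w, _ = soft_match(x, ys, beta)
        for j in range(n):
            for c in range(2):
                g[j][c] += 2.0 * w[j] * (ys[j][c] - x[c])
    for j in range(n):
        for c in range(2):
            g[j][c] += 2.0 * nu * (2.0 * ys[j][c] - ys[(j + 1) % n][c] - ys[j - 1][c])
    return g


def descend(cities, ys, beta, nu, step, iters):
    """Plain steepest descent at fixed beta."""
    for _ in range(iters):
        g = gradient(cities, ys, beta, nu)
        ys = [[y[c] - step * gj[c] for c in range(2)] for y, gj in zip(ys, g)]
    return ys


def anneal(cities, ys, nu, betas, step=0.01, iters=200):
    """Deterministic annealing: descend at each beta of an increasing schedule."""
    for beta in betas:
        # Heuristic (an assumption): smaller steps as the landscape sharpens.
        ys = descend(cities, ys, beta, nu, step / (1.0 + beta), iters)
    return ys
```

A typical call would start `ys` as a small ring around the centroid of the cities and pass an increasing list such as `[1.0, 2.0, 4.0, 8.0]` for `betas`; the quality of the resulting tour depends on the schedule and on $\nu$, which are not tuned here.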
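Minimizing the energy 4.1 with respect to the $\{y_j\}$ gives a set of equations

$$\sum_i V_{ij} (y_j - x_i) + \nu (2 y_j - y_{j+1} - y_{j-1}) = 0 \tag{4.8}$$

for each $j$. We use the constraint that there is a unique match to set $\sum_i V_{ij} y_j = y_j$. This gives

$$y_j + \nu (2 y_j - y_{j+1} - y_{j-1}) = \sum_i V_{ij} x_i, \tag{4.9}$$

$$\sum_k \{\delta_{jk} + \nu A_{jk}\} y_k = \sum_i V_{ij} x_i, \tag{4.10}$$

where $A_{ij} = 2\delta_{ij} - \delta_{i,j+1} - \delta_{i,j-1}$. We can now invert the matrix to obtain

$$y_i = \sum_k \{\delta_{ik} + \nu A_{ik}\}^{-1} \sum_l V_{lk} x_l. \tag{4.11}$$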
We now make the important approximation that if $\nu$ is small then

$$\{\delta_{ik} + \nu A_{ik}\}^{-1} \approx \{\delta_{ik} - \nu A_{ik}\}. \tag{4.12}$$
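The size of the neglected terms can be checked directly, since $(I + \nu A)(I - \nu A) = I - \nu^2 A^2$. The Python sketch below is an illustration only: it assumes a closed tour, so that $A$ is the cyclic matrix with 2 on the diagonal and $-1$ on the two neighboring positions, and the function names are hypothetical.

```python
def tour_matrix(n):
    """Cyclic matrix with A_ii = 2 and A_ij = -1 for the two tour neighbors of i."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] += 2.0
        a[i][(i + 1) % n] -= 1.0
        a[i][(i - 1) % n] -= 1.0
    return a


def first_order_residual(n, nu):
    """Largest absolute entry of (I + nu*A)(I - nu*A) - I."""
    a = tour_matrix(n)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    left = [[eye[i][j] + nu * a[i][j] for j in range(n)] for i in range(n)]
    right = [[eye[i][j] - nu * a[i][j] for j in range(n)] for i in range(n)]
    worst = 0.0
    for i in range(n):
        for j in range(n):
            entry = sum(left[i][k] * right[k][j] for k in range(n)) - eye[i][j]
            worst = max(worst, abs(entry))
    return worst


if __name__ == "__main__":
    for nu in (0.1, 0.01):
        print(nu, first_order_residual(6, nu))
```

For tours of five or more points the largest entry of $A^2$ is 6, so the residual is $6\nu^2$: the error is second order in $\nu$, and because $A^2$ has entries two positions away from the diagonal, it is exactly the coupling beyond nearest neighbors that is being dropped.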
We can see from the form of the matrix $A_{ij}$ that this corresponds to assuming only nearest neighbor interactions on the tour (the full expansion in 4.11 would involve higher powers of $A_{ij}$ and hence would introduce interactions over the entire tour). Assuming the approximation 4.12 we can now use 4.11 to substitute for $y_i$ into the energy function 4.1. After some tedious algebra we obtain
(4.13) where djk = Ix, - XkI is the distance between the jth and kth cities. Equation 4.13 is the basis for the Hopfield and Tank (1985) energy function (although they use d 3 k instead of d$). To impose the constraints on admissible tours they incorporate additional terms B C ,Cj,k, V,,V,k+ B C , C,,k, f k ~ , V ~ , + C V,, { ~- ,N, } 2 into the energy function. These terms will make inadmissible tours energetically unfavorable, if the constants B and C are chosen wisely, but at the cost of introducing many local minima into the energy function. We see two drawbacks to this scheme that decrease its effectiveness by comparison with the elastic net algorithm. First, when we make the approximation 4.12 we keep only nearest neighbor interactions between the cities and second, the constraint on admissible tours is enforced by a
bias in the energy function ("soft," Simic 1990) rather than explicitly in the calculation of the partition function ("hard," Simic 1990). Both these factors contribute to increase the number of local minima of the energy function, probably accounting for the diminished effectiveness of this scheme for more than 30 cities. Hopfield and Tank attempt to minimize $E[\{V_{ij}\}]$ by an algorithm that gives them solutions to the mean field equations for the $\{V_{ij}\}$. Again, as for the winner-take-all network, there are many possible solutions to the mean field equations (roughly corresponding, in the limit as $\beta \to \infty$, to the number of local minima of the energy function).
There is a simple intuitive way to see the connection between the matching energy function 4.1 and that of Hopfield and Tank in the small $\nu$ limit (Abbott, private communication). For small $\nu$ minimizing the energy will encourage each $x_i$ to be matched to one $y_j$ ($y_j \approx \sum_i V_{ij} x_i$). Substituting this in the second term gives us the Hopfield and Tank energy 4.13 with the squared distance. A similar argument suggests that if the second term is the absolute distance, rather than the distance squared, we would expect to obtain the exact Hopfield and Tank energy. This argument might be made rigorous by integrating out the $\{y_j\}$ from the energy 4.1 using the absolute distance instead of the squared distance.
5 Deformable Templates, Learning, Content Addressable Memories, and Development
We now briefly sketch other applications of these ideas to deformable templates, learning, content addressable memories, and theories of development. Computer simulations of these ideas are described elsewhere.
5.1 Deformable Templates and Matching. A deformable template (Yuille et al. 1989b) is a model described by a set of parameters so that variations in the parameter space correspond to different instances of the model. In this section we will reformulate deformable templates in terms of matching elements. Suppose we have a deformable template of features $\{y_j(a)\}$, with $j = 1, \ldots, N$, depending on a set of parameters $a$ and we want to match it to a set of features $\{x_i\}$ with $i = 1, \ldots, M$. We can define a cost function
(5.1) where $A_{ij}$ is a compatibility matrix ($A_{ij} = 1$ if the features labeled by $i$ and $j$ are totally compatible and $A_{ij} = 0$ if they are totally incompatible), the $\{V_{ij}\}$ are binary matching elements, $\lambda$ is a constant, and $E_p(a)$ imposes prior constraints on the parameters $a$. There is nothing sacred about
using the square norm, $n = 2$, as a measure of distance in 5.1; for some applications other norms are preferable. We can now impose constraints on the possible matches by summing over the configurations of $\{V_{ij}\}$. We require that for each $j$ there exists a unique $i$ such that $V_{ij} = 1$. Calculating the sum over these configurations gives an effective energy
(5.2)
where $n$ will typically be 2. Minimizing $E_{\mathrm{eff}}[a]$ with respect to $a$ corresponds to deforming the template in parameter space to find the best match. We can extend this from features to matching curves, or regions, by defining
(5.3)
where $\phi(x)$ is a compatibility field, which could represent the edge strength, between the properties of the template and those of the image. In the limit as $n \to \infty$ and $\lambda \to 0$ this becomes the form
$$E[y(a)] = \int \phi[y(a)]\, dy + E_p(a) \tag{5.4}$$
used by Yuille et al. (1989b). A special case comes from rigidly matching two-dimensional sets of features. Here the parameters correspond to the rotation and translation of one set of the features. In other words the energy is
$$E[\{V_{ij}\}, T, a] = \sum_{i,j} A_{ij} V_{ij} (x_i - T y_j - a)^2 + E_p(T, a) \tag{5.5}$$
where $T$ denotes the rotation matrix and $a$ the translation. If $N = M$, $A_{ij} = 1$, and $E_p = 0$ this reduces to the well-known least-squares matching problem. We could integrate out the $\{V_{ij}\}$ again and obtain an effective energy for this problem in terms of $T$ and $a$; however, it is not obvious how to minimize this, and there seem to be other techniques based on moments that are preferable.
5.2 Learning. We will briefly sketch how some of these ideas may be applied to learning and to content addressable memories. In a paper that has attracted some attention Poggio and Girosi (1989) have argued that learning should be considered as interpolation using least-squares minimization with a smoothing function. They then note that the resulting function can be written as a linear combination of basis functions that, for suitable choices of smoothness operator, can fall
off with distance and can even be chosen to be Gaussians (Yuille and Grzywacz 1989a) (see Section 3). This gives a direct link to learning theories using radial basis functions (Powell 1987). We argue, however, that smoothness is often not desirable for learning. If the function to be learned needs to make decisions, as is usually the case, then smoothness is inappropriate. In this case the types of models we have been considering would seem preferable. Suppose we have a set of input-output pairs $x_i \to y_i$. Poggio and Girosi (1989) obtain a function that minimizes
(5.6)
where $L$ is a smoothness measure. Instead we would suggest minimizing
$$E[\{V_i\}, y] = \sum_i V_i \{(x - x_i)^2 + (y - y_i)^2\} \tag{5.7}$$
with respect to $\{V_i\}$ and $y$. Computing the effective energy by summing over the $\{V_i\}$s with the constraint that $V_i = 1$ for only one $i$, we obtain
(5.8)
In the limit as $\beta \to \infty$ this corresponds to finding the nearest neighbor $x_i$ to $x$ and assigning the value of $y_i$ to it. As we reduce $\beta$ the range of interaction increases and the $x_j$ further away from $x$ also influence the decision. We can also prune the number of input-output pairs in a similar manner. We pick a hypothetical set of input-output pairs $w_j \to z_j$ and minimize
(5.9)
The $\{V_{ij}\}$ are summed over with the assumption that for each $i$ there is only one $j$ such that $V_{ij} = 1$. This forces the hypothetical input-output pairs to match the true input-outputs as closely as possible. We obtain
(5.10)
In related work Durbin and Rumelhart (personal communication) have applied the elastic net idea (without binary decision units) to solve clustering problems (given a set of data points $\{x_i\}$, find an elastic net with a smaller number of points that best fits it) but have not extended it to learning. There may also be a connection to a recent function interpolation algorithm of Omohundro (personal communication).
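The nearest-neighbor limit described for the learning energy 5.7 can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the paper's implementation. It assumes that summing the Gibbs weights over the configurations with exactly one $V_i = 1$ yields an effective energy of the form $-\frac{1}{\beta} \log \sum_i \exp(-\beta\{(x - x_i)^2 + (y - y_i)^2\})$. The fixed-point iteration used to minimize it over $y$, the scalar inputs, and the function names are choices made here for brevity.

```python
import numpy as np


def effective_energy(x, y, xs, ys, beta):
    """-(1/beta) * log sum_i exp(-beta * [(x - x_i)^2 + (y - y_i)^2])."""
    cost = (x - xs) ** 2 + (y - ys) ** 2
    shift = cost.min()  # subtract the minimum to stabilise the log-sum-exp
    return shift - np.log(np.sum(np.exp(-beta * (cost - shift)))) / beta


def soft_nearest_neighbour(x, xs, ys, beta, n_iter=200):
    """Fixed-point iteration y <- sum_i p_i y_i, with p_i the Gibbs weights."""
    cost_x = (x - xs) ** 2
    w = np.exp(-beta * (cost_x - cost_x.min()))
    y = np.dot(w, ys) / w.sum()  # start from the weights on x alone
    for _ in range(n_iter):
        cost = cost_x + (y - ys) ** 2
        p = np.exp(-beta * (cost - cost.min()))
        p /= p.sum()
        y = np.dot(p, ys)  # stationary point of the effective energy in y
    return y


xs = np.array([0.0, 1.0, 2.0, 3.0])  # stored inputs (made-up data)
ys = np.array([0.0, 1.0, 0.0, 1.0])  # stored outputs (made-up data)
for beta in (0.5, 5.0, 500.0):
    print(beta, soft_nearest_neighbour(1.2, xs, ys, beta))
```

For large $\beta$ the output approaches the stored $y_i$ of the nearest $x_i$; for small $\beta$ it blends several stored outputs, which is the widening range of interaction mentioned in the text.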
5.3 Content Addressable Memories. The basic idea of a content addressable memory is to store a set of vectors $\{y^\mu\}$ so that an input vector $y$ gets mapped to the closest $y^\mu$. An easy way to do this is once again to define matching elements $V_\mu$ and minimize
(5.11)
This is very similar to the winner-take-all network discussed in Section 2. Summing over all configurations with $\sum_\mu V_\mu = 1$ and calculating the mean fields gives
(5.12)
It is straightforward to define a network having these properties. Observe that convergence is guaranteed to a true memory, unlike most content addressable memories (which have local minima). Our memory requires approximately $(N + 1)M$ connections and $N$ "neurons" to store $N$ memories consisting of real valued vectors with $M$ components. By contrast the Hopfield storage recipe requires $N^2$ connections and $N$ "neurons" to store up to $0.15N$ binary valued vectors with $N$ components (Hopfield 1982; Amit 1989). The capacity of the Hopfield network can be improved to a maximum of $2N$ memories by using the optimal storage ansatz (Cover 1965; Gardner 1988). Redundancy could be built into our proposed network by using groups of neurons to represent each memory. The destruction of individual neurons would then minimally affect the effectiveness of the system.
5.4 Developmental Models. In an interesting recent paper Durbin and Mitchison (1990) propose that the spatial structure of the cortical maps in the primary visual cortex (Blasdel and Salama 1986) arises from a dimension reducing mapping $f$ from an $n$-dimensional parameter space $R^n$ (they choose $n = 4$ corresponding to retinotopic spatial position, orientation, and ocular dominance) to the two-dimensional surface of the cortex. Their model assumes that the mapping $f: R^n \to R^2$ should be as smooth as possible, thereby reducing the total amount of "wiring" in the system. This is imposed by requiring that the map minimizes the functional
$$E[f] = \sum_{x,y} |f(x) - f(y)|^p \tag{5.13}$$
where the sum is taken over neighboring points $x$ and $y$ in $R^n$ and $0 < p < 1$ (this range of values of $p$ is chosen to give good agreement between the resulting distribution of the parameters on the surface cortex).
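The effect of choosing an exponent below 1 can be seen on a toy example. This is a hedged sketch only: it assumes the functional sums $|f(x) - f(y)|^p$ over nearest neighbors on a square lattice, with $f$ taking scalar values, and the name `wiring_cost` and the two test maps are invented for illustration. It is not Durbin and Mitchison's algorithm.

```python
import numpy as np


def wiring_cost(f, p):
    """Sum of |f(x) - f(y)|**p over horizontally and vertically adjacent points."""
    f = np.asarray(f, dtype=float)
    dx = np.abs(np.diff(f, axis=0))
    dy = np.abs(np.diff(f, axis=1))
    return np.sum(dx ** p) + np.sum(dy ** p)


n = 16
# Both maps go from 0 to 1 along each row; the rows are identical.
ramp = np.tile(np.linspace(0.0, 1.0, n), (n, 1))                 # many small steps
step = np.tile((np.arange(n) >= n // 2).astype(float), (n, 1))   # one large jump

for p in (0.5, 1.0, 2.0):
    print(p, wiring_cost(ramp, p), wiring_cost(step, p))
```

With $p < 1$ the single jump is cheaper than the gradual ramp, with $p = 1$ the two cost the same, and with $p = 2$ the ramp is cheaper. A sub-linear exponent therefore tolerates a few large discontinuities in exchange for smoothness elsewhere.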
Durbin and Mitchison propose an elastic network algorithm analogous to the elastic net for the TSP (Durbin and Willshaw 1987). It is not clear, however, that their elastic net minimizes the energy function 5.13. We now use statistical techniques to relate the energy function 5.13 directly to an elastic net algorithm. This is similar, but not identical, to that used by Durbin and Mitchison. Like them we will illustrate the result on a mapping from a square lattice in $R^2$ to a lattice in $R$ (it is straightforward to generalize the method to higher dimensions). The square in $R^2$ consists of a set of points $(x_i, y_j)$ for $i = 1, \ldots, N$ and $j = 1, \ldots, N$. The lattice in $R$ has points $z_a$ for $a = 1, \ldots, M$. We once again define binary matching elements $V_{ija}$ such that $V_{ija} = 1$ if the point $(x_i, y_j)$ corresponds to $z_a$. We can now define a cost function
(5.14)
where $E_s[f(x, y)]$ imposes smoothness requirements on the mapping $f(x, y)$ and $\lambda$ is a constant. A typical form would be
$$E_s[f(x, y)] = \sum_{i,j} \{ |f(x_{i+1}, y_j) - f(x_i, y_j)|^p + |f(x_i, y_{j+1}) - f(x_i, y_j)|^p \}$$
(5.15)
It is straightforward to see that as $n \to \infty$ (and $\lambda \to 0$) the energy function $E[\{V_{ija}\}, f(x, y)]$ approaches that of Durbin and Mitchison (1990) (the function $f$ will map lattice points into lattice points and will satisfy a similar smoothness measure). Once again we obtain an effective energy function for $f(x, y)$ that will give rise to an elastic net algorithm (by steepest descent). The situation is exactly analogous to the long-range motion theory described in Section 3. We should probably impose the uniqueness matching condition in both directions (see Section 3), although it might be sufficient to require that each point in the lattice in $R$ receives at least one input from the lattice in $R^2$. Performing similar calculations to Section 3 gives
$$E_{\mathrm{eff}}[f(x, y)] = \cdots + |f(x_i, y_{j+1}) - f(x_i, y_j)|^p\,\bigr) \tag{5.16}$$
The algorithm proceeds by minimizing $E_{\mathrm{eff}}[f(x, y)]$ with respect to its values $f(x_i, y_j)$ on the lattice [clearly a continuum limit can be attained
by imposing a differential smoothing constraint on $f(x, y)$ and minimizing with respect to $f(x, y)$, as for the velocity field in Section 3]. The parameter $\beta$ can be varied to allow a coarse to fine approach. We have not simulated this algorithm so we cannot compare its performance to that of Durbin and Mitchison (1990). But the above analysis has shown the theoretical relation between their optimality criterion 5.13 and their mechanism (elastic nets) for satisfying it.
6 Conclusion
We have described a general method for modeling many matching problems in vision and neural network theory in terms of elastic networks and deformable templates with binary matching fields. Our models are typically defined in terms of an energy function that contains both continuous valued fields and binary fields. A probabilistic framework can be developed using the Gibbs distribution, which leads to an interpretation in terms of Bayes' theorem. Eliminating the binary field by partially computing the partition function, or by computing the marginal probability distributions of the continuous field, reduces our models to the standard elastic network form. Alternatively, eliminating the continuous variables often reduces our models to previous theories defined in terms of the binary fields only. Eliminating the binary fields is usually preferable since it allows us to impose global constraints on these fields. By contrast, when we eliminate the continuous variables global constraints must be imposed by energy biases, which leads to unnecessary local minima in the energy function, thereby making such theories less effective (unless global constraints are imposed by the methods used by Peterson and Soderberg 1989). It is desirable to put as many constraints as possible into the computation of the partition function, but too many constraints may make it too complicated to evaluate (as for the TSP). However, good choices of the prior constraints on the continuous fields enable us to impose some constraints by energy biases. It is interesting to consider biological mechanisms that could implement these algorithms and, in particular, allow for learning and content addressable memories. Preliminary work suggests that biologically plausible implementations using thresholds are possible, provided that the thresholds vary during learning or memory storage.
This suggests an alternative to the standard theory in which learning and memory storage are achieved by changes in synaptic strength. Interestingly a biologically plausible mechanism for altering thresholds for classical conditioning has been recently proposed by Tesauro (1988). Finally it would be interesting to compare the approach described in this paper to the work of Brockett (1988, 1990), which also uses analog systems for solving combinatorial optimization problems. His approach
involves embedding the discrete group of permutations into a Lie group. The solution can then be thought of as the optimal permutation of the initial data, and can be found by steepest descent in the Lie group space. This technique has been successfully applied to list sorting and other combinatorial problems.
Acknowledgments
I would like to thank the Brown, Harvard, and M.I.T. Center for Intelligent Control Systems for a United States Army Research Office grant number DAAL03-86-C-0171 and an exceptionally stimulating working environment. I would also like to thank Roger Brockett for his support and D. Abbott, R. Brockett, J. Clark, R. Durbin, D. Geiger, N. Grzywacz, D. Mumford, S. Ullman, and J. Wyatt for helpful conversations.
References
Anstis, S. M., and Ramachandran, V. S. 1987. Visual inertia in apparent motion. Vision Res. 27, 755-764.
Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge, England.
Bayes, T. 1763. An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc. 53, 370-418.
Bertero, M., Poggio, T., and Torre, V. 1987. Regularization of ill-posed problems. A.I. Memo 924. MIT A.I. Lab., Cambridge, MA.
Blake, A. 1983. The least disturbance principle and weak constraints. Pattern Recognition Lett. 1, 393-399.
Blake, A. 1989. Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction. PAMI 11(1), 2-12.
Blasdel, G. G., and Salama, G. 1986. Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature (London) 321, 579-585.
Brockett, R. W. 1988. Dynamical systems which sort lists, diagonalize matrices and solve linear programming problems. Proceedings of the 1988 IEEE Conference on Decision and Control. IEEE Computer Society Press, Washington, D.C.
Brockett, R. W. 1990. Least squares matching problems. J. Linear Algebra Appl. In press.
Burr, D. J. 1981a. A dynamic model for image registration. Comput. Graphics Image Process. 15, 102-112.
Burr, D. J. 1981b. Elastic matching of line drawings. IEEE Trans. Pattern Anal. Machine Intelligence PAMI 3(6), 708-713.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Press.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for cortical maps. Nature (London). In press.
Durbin, R., and Willshaw, D. 1987. An analog approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. L. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Fischler, M. A., and Elschlager, R. A. 1973. The representation and matching of pictorial structures. IEEE Trans. Comput. 22, 1.
Gardner, E. 1988. The space of interactions in neural network models. J. Phys. A: Math. Gen. 21, 257-270.
Geiger, D., and Girosi, F. 1989. Parallel and deterministic algorithms from MRFs: Integration and surface reconstruction. Artificial Intelligence Laboratory Memo 1224. Cambridge, M.I.T.
Geiger, D., and Yuille, A. 1989. A Common Framework for Image Segmentation. Harvard Robotics Laboratory Tech. Rep. No. 89-7.
Gelb, A. 1974. Applied Optimal Estimation. MIT Press, Cambridge, MA.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. PAMI 6, 721-741.
Grzywacz, N. M. 1987. Interactions between minimal mapping and inertia in long-range apparent motion. Invest. Ophthalmol. Vision 28, 300.
Grzywacz, N. M. 1990. The effects of past motion information on long-range apparent motion. In preparation.
Grzywacz, N. M., Smith, J. A., and Yuille, A. L. 1989. A computational theory for the perception of inertial motion. Proceedings IEEE Workshop on Visual Motion. Irvine.
Hinton, G. E. 1989. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp. 1, 143-150.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of a two-state neuron. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Horn, B. K. P. 1986. Robot Vision. MIT Press, Cambridge, MA.
Kass, M., Witkin, A., and Terzopoulos, D. 1987. Snakes: Active contour models. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Kienker, P. K., Sejnowski, T. J., Hinton, G. E., and Schumacher, L. E. 1986. Separating figure from ground with a parallel network. Perception 15, 197-216.
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Lipson, P., Yuille, A. L., O'Keefe, D., Cavanaugh, J., Taaffe, J., and Rosenthal, D. 1989. Deformable Templates for Feature Extraction from Medical Images. Harvard Robotics Laboratory Tech. Rep. 89-14.
Marr, D. 1982. Vision. Freeman, San Francisco.
Marroquin, J. 1987. Deterministic Bayesian estimation of Markovian random fields with applications to computational vision. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1091.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Pentland, A. 1987. Recognition by parts. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization problems - benchmark studies on TSP. LU TP 90-2. Manuscript submitted for publication.
Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1, 995-1019.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo 1140. M.I.T.
Powell, M. J. D. 1987. Radial basis functions for multivariate interpolation. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds. Clarendon Press, Oxford.
Ramachandran, V. S., and Anstis, S. M. 1983. Extrapolation of motion path in human visual perception. Vision Res. 23, 83-85.
Simic, P. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimization. NETWORK: Comp. Neural Syst. 1(1), 89-103.
Terzopoulos, D., Witkin, A., and Kass, M. 1987. Symmetry-seeking models for 3D object recognition. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Tesauro, G. 1988. A plausible neural circuit for classical conditioning without synaptic plasticity. Proc. Natl. Acad. Sci. U.S.A. 85, 2830-2833.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63-70.
Yuille, A. L. 1989. Energy functions for early vision and analog networks. Biol. Cybernet. 61, 115-123.
Yuille, A. L., and Grzywacz, N. M. 1988. A computational theory for the perception of coherent visual motion. Nature (London) 333, 71-74.
Yuille, A. L., and Grzywacz, N. M. 1989a. A mathematical analysis of the motion coherence theory. Intl. J. Comput. Vision 3, 155-175.
Yuille, A. L., and Grzywacz, N. M. 1989b. A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comp. 1, 334-347.
Yuille, A. L., Geiger, D., and Bülthoff, H. H. 1989a. Stereo Integration, Mean Field and Psychophysics. Harvard Robotics Laboratory Tech. Rep. 89-10.
Yuille, A. L., Cohen, D. S., and Hallinan, P. W. 1989b. Feature extraction from faces using deformable templates. Proceedings of Computer Vision and Pattern Recognition, San Diego. IEEE Computer Society Press, Washington, D.C.
Received 8 January 1990; accepted 1 February 1990.
Communicated by Nabil H. Farhat
An Optoelectronic Architecture for Multilayer Learning in a Single Photorefractive Crystal
Carsten Peterson*, Stephen Redfield, James D. Keeler, Eric Hartman
Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
*Present address: Department of Theoretical Physics, University of Lund, Solvegatan 14A, S-22362 Lund, Sweden.
Neural Computation 2, 25-34 (1990)
© 1990 Massachusetts Institute of Technology
We propose a simple architecture for implementing supervised neural network models optically with photorefractive technology. The architecture is very versatile: a wide range of supervised learning algorithms can be implemented including mean-field-theory, backpropagation, and Kanerva-style networks. Our architecture is based on a single crystal with spatial multiplexing rather than the more commonly used angular multiplexing. It handles hidden units and places no restrictions on connectivity. Associated with spatial multiplexing are certain physical phenomena, rescattering and beam depletion, which tend to degrade the matrix multiplications. Detailed simulations including beam absorption and grating decay show that the supervised learning algorithms (slightly modified) compensate for these degradations.
Most network models are based on "neurons," $V_i$, interconnected with synaptic strengths $T_{ij}$, and a local updating rule:
$$V_i = g\Bigl(\sum_j T_{ij} V_j\Bigr) \tag{1.1}$$
where $g(z)$ is a nonlinear gain function. The majority of neural network investigations are performed with software simulations, either with serial or parallel architectures. Many applications would benefit tremendously from custom-made hardware that would facilitate real-time operation. Neural network algorithms require a large number of connections (the matrix elements $T_{ij}$ of equation 1.1). Optics offers the advantage of making these connections with light beams. Several architectural proposals for optical implementations now exist (Psaltis and Farhat 1985; Soffer et al. 1986; Psaltis et al. 1982). Most deal with Hebbian learning (no hidden units) using either spatial light modulators (SLMs) or photorefractive crystals. The latter technology, in which the $T_{ij}$-elements are represented by holographic gratings, is particularly well suited for neural network implementations. The gratings decay naturally, and this can be exploited as a beneficial adaptive process. In Psaltis et al. (1982), a multilayer architecture of several photorefractive crystals was designed to implement the backpropagation algorithm with hidden units. In this letter, we present a single crystal architecture that is versatile enough to host a class of supervised learning algorithms, all of which handle hidden units. In contrast to other approaches we use spatial multiplexing (Anderson 1987; Peterson and Redfield 1988) rather than angular multiplexing of the interconnects. This spatial multiplexing implies rescattering and beam depletion effects for large grating strengths and large numbers of neurons. We demonstrate, on the simulation level, how the supervised learning models we consider implicitly take these effects into account by making appropriate adjustments.
Photorefractive materials can be used as dynamic storage media for holographic recordings. A recording takes place as follows: With the object (1) and reference (2) beam amplitudes defined as
$$E_1 = A_1 e^{-i\vec{k}_1 \cdot \vec{r}} \tag{1.2}$$
$$E_2 = A_2 e^{-i\vec{k}_2 \cdot \vec{r}} \tag{1.3}$$
the intensity pattern of the two-wave mixing takes the form¹
(1.4)
where intensities $I_i = |A_i|^2$ have been introduced. The refractive index is locally changed in the photorefractive material proportional to this periodic intensity pattern. The so-called grating efficiency $\eta$ is to a good approximation proportional to the incoming beam intensities
(1.5)
¹Neglecting a constant phase shift due to the relative phases of the beams.
Consider a crystal where grating strengths $\eta_{ij}$ have been created with the interference of equation 1.4. Let a set of light beams $E_i = A_i e^{-i\vec{k}_i \cdot \vec{r}}$ impinge on these gratings. The outgoing electric fields can be written as
$$E_j = A_j e^{-i\vec{k}_j \cdot \vec{r}} = \sum_i \eta_{ij}^{1/2} e^{-i\vec{\kappa}_{ij} \cdot \vec{r}} A_i e^{-i\vec{k}_i \cdot \vec{r}} = \sum_i \eta_{ij}^{1/2} A_i e^{-i\vec{k}_j \cdot \vec{r}} \tag{1.6}$$
where $\vec{\kappa}_{ij} = \vec{k}_j - \vec{k}_i$. That is, a matrix multiplication of amplitudes
$$A_j = \sum_i \eta_{ij}^{1/2} A_i$$
(1.7) is performed by the photorefractive medium. Thus, identifying the amplitudes $A_i$ with the neuronic values $V_i$, and $\eta_{ij}^{1/2}$ with the connection strengths $T_{ij}$, the matrix multiplication of equation 1.1 can be implemented. Correspondingly, equation 1.5 implements a Hebb-like learning rule.
The reconstruction, or readout, process is partially destructive. The efficiency decays exponentially with the readout duration, for a given read energy density. In the past this grating decay has been a problem because the use of the neural network would gradually fade the recordings. A technique has recently been discovered for controlling this destruction rate by choosing appropriate applied fields and polarizations of the object and reference beams³ (Redfield and Hesselink 1988).
Equation 1.7 can be implemented either with angular (Psaltis et al. 1982) or spatial multiplexing (Anderson 1987; Peterson and Redfield 1988) techniques. In Redfield and Hesselink (1988) it was observed that at most 10-20 gratings can be stored at the same localized region with reasonable recall properties using angular multiplexing. For this reason we have chosen the spatial multiplexing approach, which corresponds to direct imaging (see Fig. 4). With direct imaging, the intensities from the incoming plane of pixels become depleted and rescattered when passing through the crystal, causing the effective connection strengths to differ from the actual matrix elements $T_{ij}$ (see Fig. 1). For relatively small systems and grating ($\eta^{1/2}$) sizes these effects are negligible; in this case the identification of $T_{ij}$ with $\eta_{ij}^{1/2}$ is approximately valid. However, these effects are likely to pose a problem for realistic applications with large numbers of neurons.
We estimate the rescattering and beam depletion effects on $I_j^{\mathrm{out}} = A_j^2$ (equation 1.7) by explicit simulation of the reflection [coefficient $\eta_{ij}$] and transmission [coefficient $(1 - \eta_{ij})$] of the intensity arriving at each grating.⁴ In Figure 1 we show the emergent light given random $\eta_{ij}^{1/2}$ values on the interval [0, 0.1]. The data represent an average of different random input patterns of intensity strengths in the interval [0, 1]. The major effects are in the "end" of the crystal. Clearly, if the $\eta_{ij}^{1/2}$ were set according to a pure (unsupervised) Hebbian rule, a large network would produce incorrect answers due to the depletion and rescattering effects. We demonstrate below how to overcome this problem with supervised learning algorithms together with a temperature gradient procedure for the output amplifiers.
In our architecture, the input and output neurons are planes of $n^2$ pixels and the connection matrix is a volume that can map at most $n^3$ connections. If we want full connectivity we are thus short one dimension (Psaltis et al. 1982). The volume can serve only $n^{3/2}$ neurons with full connectivity so we need an $n^{3/2} \to n^{3/2}$ mapping between the SLMs. There are many ways to accomplish this. We have chosen to use multiple pixels per neuron. The input plane is organized as follows: each of the $n$ rows contains $\sqrt{n}$ neurons replicated $\sqrt{n}$ times. In the output plane, each row contains $\sqrt{n}$ replicas of the sequence $i, i + 1, \ldots, i + \sqrt{n} - 1$, where $i = 1$ for the first row, $i = 1 + \sqrt{n}$ for the second row, $i = 1 + 2\sqrt{n}$ for the third row, etc. By deliberately omitting selected elements, architectures with limited (e.g., layered) connectivity can be obtained.
³This technique yields a recording half-life of O(10 hrs) for continuous readout.
⁴It is sufficient for our purposes to investigate this effect on a slice of the volume.
[Figure 1 plot. Vertical axis: Emergent Light (0.20 to 0.55). Horizontal axis: Distance into Crystal (0 to 500).]
Figure 1: Emergent light ($I^{\mathrm{out}}$) as function of distance (number of grating sites) into the crystal ($i$) for random $T_{ij}$ values and input intensities (see text).
We begin by describing how to implement the mean-field-theory (MFT) learning algorithm (Peterson and Hartman 1988). We then deal with feedforward algorithms. The MFT algorithm proceeds in two phases followed by a weight update:
1. Clamped phase. The visible units are clamped to the patterns to be learned. The remaining units then settle (after repeated updating) according to
$$V_i = g\Bigl(\frac{1}{T} \sum_j T_{ij} V_j\Bigr) \tag{1.8}$$
where the "temperature" $T$ sets the strength of the gain of the gain function $g(z) = \tfrac{1}{2}[1 + \tanh(z)]$.
2. Free phase. The input units alone are clamped and the remaining units settle according to
$$V_i' = g\Bigl(\frac{1}{T} \sum_j T_{ij} V_j'\Bigr) \tag{1.9}$$
In each of these phases, the units settle with a decreasing sequence of temperatures, $T_0 > T_1 > \cdots > T_{\mathrm{final}}$. This process is called simulated annealing (Rumelhart and McClelland 1986). Equations 1.8 and 1.9 are normally updated asynchronously. We have checked that convergence is also reached with synchronous updating, which is more natural for an optical implementation. After the two phases, updating (or learning) takes place with the rule
$$\Delta T_{ij} = \beta (V_i V_j - V_i' V_j') \tag{1.10}$$
where $\beta$ is the learning rate. As it stands, equation 1.10 requires storing the solutions to equations 1.8 and 1.9 and subtracting, neither of which is natural in an optical environment. There are no fundamental obstacles, however, to doing intermediate updating. That is, we update with
$$\Delta T_{ij} = \beta V_i V_j \tag{1.11}$$
after the clamped phase and
$$\Delta T_{ij} = -\beta V_i' V_j' \tag{1.12}$$
after the free phase. We have checked performance with this modification and again find very little degradation. The grating strengths $\eta_{ij}^{1/2}$ must necessarily be positive numbers less than 1. However, most neural network algorithms require that $T_{ij}$ can take both positive and negative values, constituting a problem for both optical and electronic implementations. The most straightforward of several possible solutions (Peterson et al. 1989) to this problem is to have two sets of positive gratings, one for positive enforcement ($T_{ij}^+$) and one for negative ($T_{ij}^-$). The negative sign is then enforced electronically with a subtraction and equation 1.1 reads
$$V_i = g\Bigl(\sum_j T_{ij}^+ V_j - \sum_j T_{ij}^- V_j\Bigr) \tag{1.13}$$
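The combination of intermediate updating with separate positive and negative gratings can be sketched in software. This is an illustrative sketch under assumptions, not the authors' simulation: the clamped-phase update is written only to a positive array and the free-phase update only to a negative array, units are updated synchronously, self-connections are omitted, and clipping to [0, 1] is a crude stand-in for the physical bound on grating strengths. The names `settle` and `mft_step`, the network size, and all numerical values are hypothetical.

```python
import numpy as np


def g(z):
    """Gain function g(z) = (1 + tanh z) / 2."""
    return 0.5 * (1.0 + np.tanh(z))


def settle(t_plus, t_minus, v, clamped, temperatures, sweeps=20):
    """Synchronous updates V = g((T+ V - T- V) / T) on the unclamped units."""
    v = v.copy()
    for temp in temperatures:  # decreasing temperature sequence
        for _ in range(sweeps):
            new_v = g((t_plus @ v - t_minus @ v) / temp)
            v = np.where(clamped, v, new_v)  # clamped units keep their values
    return v


def mft_step(t_plus, t_minus, pattern, is_input, is_visible, beta, temperatures):
    """One clamped phase and one free phase, each followed by its own update."""
    n = pattern.shape[0]
    off_diag = 1.0 - np.eye(n)  # assumption: no self-connections
    start = np.where(is_visible, pattern, 0.5)
    v = settle(t_plus, t_minus, start, is_visible, temperatures)
    # Clamped phase strengthens only the positive gratings.
    t_plus = np.clip(t_plus + beta * np.outer(v, v) * off_diag, 0.0, 1.0)
    start = np.where(is_input, pattern, 0.5)
    v_free = settle(t_plus, t_minus, start, is_input, temperatures)
    # Free phase strengthens only the negative gratings.
    t_minus = np.clip(t_minus + beta * np.outer(v_free, v_free) * off_diag, 0.0, 1.0)
    return t_plus, t_minus


n_units = 6
is_input = np.array([True, True, False, False, False, False])
is_visible = np.array([True, True, False, False, False, True])  # inputs plus one output
pattern = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
t_plus, t_minus = mft_step(np.zeros((n_units, n_units)), np.zeros((n_units, n_units)),
                           pattern, is_input, is_visible, 0.1, [4.0, 2.0, 1.0, 0.5])
print(t_plus - t_minus)  # effective signed weights after one pattern
```

Because both arrays only ever grow, the signed weight $T_{ij}^+ - T_{ij}^-$ can still move in either direction while each stored quantity stays non-negative.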
In the modified MFT learning algorithm, the adjustment of equation 1.11 is always positive while the adjustment of equation 1.12 is always negative. So the clamped phase need only affect positive weights and the free phase need only affect negative weights. In Figure 2, generic read and write cycles for MFT are shown (only a slice of the volume is
Carsten Peterson et al.
[Figure 2 panel labels: READ CYCLE (production); WRITE CYCLE (learning). Legend: electronic subtraction; sigmoid amplifier; p1 = phase 1; p2 = phase 2.]
Figure 2: (a) An MFT read (production) cycle. (b) An MFT write (learning) cycle.

shown). As can be seen, each connection strength is represented in the crystal by two gratings, $T_{ij}^{+}$ and $T_{ij}^{-}$. Thus $n$ neurons require $2n^2$ connections (for full connectivity). In the read (production) cycle, the reference beam is iteratively processed through the crystal until it has settled according to equation 1.8 (or equation 1.9). The write (learning) cycle (Fig. 2) is slightly more complicated and differs between the clamped and free phases. In the clamped phase, the beam again settles according to equation 1.8. It is then replicated as two
beams, “reference” and “object”. The two beams impinge simultaneously on the crystal with the object beam hitting only the $T_{ij}^{+}$ columns. The interconnect weights are then adjusted where the beams cross. The free phase proceeds in the same way except that the object beam hits only the $T_{ij}^{-}$ columns.

We now briefly discuss the implementation of feedforward models, again using a single crystal with spatial multiplexing. Optical implementations of the backpropagation (BP) algorithm have been investigated elsewhere (Psaltis et al. 1987), but these investigations have focused mainly on angular multiplexing with a multi-crystal architecture. We restrict the discussion to three-layer networks (one hidden layer) with input-to-hidden and hidden-to-output connections. For such networks, symmetric feedback output-to-hidden connections are required for BP (as they are in MFT). The neurons are arranged on the SLMs in layers. We denote the input-to-hidden weights as $T_{jk}$, hidden-to-output as $W_{ij}$, and output-to-hidden as $W_{ji} = W_{ij}$. We have developed a procedure for implementing BP in the crystal such that the updates to the $W_{ij}$ weights are exact, and the updates to the $T_{jk}$ weights are correct to first order in the learning rate $\beta$. Hence, for learning rates small enough compared with the magnitude of $W_{ij}$ and $g'(h_j)$, this procedure is faithful to the exact BP algorithm. While space does not allow a detailed description (see Peterson et al. 1989), the procedure entails a two-stage backward pass analogous to the MFT modification of equations 1.11 and 1.12: for each pattern, the set of “positive” (“negative”) gratings are written with the weight change component due to the positive (negative) term in the error expression. Other learning algorithms for feedforward networks, such as the Kanerva associative memory (Kanerva 1986) and the Marr and Albus models of the cerebellum (Keeler 1988), can be implemented in a very similar (and even simpler) manner.
We have conducted simulations of the modified MFT and BP learning algorithms on the spatially multiplexed architecture. In addition to rescattering, beam depletion, and double columns of weights, the simulations contained the following ingredients:

Temperature gradient. Beam depletion (see Fig. 1 and absorption, below) is the one effect in the crystal for which the MFT and BP algorithms were not able to automatically compensate. We found that we could counterbalance the asymmetry of the emergent light by varying the gain (or inverse temperature) across the crystal. The gain increases with depth into the crystal. Without this technique, none of our simulations were successful.

Absorption. The crystal absorbs light, both in read and write phases, exponentially with depth into the crystal.

Grating decay. During continuous exposure to illumination, the crystal gratings decay exponentially in time.
Simulations were performed for three different problems: random mappings of random (binary) vectors, the exclusive-or (XOR) or parity problem, and the 6 x 6 mirror symmetry problem (Peterson and Hartman 1988). Both MFT and BP successfully learned all three problems in simulations. In Figure 3 we show the results for the mirror symmetry problem using 12 hidden units and 36 training patterns. As can be seen from this figure, the supervised learning algorithms have the expected property of adjusting the weights such that the various physical effects are taken care of.

[Figure 3 plot: horizontal axis, Epochs (0 to 100); vertical axis ticks from 70 to 105.]
Figure 3: Learning performance of MFT and BP on the 6 x 6 mirror symmetry problem (horizontal axis is number of training epochs).

The system configuration has two principal optical paths, a reference path (1) and an object path (2) (see Fig. 4). Each path has a spatial filter, a beam splitter, an SLM, and an imaging lens system. The object path ends with a CCD array. The photorefractive crystal is SBN and an argon ion laser is used as a coherent light source. Thresholding ($\frac{1}{2}(1 + \tanh(z))$) and loading of the SLMs take place in E electronically. Both mean-field-theory and backpropagation learning algorithm implementations have distinctive read (a1 plus a2) and write (a2 plus b) phases that use this generic architecture.

[Figure 4 labels: Laser; SLM; CCD; Neuron inputs; Neuron outputs; Interconnect Volume; Path a2.]
Figure 4: System configuration.
References

Anderson, D. 1987. Adaptive interconnects for optical neuromorphs: Demonstrations of a photorefractive projection operator. In Proceedings of the IEEE First International Conference on Neural Networks III, 577.
Kanerva, P. 1986. Sparse Distributed Memory. MIT Press/Bradford Books, Cambridge, MA.
Keeler, J. D. 1988. Comparison between Kanerva’s SDM and Hopfield-type neural networks. Cognitive Science 12, 299.
Peterson, C. and Hartman, E. 1988. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Peterson, C. and Redfield, S. 1988. A novel optical neural network learning algorithm. In Proceedings of the ICO Topical Meeting on Optical Computing, J. W. Goodman, P. Chavel, and G. Roblin, eds. SPIE, Bellingham, WA, 485-496.
Peterson, C., Redfield, S., Keeler, J. D., and Hartman, E. 1989. Optoelectronic implementation of multilayer neural networks in a single photorefractive crystal. MCC Tech. Rep. No. ACT-ST-146-89 (to appear in Optical Engineering).
Psaltis, D. and Farhat, N. 1985. Optical information processing based on associative memory model of neural nets with thresholding and feedback. Optics Letters 10, 98.
Psaltis, D., Wagner, K., and Brady, D. 1987. Learning in optical neurocomputers. Proceedings of the IEEE First International Conference on Neural Networks III, 549; Wagner, K., and Psaltis, D. 1987. Multilayer optical learning networks. Applied Optics 3, 5061.
Redfield, S. and Hesselink, B. 1988. Enhanced nondestructive holographic readout in SBN. Optics Letters 13, 880.
Rumelhart, D. E. and McClelland, J. L., eds. 1986. Parallel Distributed Processing: Explorations of the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.
Soffer, B. H., Dunning, G. J., Owechko, Y., and Marom, E. 1986. Associative holographic memory with feedback using phase conjugate mirrors. Optics Letters 11, 118.
Received 18 August 1989; accepted 21 December 1989.
Communicated by Joshua Alspector and Gail Carpenter
VLSI Implementation of Neural Classifiers
Arun Rao, Mark R. Walker, Lawrence T. Clark, L. A. Akers, R. O. Grondin
Center for Solid State Electronics Research, Arizona State University, Tempe, AZ 85287-6206 USA
The embedding of neural networks in real-time systems performing classification and clustering tasks requires that models be implemented in hardware. A flexible, pipelined associative memory capable of operating in real-time is proposed as a hardware substrate for the emulation of neural fixed-radius clustering and binary classification schemes. This paper points out several important considerations in the development of hardware implementations. As a specific example, it is shown how the ART1 paradigm can be functionally emulated by the limited resolution pipelined architecture, in the absence of full parallelism.

1 Introduction
The problem of artificial pattern recognition is generally broken down into two sub-tasks: abstraction and classification. The abstraction task preprocesses the raw data from the sensors into a form suitable for the classifier (Lippmann 1987), which takes the converted (and perhaps compressed) data and makes a decision as to the nature of the input pattern. The classification task is related to the concept of clustering, which refers to the grouping of patterns into "clusters," which describe the statistical characteristics of the input. The resurgence of interest in neural-network-based approaches to the classification problem has generated several paradigms based (to a greater or lesser degree) on the human nervous system. All share the properties of being highly parallel and having a multitude of simple computational elements. In addition, some are also fault-tolerant. The parallel nature of neural network classifiers makes them potentially capable of high-speed performance. However, most such classifiers exist only as computer simulations and are consequently only as parallel as the machines they run on. Neural network schemes can hence be put to efficient practical use only if good hardware implementation schemes are developed.

Neural Computation 2, 35-43 (1990)
© 1990 Massachusetts Institute of Technology
2 Neural Network Classifiers
Similarities in the computing architectures of neurally inspired clustering and classification schemes suggest that a single, flexible VLSI approach would suffice for the real-time implementation of many different algorithms. In most paradigms, the input is applied in the form of a finite-length vector. The inner product of the weights and the input is formed, generating a vector of activations representing the scores of best match calculations or projections onto orthogonal basis vectors. Subsequent processing using thresholding and/or lateral inhibition may be employed to establish input membership in classes that are specified a priori by the user, or statistical clusters and orthogonal features detected by self-organizing adaptation algorithms. Feedback may be utilized to optimize existing class detectors or to add new ones. (See Lippmann 1987 for a comprehensive review.) Common neural net classifiers that operate on binary input vectors include the Hopfield net, the Hamming net, and ART1. Other classifiers are theoretically capable of accepting continuous input, but actual implementations usually represent continuous quantities in binary form, due to the ease with which Boolean logic may be applied for the calculation of most Euclidean distance metrics. ART1 differs substantially, since it provides mechanisms for developing class detectors in a self-organizing manner. If class samples are assumed normally distributed with equal variance, fixed-radius clustering of binary vectors may approach the accuracy of parametric Bayesian classifiers if a means of adjusting iteratively the initial location of cluster centers is provided (Kohonen 1988). The normal distribution assumption may be overly restrictive. Classifier algorithms employing a finite number of clusters of fixed radius will be suboptimal for non-Gaussian sample classes.
Multilayer perceptrons employing hyperplanes to form arbitrarily complex decision regions on the multidimensional input space are better suited for such situations (Huang and Lippmann 1988). The next section describes a hardware implementation of a binary classification and clustering scheme which is functionally equivalent to any binary neural-network-based scheme.

3 A Pipelined Associative Memory Classifier
A pipelined associative memory that functions as a general minimum-distance binary pattern classifier has been designed and constructed in prototype form (Clark and Grondin 1989). This function is similar to that performed by neural network classifiers. The prototype device implements clustering based on the Hamming distance between input patterns. The basic architecture, however, could support any distance metric
that is computable from a comparison between each stored exemplar and an input bit-pattern.

The nucleus of the associative memory is a pipeline of identical processing elements (PEs), each of which performs the following functions:

1. Comparison of the present input to the stored exemplar.
2. Calculation of the distance based on this comparison.
3. Gating the best matching address and associated distance onward.

Each input vector travels downward through the pipeline with its associated best matching address and distance metric. The output of the pipeline is the PE address whose exemplar most closely matches the input vector and the associated distance metric. In the event of identical Hamming distances the most recent address is preserved. This, in combination with the nonaddressed writing scheme used, means that inputs are clustered with the most recently added cluster center. Writing is a nonaddressed operation. Each PE has an associated “used” bit, which is set at the time the location is written. Figure 1 is a block diagram of a single PE as implemented in the prototype (Clark 1987), and illustrates the essential components and data flow of the architecture. The write operation, which may be interspersed with compare operations without interrupting the pipeline flow, writes the first uninitialized location encountered. The operation (here a “write”), which flows down the pipeline with the data, is toggled to a “compare.” In this manner the error condition that the pipeline is out of uninitialized locations is flagged naturally by a “write” signal exiting the bottom of the pipe. Finally, a write generates a perfect match, so that the PE chosen is indicated by the best matching location as in a compare operation. Input data are only compared with initialized locations by gating the compare operation with the “used” bit.

The pipelined CAM bears little resemblance to conventional CAM architectures, where in the worst case, calculation of the best matching location could require 2N cycles for N bits. Some recent implementations of such devices are described by Wade and Sodini (1987) and Jones (1988). Blair (1987), however, describes an independently produced architecture that does bear significant resemblance to the pipelined CAM design.
Here, the basic storage element is also a circular shift-register. Additional similarities are that uninitialized locations are masked from participating in comparisons and writing is nonaddressed.

[Figure 1 labels: this PE address; best match address; serial data; 9 bit shift; control in.]
Figure 1: Processing element block diagram.

The classifier consists of the pipelined content-addressable memory with appropriate external logic to provide for comparison of the distance metric output with some threshold, and a means for feeding the output back to the input. (This threshold parameter performs a function identical to that of the “vigilance parameter” used by the ART1 network described by Carpenter and Grossberg (1987a).) If this threshold is exceeded, the input vector does not adequately match any of the stored exemplars and must be stored as the nucleus of a new cluster. The input vector (which constitutes one of the outputs of the pipeline along with the distance metric and the best matching PE address) is fed back. The input stream is interrupted for one cycle, while the fed-back vector is written to the first unused PE in the pipeline. This effectively constitutes an unsupervised "learning" process
that proceeds on a real-time basis during operation. It should be noted that this learning process is functionally identical to that performed by the unsupervised binary neural classifiers mentioned in Section 2 when they encode an input pattern in their weights. Figure 2 illustrates the overall classifier structure.
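The compare, write, and threshold-feedback behavior described in this section can be summarized in a short software sketch. It is a sequential model of the function only, written under assumptions: it does not model the timing of the pipeline, and the class and method names and the threshold value are hypothetical. The 16 PEs and 9-bit vectors follow the prototype dimensions given in Section 3.1.

```python
class ProcessingElement:
    """One pipeline stage: a stored exemplar plus its "used" bit."""

    def __init__(self, address):
        self.address = address
        self.exemplar = None
        self.used = False


def hamming(a, b):
    """Number of bit positions in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class PipelineClassifier:
    """Functional model of the best-match pipeline with threshold feedback."""

    def __init__(self, length, threshold):
        self.pes = [ProcessingElement(k) for k in range(length)]
        self.threshold = threshold  # hypothetical distance threshold

    def compare(self, vector):
        """Return (best address, distance); on ties the later PE wins."""
        best_addr, best_dist = None, None
        for pe in self.pes:          # the vector travels down the pipeline
            if not pe.used:          # compare is gated by the "used" bit
                continue
            dist = hamming(vector, pe.exemplar)
            if best_dist is None or dist <= best_dist:
                best_addr, best_dist = pe.address, dist
        return best_addr, best_dist

    def write(self, vector):
        """Nonaddressed write into the first unused PE; returns its address."""
        for pe in self.pes:
            if not pe.used:
                pe.exemplar, pe.used = tuple(vector), True
                return pe.address
        # Corresponds to a "write" signal exiting the bottom of the pipe.
        raise OverflowError("no uninitialized locations left in the pipeline")

    def classify(self, vector):
        """Cluster the input, storing it as a new center if nothing is close."""
        addr, dist = self.compare(vector)
        if addr is None or dist > self.threshold:
            return self.write(vector), 0  # a write yields a perfect match
        return addr, dist


clf = PipelineClassifier(length=16, threshold=2)
a = clf.classify((0, 0, 0, 0, 0, 0, 0, 0, 0))  # empty pipeline: new center
b = clf.classify((0, 0, 0, 0, 0, 0, 0, 1, 1))  # within the threshold of PE 0
c = clf.classify((1, 1, 1, 1, 1, 1, 1, 1, 1))  # too far: new center in PE 1
d = clf.classify((1, 1, 1, 1, 1, 1, 1, 1, 0))  # closest to PE 1
```

In this model the loop over the PEs plays the role of the pipeline stages, so one call to `classify` corresponds to one vector's complete trip through the hardware followed, when needed, by the one-cycle feedback write.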
[Figure 2 labels: Input data vectors; input select MUX; Operation; Best match CAM pipeline; Quality metric out; Input data feedback; Output vector; Control.]
Figure 2: Classifier structure.
Classification is not performed in parallel as it would be in a true parallel realization of an ANN. However, the pipelining achieves a sort of “temporal” parallelism by performing comparison operations on many input vectors simultaneously. This operation has no real anthropomorphic equivalent, but results in extremely high throughput at the expense of an output latency period that is proportional to the length of the pipeline. Architectural enhancements and overhead processing necessary to implement specific neural classifiers are facilitated by the modular nature of the device. The removal of infrequently referenced cluster centers, for example, is accomplished by the addition of a bank of registers to the control section of Figure 2. Each register would be incremented each time its associated cluster center was referenced, and would be decremented occasionally. If a register reaches zero, its associated PE would have its “used” bit toggled off. Thus relatively unused cluster centers can be eliminated, freeing space for those more often referenced (which are hopefully more indicative of the present data). This effectively emulates the function performed by the weight decay and enhancement mechanisms of ART1. In addition, varying cluster radii would be supported by the addition of a register to each PE to control the replacement of the previous best matching address. Replacement would occur only in the case that the input was within the cluster radius indicated.

3.1 The Prototype. The prototype device was constructed using the MOSIS 3 μm design rule CMOS process. Due to testing limitations, static logic was used throughout. The prototype device consists of 16 PEs handling 9-bit input vectors. Chips can be cascaded to yield longer pipelines. Longer pipelines result in higher throughput, but are accompanied by an increase in the latency period. A photomicrograph of the prototype chip is shown in Figure 3.
The datapath logic of the device was found to be the limiting factor in the performance, and was tested up to 35 MHz. The corresponding pipeline bit throughput is 18.5 Mbits/sec. The latency was thus approximately 500 nsec/PE.
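Returning to the register bank proposed earlier in this section for removing infrequently referenced cluster centers, its bookkeeping can be sketched as follows. This is an illustration only: the class name, the initial count given to a newly written center, and the rule that every active register is decremented in the same decay step are assumptions, since the text specifies only that registers are incremented on reference, decremented occasionally, and clear the “used” bit on reaching zero.

```python
class UsageRegisters:
    """Bank of per-PE counters that ages out rarely referenced centers."""

    def __init__(self, length, initial=1):
        self.counts = [0] * length
        self.used = [False] * length
        self.initial = initial  # hypothetical starting count for a new center

    def on_write(self, address):
        """A cluster center was written: set its "used" bit and its count."""
        self.used[address] = True
        self.counts[address] = self.initial

    def on_reference(self, address):
        """A compare selected this center as the best match."""
        self.counts[address] += 1

    def decay(self):
        """Occasional decrement; returns addresses whose "used" bit cleared."""
        freed = []
        for address, in_use in enumerate(self.used):
            if not in_use:
                continue
            self.counts[address] -= 1
            if self.counts[address] <= 0:
                self.used[address] = False  # location may now be rewritten
                freed.append(address)
        return freed


regs = UsageRegisters(length=4)
regs.on_write(0)
regs.on_write(1)
regs.on_reference(0)
regs.on_reference(0)
first = regs.decay()   # PE 1 was never referenced, so it is freed first
second = regs.decay()
third = regs.decay()   # PE 0 survives longer because it was referenced
```

How often `decay` runs relative to the input rate sets how quickly stale centers disappear; that ratio is a design parameter the text leaves open.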
4 The Effect of Limited Parameter Resolution on Classifier Emulation

Details regarding the implementation of specific neural models using the pipelined CAM may be found in Rao et al. (1989). In this section we seek to analyze the effects resulting from the representation of continuous network quantities with discrete binary vectors. Specifically considered is the ART1 model. Adaptive Resonance Theory (ART) is a neural network-based clustering method developed by Carpenter and Grossberg (1987a,b). Its inspiration is neurobiological and its component parts are
intended to model a variety of hierarchical inference levels in the human brain. Neural networks based upon ART are capable of the following:

1. "Recognizing" patterns close to previously stored patterns according to some criterion.
2. Storing patterns which are not close to already stored patterns.

Figure 3: Prototype chip.
An analysis performed by Rao et al. (1989) shows that the number of bits of resolution required of the bottom-up weights is given by

This is a consequence of the inverse coding rule. For most practical applications, it is impossible to achieve this resolution level in hardware. This, combined with the fact that ART1 requires full connectivity between input and classification layers, makes a direct implementation of ART1 networks impossible to achieve with current technology. The most obvious method of getting around this problem is to sacrifice parallelism to facilitate implementation. If this is done, and if in addition the inverse coding rule is eliminated by using an unambiguous distance metric (Rao et al. 1989), the ART1 network reduces to a linear search best-match algorithm functionally and structurally equivalent to the associative memory described in the previous section. The pipelined associative memory thus represents an attractive state-of-the-art method of implementing ART1-like functionality in silicon until such time that technology allows the higher degree of parallelism required for direct implementation.

5 Conclusion
The pipelined associative memory described can emulate the functions of several neural network-based classifiers. It does not incorporate as much parallelism as neural models but compensates by using an efficient, well-understood pipelining technique that allows matching operations on several input vectors simultaneously. Neural classifiers are potentially capable of high speed because of inherent parallelism. However, the problems of high interconnectivity and (especially in the case of ART1) of high weight resolution preclude the possibility of direct implementation for nontrivial applications. In the absence of high connectivity, neural classifiers reduce to simple linear search classification mechanisms very similar to the associative memory chip described.

Acknowledgments

L. T. Clark was supported by the Office of Naval Research under Award N00014-85-K-0387.
References

Blair, G. M. 1987. A content addressable memory with a fault-tolerance mechanism. IEEE J. Solid-state Circuits SC-22, 614-616.
Carpenter, G. A., and Grossberg, S. 1987a. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision, Graphics Image Process. 37, 54-115.
Carpenter, G. A., and Grossberg, S. 1987b. ART2: Self-organization of stable category recognition codes for analog input patterns. Appl. Opt.: Special Issue on Neural Networks.
Clark, L. T. 1987. A novel VLSI architecture for cognitive applications. Masters Thesis, Arizona State University.
Clark, L. T., and Grondin, R. O. 1989. A pipelined associative memory implemented in VLSI. IEEE J. Solid-state Circuits 24, 28-34.
Huang, W. Y., and Lippmann, R. P. 1988. Neural net and traditional classifiers. Neural Information Processing Systems, pp. 387-396. American Institute of Physics, New York.
Jones, S. 1988. Design, selection and implementation of a content-addressable memory for a VLSI CMOS chip architecture. IEEE Proc. 135, 165-172.
Kohonen, T. 1988. Learning vector quantization. Abstracts of the First Annual INNS Meeting, p. 303. Pergamon Press, New York.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4-22.
Rao, A., Walker, M. R., Clark, L. T., and Akers, L. A. 1989. Integrated circuit emulation of ART1 Networks. Proc. First IEEE Conf. Artificial Neural Networks, 37-41. Institution of Electrical Engineers, London.
Wade, J., and Sodini, C. 1987. Dynamic cross-coupled bit-line content addressable memory cell for high-density arrays. IEEE J. Solid-state Circuits SC-22, 119-121.
Received 31 March 1989; accepted 21 December 1989.
Communicated by Richard A. Andersen
Coherent Compound Motion: Corners and Nonrigid Configurations
Steven W. Zucker*, Lee Iverson, Robert A. Hummel†
Computer Vision and Robotics Laboratory, McGill Research Centre for Intelligent Machines, McGill University, Montréal, Québec H3A 2A7 Canada
Consider two wire gratings, superimposed and moving across each other. Under certain conditions the two gratings will cohere into a single, compound pattern, which will appear to be moving in another direction. Such coherent motion patterns have been studied for sinusoidal component gratings, and give rise to percepts of rigid, planar motions. In this paper we show how to construct coherent motion displays that give rise to nonuniform, nonrigid, and nonplanar percepts. Most significantly, they also can define percepts with corners. Since these patterns are more consistent with the structure of natural scenes than rigid sinusoidal gratings, they stand as interesting stimuli for both computational and physiological studies. To illustrate, our display with sharp corners (tangent discontinuities or singularities) separating regions of coherent motion suggests that smoothing does not cross tangent discontinuities, a point that argues against existing (regularization) algorithms for computing motion. This leads us to consider how singularities can be confronted directly within optical flow computations, and we conclude with two hypotheses: (1) that singularities are represented within the motion system as multiple directions at the same retinotopic location; and (2) for component gratings to cohere, they must be at the same depth from the viewer. Both hypotheses have implications for the neural computation of coherent motion.

1 Introduction
Imagine waves opening onto a beach. Although the dominant physical direction is inward, the visual impression is of strong lateral movement. This impression derives from the interaction between the crests of waves adjacent in time, and is an instance of a much more general phenomenon: whenever partially overlapping (or occluding) objects move with respect to one another, the point where their bounding contours intersect creates a singularity (Zucker and Casson 1985). Under certain conditions this singularity represents a point where the two motions can cohere into a compound percept, and therefore carries information about possible occlusion and relative movement. Another example is the motion of the point of contact between the blades of a closing scissors; the singular point moves toward the tip as the scissors are closed. The scissors example illustrates a key point about coherent motion: hold the scissors in one position and observe that it is possible to leave the singular point in two different ways, by traveling in one direction onto one blade, or in another direction onto the other blade. Differentially this corresponds to taking a limit, and intuitively leads to thinking of representing the singular point as a point at which the contour has two tangents. Such is precisely the representation we have suggested for tangent discontinuities in early vision (Zucker et al. 1989), and one of our goals in this paper is to show how it can be extended to coherent motion computations.

*Fellow, Canadian Institute for Advanced Research.
†Courant Institute of Mathematical Sciences and Center for Neural Science, New York University, New York, NY, and Vrije University, Amsterdam, The Netherlands.

Neural Computation 2, 44-57 (1990)
© 1990 Massachusetts Institute of Technology

The previous discussion was focused on two one-dimensional contours coming together, and we now extend the notions of singular points and coherent motion to two-dimensional (texture) patterns. In particular, if a “screen” of parallel diagonal lines is superimposed onto a pattern of lines at a different orientation, then a full array of intersections (or singular points) can be created. The proviso, of course, is that the two patterns be at about the same depth; otherwise they could appear as two semitransparent sheets.
Adelson and Movshon (1982) extended such constructions into motion, and, using sinusoidal gratings, showed that coherent compound motion can arise if one pattern is moved relative to the other. To illustrate, suppose one grid is slanted to the left of the vertical, the other to the right, and that they are moving horizontally in opposite directions. The compound motions of each singular point will then cohere into the percept of a rigid texture moving vertically. Thus the compound pattern can be analyzed in terms of its component parts. But compound motion arises in more natural situations as well, and gives rise to coherent motion that is neither rigid nor uniform. Again to illustrate, superimposed patterns often arise in two different ways in densely arbored forest habitats (e.g., Fig. 2 in Allman et al. 1985). First, consider an object (say a predator) with oriented surface markings lurking in the trees; the predator’s surface markings interact with the local orientation of the foliage to create a locus of singular points. A slight movement on either part would create compound motion at these points, which would then cohere into the predator’s image. Thus, singular points and coherent motion are useful for separating figure from ground. More complex examples arise in this same way, e.g., between nearby trees
undergoing flexible or different motions, and suggest that natural coherent motion should not be limited to that arising from rigid, planar objects; nonrigid and singular configurations should arise as well. Second, different layers of forest will interact to create textures of coherent motion under both local motion and an observer's movement; distinguishing these coherent motion displays from (planar) single textures (e.g., a wallpaper pattern) can depend on depth as well. Thus, nonrigid pattern deformations, discontinuities, and depth matter, and our first contribution in this paper is to introduce a new class of visual stimuli for exhibiting them. The stimuli build on the planar, rigid ones previously studied by Adelson and Movshon (1982), but significantly enlarge the possibilities for psychophysical, physiological, and computational studies. In particular, the perceptual salience of "corners" within them implies that algorithms for the neural computation of coherent motion require significant modification from those currently available. We propose that a multiple tangent representation, known to be sufficient to represent tangent discontinuities in curves, can be extended to handle them, and show how such ideas are consistent with the physiology of visual area MT. Finally, the interaction of motion coherence and depth is briefly considered.

2 Nonuniform Coherent Motion Displays
Compound motion displays are created from two patterns, denoted $P_1$ and $P_2$, where, for the Adelson and Movshon displays, the $P_i$ were sinusoidal gratings oriented at $\theta_1$ and $\theta_2$, respectively. Observe initially that patterns of parallel curves work as well as the sinusoidal gratings; the components can be thought of as square waves (alternating black and white stripes) oriented at different angles. Parallel random dot Moire patterns ("Glass patterns") work as well (Zucker and Iverson 1987), and we now show that patterns that are not uniformly linear also work. It is this new variation (in the orthogonal direction) that introduces nonuniformities into the coherent motion display. We consider two nonuniform patterns, one based on a sinusoidal variation, and the other triangular. As we show, these illustrate the variety of nonrigid and singular patterns that can arise naturally.

2.1 Sinusoidal Variation. The first nonuniform compound motion pattern is made by replacing one of the constant patterns with a variable one, say a grating composed of parallel sinusoids rather than lines (Fig. 1). Note that this is different from the Adelson and Movshon display, because now the sinusoidal variation is in position and not in intensity. Sliding the patterns across one another, the result is a nonconstant motion field
Figure 1: Illustration of the construction of smooth but nonuniform coherent motion patterns. The first component pattern (left) consists of a field of displaced sinusoidal curves, oriented at a positive angle (with respect to the vertical), while the second component consists of displaced parallel lines (right) oriented at a negative angle. The two patterns move across each other in opposite directions, e.g., pattern (left) is moved to the left, while pattern (right) is moved to the right. Other smooth functions could be substituted for either of these.
(Fig. 2), for which three coherent interpretations are possible (in addition to the noncoherent, transparent one):
1. Two-dimensional sliding swaths, or a flat display in which the compound motion pattern appears to be a flat, but nonrigid rubber sheet that is deforming into a series of alternating wide strips, or swaths, each of which moves up and down at what appears to be a constant rate with "elastic" interfaces between the strips. The swath either moves rapidly or slowly, depending on the orientation of the sinusoid, and the interfaces between the swaths appear to stretch in a manner resembling viscous flow. The situation here is the optical flow analog of placing edges between the "bright" and the "dark" swaths on a sinusoidal intensity grating.

2. Three-dimensional compound grating, in which the display appears to be a sinusoidally shaped staircase surface in depth on which a cross-hatched pattern has been painted. The staircase appears rigid, and the cross-hatched pattern moves uniformly back and forth across it. Or, to visualize it, imagine a rubber sheet on which two bar gratings have been painted to form a cross-hatched grating. Now, let a sinusoidally shaped set of rollers be brought in from behind, and let the rubber sheet be stretched over the rollers. The apparent motion corresponds to the sheet moving back and forth over the rollers.

3. Three-dimensional individual patterns, in which the display appears as in (2), but only with the sinusoidal component painted onto the staircase surface. The second, linear grating appears separate, as if it were projected from a different angle. To illustrate with an intuitive example, imagine a sinusoidal hill, with trees casting long, straight shadows diagonally across it. The sinusoid then appears to be rigidly attached to the hill, while the "shadows" appear to move across it.

Steven W. Zucker, Lee Iverson, and Robert R. Hummel
Figure 2: Calculation of the flow fields for the patterns in Figure 1: (upper left) the normal velocity to the sinusoidal pattern; (upper right) the normal velocity to the line pattern; (bottom) the compound velocity.
Figure 3: Illustration of the construction of nonuniform coherent motion patterns with discontinuities. The first component pattern (left) consists of a field of displaced triangular curves, oriented at a positive angle, while the second component again consists of displaced parallel lines (right) oriented at a negative angle. The two patterns move across each other as before. Again, other functions involving discontinuities could be substituted for these.
2.2 Triangular Variation. Replacing the sinusoidal grating with a triangular one illustrates the emergence of percepts with discontinuities, or sharp corners. The same three percepts are possible, under the same display conditions, except the smooth patterns in depth now have abrupt changes, and the swaths in (1) have clean segmentation boundaries between them (see Figs. 3 and 4). Such discontinuity boundaries are particularly salient, and differ qualitatively from patterns with high curvatures in them (e.g., high-frequency sinusoids). The subjective impression is as if the sinusoidal patterns give rise to an elastic percept, in which the imaged object stretches and compresses according to curvature, while the triangular patterns give rise to sharp discontinuities.
2.3 Perceptual Instability. To determine which of these three possible percepts are actually seen, we implemented the above displays on a Symbolics 3650 Lisp Machine. Patterns were viewed on the console as black dots on a bright white background, with the sinusoid (or triangular wave) constructed as in Figure 1. The patterns were viewed informally by more than 10 subjects, either graduate students or visitors to the laboratory, and all reported a spontaneous shift from one percept to another. Percepts (1) and (2) seemed to be more common than (3), but individual variation was significant. The spontaneous shifts from
one percept to another were qualitatively unchanged for variations in amplitude, frequency, and line width (or dot size, if the displays were made with Glass patterns) of about an order of magnitude. The shifting from one percept to another was not unlike the shifts experienced with Necker cube displays. Many subjects reported that eye movements could contribute to shifts between percepts, and also that tracking an element of one component display usually leads to a percept of two transparent sheets.
Figure 4: Calculation of the flow fields for the patterns in Figure 3: (upper left) the normal velocity to the triangular pattern; (upper right) the normal velocity to the line pattern; (bottom) the compound velocity. Velocity vectors at the singular points of the triangle component are shown with small open circles, indicating that two directions are associated with each such point. In the bottom figure, the open circles indicate what we refer to as the singular points of coherent motion, or those positions at which two compound motion vectors are attached.
3 Local Analysis of Moving Intersections
Given the existence of patterns that exhibit nonuniform compound motion, we now show how the characterization of rigid compound motion can be extended to include them. To begin, observe that one may think of compound motion displays either as raw patterns that interact, or as patterns of moving "intersections" that arise from these interactions. Concentrate now on the intersections, and imagine a pattern consisting of gratings of arbitrarily high frequency, so that the individual undulations shrink to lines. Each intersection is then defined by two lines, and the distribution of intersections is dense over the image. (Of course, in realistic situations only a discrete approximation to such dense distributions of intersections would obtain.) Now, concentrate on a typical intersection, whose motion we shall calculate. (Observe that this holds for each point in the compound image.) The equations for the lines meeting at a typical intersection $\mathbf{x} = (x, y)$ can be written

$$\mathbf{n}_1 \cdot \mathbf{x} = c_1 + v_1 t$$
$$\mathbf{n}_2 \cdot \mathbf{x} = c_2 + v_2 t$$

where $\mathbf{n}_i$, $i = 1, 2$, are the normals to the lines in the first and the second patterns, respectively, $c_i$ are their intercepts, and $v_i$ are their (normal) velocities. Observe that the simultaneous solution of these equations is equivalent to the Adelson and Movshon (1982) "intersection of constraints" algorithm (their Fig. 1). In matrix form we have

$$\begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} + t \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$$

We can rewrite this equation as

$$N \mathbf{x}(t) = \mathbf{c} + t \mathbf{v}$$

Differentiating both sides with respect to $t$, we obtain

$$N \dot{\mathbf{x}}(t) = \mathbf{v}$$

where $\mathbf{v} = (v_1, v_2)^T$. Thus, the velocity of the intersection of two moving lines can be obtained as the solution to a matrix equation, and is as follows (from Cramer's rule):

$$\dot{x}(t) = \frac{v_1 n_{22} - v_2 n_{12}}{\Delta}, \qquad \dot{y}(t) = \frac{n_{11} v_2 - n_{21} v_1}{\Delta}$$

where $\Delta$ is the matrix determinant, $\Delta = n_{11} n_{22} - n_{12} n_{21}$, and $\dot{\mathbf{x}}(t) = [\dot{x}(t), \dot{y}(t)]$.
Several special cases deserve comment. Let $(n_{11}, n_{12}) = \mathbf{n}_1$ and $(n_{21}, n_{22}) = \mathbf{n}_2$. Now, suppose that $\mathbf{n}_1$ and $\mathbf{n}_2$ are perpendicular, so that the two lines meet at a right angle. Once again, assume that the normal velocities of the two lines are $v_1$ and $v_2$, respectively. Then the velocity of the intersection is readily seen to be the vector sum of the velocities of the two lines:

$$\dot{\mathbf{x}}(t) = v_1 \mathbf{n}_1 + v_2 \mathbf{n}_2$$
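The calculation above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it solves the 2 x 2 matrix equation by Cramer's rule and confirms that the perpendicular case reduces to the vector sum. The function names (`intersection_velocity`, `unit_normal`, `dot`) and all numerical values are assumptions chosen for illustration, and the normals are assumed to have unit length.

```python
import math


def intersection_velocity(n1, n2, v1, v2):
    """Velocity of the intersection of two moving lines, via Cramer's rule.

    n1, n2 are the line normals (n11, n12) and (n21, n22);
    v1, v2 are the normal speeds of the two lines.
    """
    n11, n12 = n1
    n21, n22 = n2
    delta = n11 * n22 - n12 * n21  # determinant of N
    if abs(delta) < 1e-12:
        raise ValueError("parallel lines: no unique intersection")
    x_dot = (v1 * n22 - v2 * n12) / delta
    y_dot = (n11 * v2 - n21 * v1) / delta
    return (x_dot, y_dot)


def unit_normal(angle_deg):
    """Unit normal at the given angle in degrees (helper assumed for this sketch)."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


# Special case: perpendicular unit normals give the vector sum v1*n1 + v2*n2.
n1, n2 = unit_normal(30.0), unit_normal(120.0)
v1, v2 = 2.0, -1.0
general = intersection_velocity(n1, n2, v1, v2)
vector_sum = (v1 * n1[0] + v2 * n2[0], v1 * n1[1] + v2 * n2[1])

# Both lines attached to one object moving with velocity u: the normal speeds
# are projections of u, and the intersection moves with u itself.
u = (1.5, -0.5)
m1, m2 = unit_normal(30.0), unit_normal(100.0)
recovered = intersection_velocity(m1, m2, dot(u, m1), dot(u, m2))
```

The second check illustrates why the construction yields coherent motion for a single rigid object: when both normal speeds are projections of the same velocity, the intersection velocity is that velocity.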
The simplest case involving a distribution of intersections is two sets of parallel lines, each set having orientations given by $\mathbf{n}_1$ and $\mathbf{n}_2$, and moving with a uniform (normal) velocity of $v_1$ and $v_2$, respectively. Then all of the intersections will have the same velocity, given by the solution $\dot{\mathbf{x}}$ to the matrix equation. As Adelson and Movshon found, the overall percept in this situation is a uniform motion of $\dot{\mathbf{x}}$. More generally, as we showed, there may be many lines and edges, oriented and moving as a function of their position. Thus, there will be many intersections moving according to the above matrix equation. If the line elements (with normals $\mathbf{n}_1$ and $\mathbf{n}_2$) are associated with objects that are themselves moving with velocities $\mathbf{v}_1$ and $\mathbf{v}_2$, then the normal velocities of the lines are obtained from the projections $v_1 = \mathbf{v}_1 \cdot \mathbf{n}_1$ and $v_2 = \mathbf{v}_2 \cdot \mathbf{n}_2$. The velocity of the intersection point then satisfies the matrix equation $N \dot{\mathbf{x}} = (v_1, v_2)^T$ at each instant $t$. Thus reliable estimates of the velocity and normal (or tangent) at each position are integral to compound motion.

4 Implications for Neural Computation

It is widely held that theories of motion need not treat discontinuities directly, and that "segmentation" is a problem separate from motion. This view has led to a rash of "regularized" algorithms with three key features: (1) smoothing is uniform and unconditional; (2) single, unique values are demanded as the solution at each point; and (3) discontinuities, if addressed at all, are relegated to an adjoint process (Bulthoff et al. 1989; Wang et al. 1989; Yuille and Grzywacz 1988). We believe that all three of these features need modification, and submit that the current demonstrations are evidence against them; if regularization-style algorithms were applied to the sine- and the triangle-wave coherent motion patterns, the smoothing would obscure the differences between them.
To properly treat these examples, algorithms must be found in which discontinuities are naturally localized and smoothing does not cross over them. Furthermore, we question whether single values should be required at each position, and submit that representations permitting multiple values at a position should be sought. Such multiple-valued representations are natural for transparency, and, as we show below, are natural for representing discontinuities (in orientation and optical flow) as well.
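The effect of uniform, unconditional smoothing on a discontinuity can be illustrated with a toy example. The sketch below is not any of the cited algorithms; it applies a plain box average (an assumption standing in for "uniform smoothing") to a hypothetical one-dimensional slice of a flow component that steps abruptly at a corner. The function name `box_smooth` and the sample values are illustrative only.

```python
def box_smooth(values, radius):
    """Uniform smoothing: average each sample over a window of up to
    2 * radius + 1 neighbors (the window is clipped at the borders)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Hypothetical slice of one flow component across a corner of the triangle
# wave: the value jumps from -1 to +1 at the singular point.
step = [-1.0] * 10 + [1.0] * 10
smoothed = box_smooth(step, radius=3)

# Largest change between neighboring samples, before and after smoothing.
max_jump_before = max(abs(b - a) for a, b in zip(step, step[1:]))
max_jump_after = max(abs(b - a) for a, b in zip(smoothed, smoothed[1:]))
```

Before smoothing, the jump is confined to a single position; afterwards it is spread over the whole window as a ramp, so the corner can no longer be localized. A smooth (e.g., sinusoidal) profile and a triangular one become harder to tell apart after such averaging, which is the concern raised above.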
Before beginning, however, we must stress that there is not yet sufficient information to state precisely how the computation of compound motion is carried out physiologically, or what the precise constraints are for coherence. The analysis in the previous section represents an idealized mathematical competence, and its relationship to biology remains to be determined. Nevertheless, several observations are suggestive. First, it indicates that one need not try to implement the graphical version of the Adelson and Movshon (1982) "intersection of constraints" algorithm literally, but, now that the mathematical requirements are given, any number of different implementations become viable formally. Biologically it is likely that the computation invokes several stages, and the evidence is that initial measurements of optical flow are provided by cells whose receptive fields resemble space-time filters, tuned for possible directions of (normal) motion (Movshon et al. 1985). Abstractly the filters can be thought of as being implemented by (e.g.) Gabor functions, truncated to local regions of space-time. Such filters provide a degree of smoothing, which is useful in removing image quantization and related effects, but which also blurs across distinctions about which filter (or filters) is (are) signaling the actual motion at each point. In fact, because of their broad tuning, many are usually firing to some extent. An additional selection process is thus required to refine these confused signals, and it is in this selection process that the inappropriate regularization has been postulated to take place. To illustrate, a selection procedure for compound motion was proposed by Heeger (1988) from the observation that a translating compound grating occupies a tilted plane in the frequency domain.
(This comes from the fact that each translating sinusoidal grating occupies a pair of points along a line in spatial-frequency space; the plane is defined by two lines, one from each component grating.) After transforming the Gabor filters' responses into energy terms, Heeger's selection process reduces to fitting a plane. However, the fitting cannot be done pointwise; rather, an average is taken over a neighborhood, effectively smoothing nearby values together. This is permissible in some cases, e.g., for the planar, rigid patterns that Adelson and Movshon studied. But it will fail for the examples in this paper, rounding off the corners within the triangle waves. It cannot handle transparency either, because a single value is enforced at each point (only one plane can be fit). Other variations in this same spirit, based on Tikhonov regularization or other ad hoc (e.g., "winner-take-all") ideas, differ in the averaging that they employ, but still impose smoothness and single-valuedness on the solution (Bulthoff et al. 1989; Wang et al. 1989; Yuille and Grzywacz 1988). They cannot work in general. A different variation on the selection procedure relaxes the requirement that only a single value be assigned to each position, incorporates a highly nonlinear type of smoothing, and is designed to confront discontinuities directly. It is best introduced by analogy with orientation selection
(Zucker and Iverson 1987). Consider a static triangle wave. Zucker et al. (1988, 1989) propose that the goal of orientation selection is a coarse description of the local differential structure, through curvature, at each position. It is achieved by postulating an iterative, possibly intercolumnar, process to refine the initial orientation estimates (analogous to the initial motion measurements) by maximizing a functional that captures how the local differential estimates fit together. This is done by partitioning all possible curves into a finite number of equivalence classes, and then evaluating support for each of them independently. An important consequence of this algorithm is that, if more than one equivalence class is supported by the image data at a single point, then both enter the final representation at that point. This is precisely what happens at a tangent discontinuity, with the supported equivalence classes containing the curves leading into the discontinuity (example: a static version of the scissors example in the Introduction). Mathematically this corresponds to the Zariski tangent space (Shafarevich 1977); and physiologically the multiple values at a single point could be implemented by multiple neurons (with different preferred orientations) firing within a single orientation hypercolumn. Now, observe that this is precisely the structure obtained for the coherent motion patterns in the Introduction to this paper: singular points are defined by two orientations, each of which could give rise to a compound motion direction. Hence we propose that multiple motion direction vectors are associated with the points of discontinuity, that is, with the corners of the triangle waves, and that it is these singularities that give rise to the corners in coherent motion. Stating the point more formally, the singular points (or corners) of coherent motion derive directly from the singular points (or corners) of component motion, and for exactly the same reason.
These points are illustrated in Figure 4 (bottom) by the open circles. But for such a scheme to be tractable physiologically, we require a neural architecture capable of supporting multiple values at a single retinotopic position. The evidence supports this, since (1) compound motion may well be computed within visual area MT (Movshon et al. 1985; Rodman and Albright 1988), and (2) there is a columnar organization (around direction of motion) in MT to support multiple values (i.e., there could be multiple neurons firing within a direction-of-motion hypercolumn) (Albright et al. 1984). Before such a scheme could be viable, however, a more subtle requirement needs to be stressed. The tuning characteristic for a direction-selective neuron is typically broad, suggesting that multiple neurons would typically be firing within a hypercolumn. Therefore, exactly as in orientation selection, some nonlocal processing would be necessary to focus the firing activity, and to constrain multiple firings to singularities. In orientation selection we proposed that these nonlocal interactions be implemented as intercolumnar interactions (Zucker et al. 1989); and, again by analogy, now suggest that these nonlocal interactions exist for
compound motion as well. That they further provide the basis for interpolation (Zucker and Iverson 1987) and for defining regions of coherence also seems likely. The triangle-wave example deserves special attention, since it provides a bridge between the orientation selection and optical flow computations. In particular, for nonsingular points on the triangle wave, there is a single orientation and a single direction-of-motion vector. Thus the compound motion computation can run normally. However, at the singular points of the triangle wave there are two orientations (call them $\mathbf{n}_i$ and $\mathbf{n}_o$); each of these defines a compound motion with the diagonal component (denoted simply $\mathbf{n}$). Thus, in mathematical terms, there are three possible ways to formulate the matrix equation, with $(\mathbf{n}_i, \mathbf{n})$, $(\mathbf{n}_o, \mathbf{n})$, and $(\mathbf{n}_o, \mathbf{n}_i)$. The solutions to the first two problems define the two compound motion vectors that define the corner, while the third combination simply gives the translation of the triangle wave at the singular point. In summary, we have:

Conjecture 1. Singularities are represented in visual area MT analogously to the way they are represented in V1; that is, via the activity of multiple neurons representing different direction-of-motion vectors at about the same (retinotopic) location.

We thus have that coherent pattern motion involves multiple data concerning orientation and direction at a single retinotopic location, but there is still a remaining question of depth. That depth likely plays a role was argued in the Introduction, but formally enters as follows. Recall that the tilted plane for rigid compound motion (e.g., in Heeger's algorithm) resulted from the combination of component gratings. But a necessary condition for physical components to belong to the same physical object is that they be at the same depth, otherwise a figure/ground or transparency configuration should obtain. MT neurons are known to be sensitive to depth, and Allman et al.
(1985) have speculated that interactions between depth and motion exist. We now refine this speculation to the following conjecture:

Conjecture 2. The subpopulation of MT neurons that responds to compound motion agrees with the subpopulation that is sensitive to zero (or to equivalent) disparity.

There is some indirect evidence in support of this conjecture, in that Movshon et al. (1985) (see also Rodman and Albright 1988) have reported that only a subpopulation of MT neurons responds to compound pattern motion, and Maunsell and Van Essen (1983) have reported that a subpopulation of MT neurons is tuned to zero disparity. Perhaps these are the same subpopulations. Otherwise more complex computations relating depth and coherent motion will be required. As a final point, observe that all of the analysis of compound motion was done in terms of optical flow, or the projection of the (3-D) velocities
onto the image plane, yet two of the three possible percepts reported by subjects were three-dimensional. If there is another stage at which these 3-D percepts are synthesized, they could certainly take advantage of the notion that discontinuities are represented as multiple values at a point; each value then serves as the boundary condition for one of the surfaces meeting at the corner.
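The three pairings described in this section for a singular point of the triangle wave can be illustrated numerically. The sketch below is a hypothetical worked example, not the authors' code: the labels `n_i` and `n_o` for the two corner orientations, `n` for the line grating, the angles, and the two pattern velocities are all assumptions chosen for illustration. Each pairing is solved with the same 2 x 2 matrix equation used for ordinary intersections.

```python
import math


def solve_pair(n1, v1, n2, v2):
    """Solve n1 . xdot = v1 and n2 . xdot = v2 for xdot by Cramer's rule."""
    delta = n1[0] * n2[1] - n1[1] * n2[0]
    if abs(delta) < 1e-12:
        raise ValueError("normals are parallel")
    return ((v1 * n2[1] - v2 * n1[1]) / delta,
            (n1[0] * v2 - n2[0] * v1) / delta)


def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


# Assumed setup: the triangle-wave pattern slides left, the line pattern right.
u_triangle = (-1.0, 0.0)
u_lines = (1.0, 0.0)

n_i = unit(20.0)   # normal of the segment leading into the corner (assumed)
n_o = unit(70.0)   # normal of the segment leading out of the corner (assumed)
n = unit(-45.0)    # normal of the diagonal line grating (assumed)

# Normal speeds are projections of each pattern's velocity onto its normals.
v_i, v_o, v_n = dot(u_triangle, n_i), dot(u_triangle, n_o), dot(u_lines, n)

corner_a = solve_pair(n_i, v_i, n, v_n)       # first compound motion vector
corner_b = solve_pair(n_o, v_o, n, v_n)       # second compound motion vector
translation = solve_pair(n_i, v_i, n_o, v_o)  # motion of the triangle wave itself

gap = math.hypot(corner_a[0] - corner_b[0], corner_a[1] - corner_b[1])
```

Under these assumptions, `corner_a` and `corner_b` are distinct, which is the pair of compound motion vectors attached to the singular point, while `translation` reproduces the velocity of the triangle pattern.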
Acknowledgments

This research was sponsored by NSERC Grant A4470 and by AFOSR Grant 89-0260. We thank Allan Dobbins for comments.
References

Adelson, E. H., and Movshon, J. A. 1982. Phenomenal coherence of moving visual patterns. Nature (London) 300, 523-525.
Albright, T. L., Desimone, R., and Gross, C. 1984. Columnar organization of directionally selective cells in visual area MT of the macaque. J. Neurophysiol. 51, 16-31.
Allman, J., Miezin, F., and McGuinness, E. 1985. Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal area (MT). Perception 14, 85-105.
Bulthoff, H., Little, J., and Poggio, T. 1989. A parallel algorithm for real time computation of optical flow. Nature (London) 337, 549-553.
Heeger, D. 1988. Optical flow from spatio-temporal filters. Int. J. Comput. Vision 1, 279-302.
Maunsell, J. H. R., and Van Essen, D. 1983. Functional properties of neurons in middle temporal visual area of macaque monkey. II. Binocular interactions and sensitivity to binocular disparity. J. Neurophysiol. 49, 1148-1167.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. 1985. The analysis of moving visual patterns. In Study Group on Pattern Recognition Mechanisms, C. Chagas, R. Gattass, and C. Gross, eds. Pontifica Academia Scientiarum, Vatican City.
Rodman, H., and Albright, T. 1988. Single-unit analysis of pattern-motion selective properties in the middle temporal visual area (MT). Preprint, Dept. of Psychology, Princeton University, Princeton, NJ.
Shafarevich, I. R. 1977. Basic Algebraic Geometry. Springer-Verlag, New York.
Wang, H. T., Mathur, B., and Koch, C. 1989. Computing optical flow in the primate visual system. Neural Comp. 1, 92-103.
Yuille, A., and Grzywacz, N. 1988. The motion coherence theory. Proc. Second Int. Conf. Comput. Vision, IEEE Catalog No. 88CH2664-1, Tarpon Springs, FL, pp. 344-353.
Zucker, S. W., and Casson, Y. 1985. Sensitivity to change in early optical flow. Invest. Ophthalmol. Visual Sci. (Suppl.) 26(3), 57.
Zucker, S. W., Dobbins, A., and Iverson, L. 1989. Two stages of curve detection suggest two styles of visual computation. Neural Comp. 1, 68-81.
Zucker, S. W., and Iverson, L. 1987. From orientation selection to optical flow. Comput. Vision, Graphics, Image Process. 37, 196-220.
Zucker, S. W., David, C., Dobbins, A., and Iverson, L. 1988. The organization of curve detection: Coarse tangent fields and fine spline coverings. Proc. Second Int. Conf. Comput. Vision, IEEE Catalog No. 88CH2664-1, Tarpon Springs, FL, pp. 568-577.
Received 10 July 1989; accepted 8 January 1990.
Communicated by Gordon M. Shepherd
A Complementarity Mechanism for Enhanced Pattern Processing
James L. Adams*
Neuroscience Program, 73-346 Brain Research Institute, University of California, Los Angeles, CA 90024 USA
The parallel ON- and OFF-center signals flowing from retina to brain suggest the operation of a complementarity mechanism. This paper shows what such a mechanism can do in higher-level visual processing. In the proposed mechanism, inhibition and excitation, both feedforward, coequally compete within each hierarchical level to discriminate patterns. A computer model tests complementarity in the context of an adaptive, self-regulating system. Three other mechanisms (gain control, cooperativity, and adaptive error control) are included in the model but are described only briefly. Results from simulations show that complementarity markedly improves both speed and accuracy in pattern learning and recognition. This mechanism may serve not only vision but other types of processing in the brain as well.

1 Introduction
We know that the human genome contains a total of about 100,000 genes. Only a fraction of that total relates to vision. So few genes cannot explicitly encode the trillions of connections of the billions of neurons comprising the anatomical structures of vision. Hence, general mechanisms must guide its development and operation, but they remain largely unknown. The objective of the research reported in this paper was to deduce and test mechanisms that support stability and accuracy in visual pattern processing. The research combined analysis of prior experimental discoveries with synthesis of mechanisms based on considerations of the requirements of a balanced system. Several mechanisms were proposed. One of these, complementarity, is presented in this paper.

2 Foundations and Theory
Beginning with traditional aspects of vision models, I assume that patterns are processed hierarchically. Although this has not been conclusively demonstrated (Van Essen 1985), most experimental evidence shows selectivity for simple features at early stages of visual processing and for complex features or patterns at later stages.¹ I also assume that signals representing lower order visual features converge to yield higher order features (Hubel and Wiesel 1977). Inasmuch as the retina is a specialized extension of the brain itself, we might expect to find mechanisms common to both. Consequently, I partially base the complementarity mechanism of this paper on certain well-established findings from the retina. Retinal photoreceptors respond in proportion to the logarithm of the intensity of the light striking them (Werblin 1973). Since the difference between two logarithmic responses is mathematically identical to the logarithm of the ratio of the two stimulus intensities, relative intensities can be measured by simple summation of excitatory and inhibitory signals. The ratios of intensities reflected from different portions of an object are characteristic of the object and remain fairly constant for different ambient light intensities.² Use of this property simplifies pattern learning and recognition.

*Current address: Department of Neurobiology, Barrow Neurological Institute, 350 W. Thomas Road, Phoenix, AZ 85013-4496 USA.
Neural Computation 2, 58-70 (1990)
© 1990 Massachusetts Institute of Technology

Prediction 1. Constant ratios and overlapping feedforward. A given feature at a particular location within the visual field is accurately represented under varying stimulus intensities if (1) the neurons responding to that feature and those responding to the complement of that feature fire at the same relative rates for different levels of illumination³ and (2) the ratios of responses from different parts of an object are obtained via overlapping inhibitory and excitatory feedforward projections.

Photoreceptor cells depolarize in darkness and hyperpolarize in light, generating the strongest graded potentials in darkness (Hagins 1979).
Even so, a majority of the tonic responses to illumination transmitted from the retinal ganglion cells to the lateral geniculate nucleus represent two complementary conditions: (1) excitation by a small spot of light of a particular color and inhibition by a surround of a contrasting color or (2) inhibition by a center spot and excitation by a contrasting surround (De Monasterio and Gouras 1975). Such ON-center and OFF-center signals vary from cell to cell, most reporting contrasts in color and others reporting contrasts in light intensity. ON- and OFF-center activities are maintained separately at least as far as the visual cortex (Schiller et al. 1986).
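The logarithmic-response argument earlier in this section can be checked with a few lines of arithmetic. The sketch below is an idealization for illustration only: it assumes a purely logarithmic response and reflected light proportional to incident light, and the names `photoreceptor_response` and `relative_signal` as well as the reflectance and illumination values are hypothetical.

```python
import math


def photoreceptor_response(intensity):
    """Idealized response proportional to the log of light intensity (assumed)."""
    return math.log(intensity)


def relative_signal(intensity_a, intensity_b):
    """Excitation driven by patch A summed with inhibition driven by patch B.

    Because log(a) - log(b) == log(a / b), the result depends only on the
    ratio of the two intensities.
    """
    return photoreceptor_response(intensity_a) - photoreceptor_response(intensity_b)


# Two patches of one object with fixed (hypothetical) reflectances.
reflectance_a, reflectance_b = 0.8, 0.2

# Reflected intensity = ambient intensity * reflectance, at two light levels.
dim = relative_signal(10.0 * reflectance_a, 10.0 * reflectance_b)
bright = relative_signal(1000.0 * reflectance_a, 1000.0 * reflectance_b)
```

Here `dim` and `bright` agree (both equal log 4 up to rounding) even though the ambient light differs by a factor of 100, which is the constancy that Prediction 1 relies on.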
¹The term "feature" is used in this paper to refer to a configuration of stimuli or signals for which one or more pattern processing neurons optimally respond.
²This follows from a law of physics that the light reflected from a surface is linearly proportional to the intensity of the incident light.
³Some examples of low-level complementary features are dark versus bright (opposites in intensity), vertical versus horizontal lines (geometrically perpendicular), and red versus green or yellow versus blue (complementary colors).
Prediction 2. Constant sums. If a cortical neuron that responds to a stimulus at a given position within a sensory receptive field is most active for a particular stimulus, there is another neuron nearby⁴ that is least active for that same stimulus. Similarly, a stimulus provoking a high rate of discharge in the second neuron results in a low rate of discharge in the first. The sum of their activities remains approximately constant for a wide range of stimuli.

Corollary Prediction 2a. Constant sums for complementary features. With a fixed level of illumination, as a feature at a given location is transformed into the complement of that feature (e.g., rotation of a vertical edge to a horizontal position), the sum of the firing rates of the cortical neurons optimal for the original feature and those optimal for its complement remains constant. The relative levels of activity of the two shift inversely with the change in character of the stimulus, one group becoming stronger and the other becoming weaker.

Prediction 3. System balance. In order that complementary features compete equally, each feature triggers equal amounts of parallel excitatory and inhibitory feedforward activity.⁵ While some neurons are excited in response to a particular feature, others are inhibited, and vice versa. This also contributes to the overall balance of the system.⁶

In addition to an overall dynamic balance of excitatory and inhibitory activity, consider all connections, both active and inactive, per neuron. In a network with the total strengths of excitatory and inhibitory connections balanced only globally, or even locally, we could expect to find some neurons with mostly excitatory connections and others with mostly inhibitory connections. However, that is contrary both to what has been observed in studies of connections in the cerebral cortex (e.g., see Lund 1981) and to what can be functionally predicted.
⁴Within the local region of cortex serving the same sensory modality and the same position in the receptive field.
⁵A continuous balance is more easily achieved in a system of parallel feedforward excitation and inhibition than in one of temporally alternating feedforward excitation and lateral inhibition.
⁶Ordinarily, the firing of an individual neuron has little effect on neighboring neurons not connected to it. The shift of ions across the membrane of the firing cell is shunted away from the neighboring cells by the extracellular fluid. Hypothetically, though, a problem might arise if a high percentage of neurons within a neighborhood received only excitation and some of them fired in synchrony. The shifts in potential could be sufficient to induce spurious firings of neighboring neurons before the shunting was completed. Such a problem would not occur, however, in a system in which approximately equal amounts of excitation and inhibition arrived at any given instant within every small region of neural circuitry. The parallel feedforward described in this prediction could provide the required balance.

Prediction 4. Neuronal connection balance. On each neuron, the feedforward input connections are balanced such that the total (active and inactive) excitatory input strength equals the total inhibitory input strength.⁷

Although the strongest experimental evidence for a complementarity mechanism has been found in the lowest processing levels of vision, there is no a priori reason to believe it will not be found in higher processing levels as well.

Prediction 5. Complementarity in successive processing levels. In each hierarchical level of pattern processing, new complementarities are formed.

One cannot expect the brain to achieve perfect complementarity at all times. Nevertheless, one can predict that self-adjusting feedback mechanisms continually restore the nervous system toward such equilibrium. Dynamically, this process can be compared with homeostasis. The agents for the competition between complementary signals in the retina are the bipolar, horizontal, and amacrine cells (Dowling 1979). In the cerebral cortex, the agents for the competition may be excitatory and inhibitory stellate interneurons, although much less is known about the functional roles of neurons in the cortex than in the retina. Stellate neurons occur throughout all layers of the cortex but are especially numerous in input layer IV (Lorente de Nó 1949). Whether or not they are the exact agents for competing complementary signals in the cortex, they are well-placed for providing parallel excitatory and inhibitory feedforward signals from the specific input projections to the output pyramidal neurons.

3 Rules for Implementing Complementarity
Mechanistically, complementarity can be achieved throughout the processing hierarchy by implementing the following rules:
1. In a processing hierarchy, let the output projections from one level of processing excite both excitatory and inhibitory interneurons that form plastic connections onto the feature-selective neurons of the next higher level.
2. On those neurons that receive plastic connections, adjust both active and inactive connections. If the net activation is excitatory, simultaneously reinforce the activated excitatory input connections and the nonactivated inhibitory input connections.9 Similarly, if the net activation is inhibitory, simultaneously reinforce the activated inhibitory input connections and the nonactivated excitatory input connections.
3. Maintain a fixed total strength of input plastic connections onto any given neuron. For each neuron, require half of that total strength be excitatory and half inhibitory. The requirement for fixed total input connection strengths is met by making some connections weaker as others are made stronger. Those made weaker are the ones not satisfying the criteria for reinforcement; thus, they lose strength passively.
7 The firing of such a neuron depends upon the instantaneous predominance of either excitation or inhibition among its active connections.
8 Connections whose strengths change with increasing specificity of responses.
9 A connection is "activated" if it has received a recent input spike and certain undefined responses have not yet decayed. Nonactivated connections respond to some feature(s) other than the one currently activating the other input connections.
4 Network for Testing the Theory
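As a minimal sketch of the three reinforcement rules listed in Section 3, the following Python function updates the plastic input weights of a single postsynaptic neuron. It is not the author's APL2 code. The function name, the additive step `rate`, the definition of net activation as a weighted sum, and the rescaling used to hold each half of the total strength fixed are all assumptions made for illustration.

```python
import numpy as np

def complementarity_update(w_exc, w_inh, active_exc, active_inh,
                           rate=0.05, total=1.0):
    """One illustrative reinforcement step for one postsynaptic neuron.

    w_exc, w_inh: positive strengths of excitatory / inhibitory inputs.
    active_exc, active_inh: boolean masks of currently activated inputs.
    rate, total: hypothetical step size and total plastic input strength.
    """
    w_exc = np.asarray(w_exc, dtype=float).copy()
    w_inh = np.asarray(w_inh, dtype=float).copy()
    active_exc = np.asarray(active_exc, dtype=bool)
    active_inh = np.asarray(active_inh, dtype=bool)

    # Assumed definition of net activation: active excitation minus
    # active inhibition, each weighted by connection strength.
    net = w_exc[active_exc].sum() - w_inh[active_inh].sum()

    if net > 0:
        # Rule 2, excitatory case: reinforce activated excitatory and
        # nonactivated inhibitory connections.
        w_exc[active_exc] += rate
        w_inh[~active_inh] += rate
    elif net < 0:
        # Rule 2, inhibitory case: reinforce activated inhibitory and
        # nonactivated excitatory connections.
        w_inh[active_inh] += rate
        w_exc[~active_exc] += rate

    # Rule 3: half of the fixed total is excitatory, half inhibitory.
    # Rescaling makes the non-reinforced connections lose strength passively.
    w_exc *= (total / 2) / w_exc.sum()
    w_inh *= (total / 2) / w_inh.sum()
    return w_exc, w_inh

new_exc, new_inh = complementarity_update(
    [0.3, 0.2], [0.25, 0.25], [True, False], [True, False])
```

In this example the net activation is excitatory, so the first excitatory weight and the second (nonactivated) inhibitory weight grow while the other two shrink, and each group still sums to half of `total`.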
The complementarity mechanism was tested with a three-layer network of 1488 simulated neurons (see Fig. 1). This particular network accepts a 12 x 12 input array of 144 binary values. The neuron types important to the processing by the network and their prototypical connectivity are shown in Figure 2. Each simulated neuron has a membrane area on which the positive (excitatory) and negative (inhibitory) input signals are integrated. The neurons are connected with strengths (weights) that vary with the reinforcement history of each neuron. The connection strength is one of the determinants of the amount of charge delivered with each received input spike. Another determinant is the gain, used to make the network self-regulating in its modular firing rate. That is, the gain is automatically adjusted up or down as necessary to move each module toward a certain number of output spikes generated for a given number of input spikes received. The gain tends toward lower values as a module becomes more specific in its responses. The regulatory feedforward neurons (RFF in Fig. 2) modulate the changes in connection strengths whenever an error occurs. (An error is defined to be the firing of an output neuron in response to a combination of input signals other than that to which it has become "specialized.") This modulation shifts the weights in a manner that reduces the future likelihood of erroneous firing. The error detection and correction occur without an external teacher. That, however, is not the subject of this paper.10 A neuron fires when its membrane potential (charge per unit area) exceeds a predetermined threshold level. Upon firing, the neuron's membrane potential is reset to the resting level (each cycle of the simulation represents about 10 msec), and a new period of integration begins for that neuron.
Based on an assumption of reset action by basket cells, a modular, winner-take-all scheme resets the potentials of all neurons within a module whenever a single output neuron in that module fires.
10 For further information on the error control mechanism, see Adams (1989).
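The threshold firing and modular winner-take-all reset described above can be sketched as follows. This is an illustration under stated assumptions, not the simulation code: the function name, the threshold and resting values, and the tie-breaking choice (the neuron with the highest potential is taken as the one that fires) are hypothetical.

```python
import numpy as np

def step_module(potential, charge, threshold=1.0, resting=0.0):
    """Advance one module by one simulation cycle (illustrative only).

    potential: membrane potentials of the output neurons in the module.
    charge: net charge delivered to each neuron during this cycle.
    Returns the new potentials and the index of the neuron that fired,
    or None if no neuron exceeded the threshold.
    """
    potential = np.asarray(potential, dtype=float) + np.asarray(charge, dtype=float)
    # Assumption: if several neurons cross threshold in the same cycle,
    # the one with the highest potential is treated as the winner.
    winner = int(np.argmax(potential))
    if potential[winner] > threshold:
        # Winner-take-all reset: every neuron in the module returns to rest.
        return np.full_like(potential, resting), winner
    return potential, None

p_fire, w_fire = step_module([0.2, 0.9, 0.1], [0.1, 0.3, 0.1])
p_quiet, w_quiet = step_module([0.2, 0.3], [0.1, 0.1])
```

In the first call the second neuron crosses threshold and the whole module is reset. In the second call no neuron fires and integration simply continues.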
Figure 1: Network layout. The input array (IA) accepts the binary input and generates axon spikes that project directly without fan-out or fan-in to the simulated neural elements of processing level 1 (PL1). PL1 is divided into 16 modules representing hypercolumns as defined by Hubel and Wiesel (1977). The output from PL1 converges onto a single module in processing level 2 (PL2). Each dashed line represents 9 axons (144 axons project from IA to PL1 and another 144 from PL1 to PL2). Each module of PL1 generates local feature identifications and the single module of PL2 generates pattern identifications.
Figure 2: Basic neuronal layout of levels PL1 and PL2. Only prototypical connections are shown. The IS and ES neurons connect to all primary pyramidal (P) neurons within a module. Likewise, each RFF neuron sends modulatory connections to all of the same P neurons. All AFB neurons connect to all RFF neurons within the same module, but some of these connections may have zero strength. (Legend: pyramidal neurons, primary and secondary; excitatory interneurons; inhibitory interneuron; regulatory interneuron.)
Signals are transmitted from one neuron to succeeding neurons via axons. Referring to Figure 2, each input axon to a module makes fixed-strength excitatory connections to one inhibitory stellate neuron and one excitatory stellate. It also makes a fixed-strength inhibitory connection to one RFF neuron, omitting an inhibitory interneuron for simplicity. Every stellate neuron within a module connects to every primary pyramidal (P) neuron within that module. These connections are initialized with random strengths. The strength of each can evolve within a range of zero up to a maximum of about one-quarter the total supportable input synaptic strength of the postsynaptic cell. In practice, the actual strengths that develop fall well between the two extremes. It is the balancing of these stellate to pyramidal connections that employs the complementarity mechanism. Each primary pyramidal cell connects to one secondary pyramidal (S) for relay of output to the next level of processing and to one adaptive feedback (AFB) neuron. An AFB neuron makes simple adaptive connections to all of the RFF neurons within the module. The strength of each AFB to RFF connection, starting from zero, slowly "adapts" to the recent level of inhibition of the RFF cell. These adaptive feedback and regulatory feedforward cells are the major components of the adaptive error control mechanism. If the amount of excitation of an RFF cell exceeds the amount of its inhibition by a threshold amount, the RFF cell generates a regulatory error correction signal that projects to all primary pyramidal neurons within the module. Another mechanism used in the simulations is cooperativity. This mechanism restricts the primary pyramidal neurons to respond to the mutual action of multiple input connections. It also causes the initially random input connection strengths onto each neuron to approach
values that reflect how much they represent particular information. Thus, in a variation of the Hebb (1949) neurophysiological postulate, frequent synaptic participation in the firing of a neuron in this model leads to an equalization of the strengths of the mutually active presynaptic connections. Although the strongest of these connections often get weaker as a result, the ensemble of reinforced mutually active connections grows stronger. This also applies to inhibitory connections onto neurons whose responses are complementary to those of the neurons receiving reinforced excitatory connections. The overall effect is to increase the accuracy of representing information. The mechanisms of gain control, adaptive error control, and cooperativity contributed to the performance of the network simulations reported here. But for the current paper, those three mechanisms must remain in the background. For further discussion see Adams (1989). Both the control simulations and the experimental simulations in which complementarity was deactivated equally employed the other mechanisms.
5 Testing Procedures
The simulation software was written in APL2, a high-level language developed by IBM and based on the theoretical work of K. E. Iverson, J. A. Brown, and T. More. This language was chosen for its power and flexibility in manipulating vectors and multidimensional arrays. Although APL2 supports speed in the development of code, it does not have a compiler. Consequently, the resulting code will not run as fast as optimized and compiled FORTRAN code, for example. The simulations were run on an IBM 3090-200VF. APL2 automatically invokes the fast vector hardware on this machine, but only a small amount of my code was able to take advantage of it due to the short vectors of parameters associated with each simulated neuron.
The network was trained with eight variations in position and/or orientation of parallel lines (Fig. 3). This resulted in the development of feature selectors in each module for lines of each of the eight variations. The network was then tested with the eight different patterns of Figure 4, some geometric and others random combinations of short line segments. During this testing, the second processing level (PL2 in Fig. 1) developed responses representing the different patterns.11 Each pattern was presented until every module had fired at least 30 times since the last error (defined in the previous section). The performance of the network was measured in terms of (1) how well each pattern was distinguished from all other patterns, (2) how many simulation cycles were required to reach the stated performance criterion, and (3) the number of errors produced in the process. Three control and three experimental simulations were run. The control simulations each began with a different set of random strength synaptic connections. The experimental simulations began with the same initial sets of connection strengths as the controls but were run with the complementarity mechanism deactivated.
11 The use of line segments is not important to the model. What is significant is that feature-selective neurons in PL1 evolve to represent combinations of input activity in IA and pattern-selective neurons in PL2 evolve to represent combinations of feature-selective activity in PL1.
Figure 3: The training patterns (patterns 1 to 8). Each pattern is presented to level IA (input array) as a 12 x 12 binary array, where a 1 represents an input spike (shown here as an asterisk) and a 0 represents a nonfiring position (shown here as a blank).
6 Results
Figure 4: The test patterns (patterns 9 to 16). These patterns are presented in the same manner as the training patterns, except that they are not presented until after training has been completed for level PL1 (processing level 1).
Both with and without complementarity, the overall network design incorporating gain control, cooperativity, and adaptive error control was able to achieve accurate discrimination of the patterns.12 However, the number of simulation cycles required to attain equivalent performance without complementarity was double what it was with complementarity.13 The most dramatic evidence of the benefit of complementarity was the five times greater number of errors made by the network in the process of learning the patterns when complementarity was not used. This is illustrated in Figures 5 and 6.
12 The formula used to measure discrimination is given in Adams (1989); it is omitted here since discrimination was not a problem.
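The timing figures reported in footnote 13 can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (10 msec per cycle, 2390 versus about 5000 cycles, 35 versus 73 min of CPU time); the variable names are illustrative.

```python
# Values quoted in the text and in footnote 13.
cycle_ms = 10                              # simulated time per cycle (approx.)
cycles_with, cycles_without = 2390, 5000   # cycles to reach the criterion
cpu_min_with, cpu_min_without = 35, 73     # CPU minutes used

# Simulated real time with complementarity, in seconds (about 24 sec).
simulated_sec_with = cycles_with * cycle_ms / 1000

# CPU seconds per simulation cycle ("nearly 1 sec" in the footnote).
cpu_sec_per_cycle_with = cpu_min_with * 60 / cycles_with
cpu_sec_per_cycle_without = cpu_min_without * 60 / cycles_without

# Relative cost of learning without complementarity (roughly double).
cycle_ratio = cycles_without / cycles_with
cpu_ratio = cpu_min_without / cpu_min_with

print(simulated_sec_with, cpu_sec_per_cycle_with, cycle_ratio, cpu_ratio)
```

Both ratios come out slightly above 2, which agrees with the statement that learning without complementarity took about twice as many cycles.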
7 Conclusions
Computer simulations show that a complementarity mechanism of synaptic reinforcement markedly improves visual pattern learning. The network learns faster and responds more accurately when each feature or pattern classifying neuron has an equal opportunity to be excited to greatest activity by the feature or pattern it represents and to be inhibited to least activity by a complementary feature or pattern. Based on the experimental evidence for complementarity in early visual processing and the theoretical advantages demonstrated for its use in higher level processing, I predict that future studies of the brain will reveal its presence throughout the hierarchy of visual pattern processing. Moreover, even though this research was founded on the visual system, the complementarity mechanism would seem to apply equally well to other sensory, and even motor, hierarchies in the brain.
13 Simulation cycle: the computer processing that occurs for each simulated instant in time. In these simulations, a cycle representing an interval of about 10 msec of real time used nearly 1 sec of the computer's Central Processing Unit (CPU) time. With complementarity, 35 min of CPU time simulating about 24 sec of real time was required for the system to learn all eight training patterns. Without complementarity 73 min of CPU time was required to reach equivalent performance. In terms of simulation cycles, the values were 2390 cycles with complementarity versus about 5000 cycles without it.
Figure 5: Errors with and without complementarity. Both sequences began with identical values of the random-valued synaptic strengths. The errors are shown in the order of pattern presentation, which was the same for both cases. Any pair of columns represents a single pattern. Thus, as much as possible, any difference between the two cases is due only to the presence or absence of the complementarity mechanism. (Plot labels: complementarity used versus disabled; random initial state number 1; first eight training sessions.)
Figure 6: Summary plot of relative errors with and without complementarity. Results from the control and experimental runs for three different sets of random initial values of synaptic strength. Pairs of error counts are used as the coordinates in this 2D plot. That is, the total errors in learning a given pattern with complementarity active is used for the horizontal position of a plotted point and the total errors in learning that same pattern with complementarity disabled is used for the vertical position. Each such point is connected by a line. The solid, dotted, and dashed lines show the results for three different sets of initial synaptic strengths. The solid line corresponds to the data in Figure 5. The 1:1 line shows where the lines would fall if complementarity had no effect.
Acknowledgments
This research was partially supported by NIH Grants 5 T32 MH1534506, 07. Computing resources were provided by the IBM Corporation under its Research Support Program. Faculty sponsor was J. D. Schlag, University of California, Los Angeles.
References
Adams, J. L. 1989. The Principles of "Complementarity," "Cooperativity," and "Adaptive Error Control" in Pattern Learning and Recognition: A Physiological Neural Network Model Tested by Computer Simulation. Ph.D. dissertation, University of California, Los Angeles. On file at University Microfilms Inc., Ann Arbor, Michigan.
De Monasterio, F. M. and Gouras, P. 1975. Functional properties of ganglion cells of the rhesus monkey retina. J. Physiol. 251, 167-195.
Dowling, J. E. 1979. Information processing by local circuits: The vertebrate retina as a model system. In The Neurosciences Fourth Study Program, F. O. Schmitt and F. G. Worden, eds. MIT Press, Cambridge, MA.
Hagins, W. A. 1979. Excitation in vertebrate photoreceptors. In The Neurosciences Fourth Study Program, F. O. Schmitt and F. G. Worden, eds. MIT Press, Cambridge, MA.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Hubel, D. H. and Wiesel, T. N. 1977. Functional architecture of macaque monkey visual cortex. Proc. R. Soc. London B 198, 1-59.
Lorente de Nó, R. 1949. The structure of the cerebral cortex. In Physiology of the Nervous System, 3rd ed., J. F. Fulton, ed. Oxford University Press, New York.
Lund, J. S. 1981. Intrinsic organization of the primate visual cortex, area 17, as seen in Golgi preparations. In The Organization of the Cerebral Cortex, F. O. Schmitt, F. G. Worden, G. Adelman, and S. G. Dennis, eds. MIT Press, Cambridge, MA.
Schiller, P. H., Sandell, J. H., and Maunsell, J. H. R. 1986. Functions of the ON and OFF channels of the visual system. Nature (London) 322, 824-825.
Van Essen, D. C. 1985. Functional organization of primate visual cortex. In Cerebral Cortex, Vol. 3, A. Peters and E. G. Jones, eds. Plenum, New York.
Werblin, F. S. 1973. The control of sensitivity in the retina. Sci. Am. January, 70-79.
Received 11 July 1989; accepted 30 November 1989.
Communicated by Graeme Mitchison
Hebb-Type Dynamics is Sufficient to Account for the Inverse Magnification Rule in Cortical Somatotopy
Kamil A. Grajski*
Michael M. Merzenich
Coleman Memorial Laboratories, University of California, San Francisco, CA 94143 USA
The inverse magnification rule in cortical somatotopy is the experimentally derived inverse relationship between cortical magnification (area of somatotopic map representing a unit area of skin surface) and receptive field size (area of restricted skin surface driving a cortical neuron). We show by computer simulation of a simple, multilayer model that Hebb-type synaptic modification subject to competitive constraints is sufficient to account for the inverse magnification rule.
1 Introduction
The static properties of topographic organization in the somatosensory system are well characterized experimentally. Two somatosensory map-derived variables are of particular interest: (1) receptive field size and (2) cortical magnification. Receptive field size for a somatosensory neuron is defined as the restricted skin surface (measured in mm2) which maximally drives (measured by pulse probability) the unit. Cortical magnification is the cortical region (measured in mm2) over which neurons are driven by mechanical stimulation of a unit area of skin. Somatotopic maps are characterized by an inverse relationship between cortical magnification and receptive field size (Sur et al. 1980). Recent experimental studies of reorganization of the hand representation in adult owl monkey cortex bear directly on the inverse magnification rule. First, cortical magnification is increased (receptive field sizes decreased) for the one or more digit tips stimulated over a several week period in monkeys undergoing training on a behavioral task (Jenkins et al. 1989). Second, Jenkins and Merzenich (1987) reduced cortical magnification (increased receptive field sizes) for restricted hand surfaces by means of focal cortical lesions.
*Present address: Ford Aerospace, Advanced Development Department/MSX-22, San Jose, CA 95161-9041 USA.
Neural Computation 2, 71-84 (1990) © 1990 Massachusetts Institute of Technology
Previous neural models have focused on self-organized topographic maps (Willshaw and von der Malsburg 1976; Takeuchi and Amari 1979; among others). The model networks typically consist of a prewired multilayer network in which self-organizing dynamics refines the initially random or roughly topographic map. The capacity of self-organization typically takes the form of an activity-dependent modification for synaptic strengths, e.g., a Hebb-type synapse, and is coupled with mechanisms for restricting activity to local map regions, that is, lateral inhibition. Finally, input to the system contains correlational structure at distinct spatiotemporal scales, depending on the desired complexity of model unit response properties. Are these mechanisms sufficient to account for the new experimental data? In this report we focus on the empirically derived inverse relationship between receptive field size and cortical magnification. We present a simple, multilayer model that incorporates Hebb-type synaptic modification to simulate the long-term somatotopic consequences of the behavioral and cortical lesion experiments referenced above.
2 The Model
The network consists of three hierarchically organized two-dimensional layers (see Fig. 1). A skin (S) layer consists of a 15 x 15 array of nodes. Each node projects to the topographically equivalent 5 x 5 neighborhood of excitatory (E) cells located in the subcortical (SC) layer. We define the skin layer to consist of three 15 x 5 digits (Left, Center, Right). A standard-sized stimulus (3 x 3) is used for all inputs. Each of the 15 x 15 SC nodes contains an E and an inhibitory (I) cell. The E cell acts as a relay cell, projecting to a 7 x 7 neighborhood of the 15 x 15 cortical (C) layer. In addition, each relay cell makes collateral excitatory connections to a 3 x 3 local neighborhood of inhibitory cells; each inhibitory cell makes simple hyperpolarizing connections with a local neighborhood (5 x 5) of SC E cells. Each of the 15 x 15 C nodes also contains an E and I cell. The E cell is the exclusive target of ascending connections and the exclusive origin of descending connections: the descending connections project to a 5 x 5 neighborhood of E and I cells in the topographically equivalent SC zone. Local C connections include local E to E cell connections (5 x 5) and E to I cell connections (5 x 5). The C I cells make simple hyperpolarizing connections intracortically to E cells in a 7 x 7 neighborhood. No axonal or spatial delay is modelled; activity appears instantaneously at all points of projection. (For all connections, the density is constant as a function of distance from the node. A correction term is calculated for each type of connection in each layer to establish planar boundary conditions.) The total number of cells is 900; the total number of synapses is 57,600.
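The stated totals can be checked against the neighborhood sizes given above. The breakdown below is one accounting that happens to reproduce both numbers; the source does not spell it out. It assumes that skin nodes are not counted as cells, that every node has the full neighborhood (boundary corrections ignored), and that the cortical E to E neighborhood excludes the self-connection. That last point is an inference made here, not a statement from the paper.

```python
NODES = 15 * 15  # nodes per layer

# Cells: one E and one I cell at every SC node and every C node.
cells = NODES * 2 * 2

# Outgoing connections per source node, by pathway (neighborhood sizes
# taken from the model description).
fan_out = {
    "S -> SC E": 5 * 5,
    "SC E -> C E": 7 * 7,
    "SC E -> SC I": 3 * 3,
    "SC I -> SC E": 5 * 5,
    "C E -> SC E": 5 * 5,
    "C E -> SC I": 5 * 5,
    "C E -> C E": 5 * 5 - 1,  # assumption: no self-connection
    "C E -> C I": 5 * 5,
    "C I -> C E": 7 * 7,
}
synapses = NODES * sum(fan_out.values())

print(cells, synapses)
```

With these assumptions the per-node fan-out sums to 256, giving 225 x 256 synapses.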
Systematic variation of these connectivity patterns is discussed elsewhere (Grajski and Merzenich, in preparation).
Figure 1: Organization of the network model. (A) Layer organization. On the left is shown the divergence of projections from a single skin site to the subcortex and its subsequent projection to the cortex: Skin (S) to Subcortex (SC), 5 x 5; SC to Cortex (C), 7 x 7. S is "partitioned" into three 15 x 5 "digits" Left, Center, and Right. The standard S stimulus used in all simulations is shown lying on digit Left. On the right is shown the spread of projection from C to SC, 5 x 5. (B) Local circuits. Each node in the SC and C layers contains an excitatory (E) and inhibitory cell (I). In C, each E cell forms excitatory connections with a 5 x 5 patch of I cells; each I cell forms inhibitory connections with a 7 x 7 patch of E cells. In SC, these connections are 3 x 3 and 5 x 5, respectively. In addition, in C only, E cells form excitatory connections with a 5 x 5 patch of E cells. The spatial relationship of E and I cell projections for the central node is shown at left.
The model neuron is the same for all E and I cells: an RC-time constant membrane that is depolarized and (additively) hyperpolarized by linearly weighted connections:
$u_i^{X,Y}$ - membrane potential for unit $i$ of type $Y$ on layer $X$; $v_i^{X,Y}$ - firing rate for unit $i$ of type $Y$ on layer $X$; $\delta_i$ - skin units are OFF (= 0) or ON (= 1); $\tau_m$ - membrane time constant (with respect to unit time); $w_{ij}^{\mathrm{post}(x,y)\,\mathrm{pre}(X,Y)}$ - connection to unit $i$ of postsynaptic type $y$ on postsynaptic layer $x$ from units of presynaptic type $Y$ on presynaptic layer $X$. Each summation term is normalized by the number of incoming connections (corrected for planar boundary conditions) contributing to the term. (Numerical integration is with the Adams-Bashforth predictor method bootstrapped with the fourth-order RK method; with the step size used (0.2), a corrector step is unnecessary.) Each unit converts membrane potential to a continuous-valued output value $v_i$ via a sigmoidal function representing an average firing rate ($\beta = 4.0$):
Synaptic strength is modified in three ways: (1) activity-dependent change, (2) passive decay, and (3) normalization. Activity-dependent and passive decay terms are as follows:
$w_{ij}$ - connection from cell $j$ to cell $i$; $\tau_{syn} = 0.01\,\tau_m = 0.005$ - time constant for passive synaptic decay; $\alpha = 0.05$ - the maximum activity-dependent step change; $v_j$, $v_i$ - pre- and postsynaptic output values, respectively. Further modification occurs by a multiplicative normalization performed over the incoming connections for each cell. The normalization is such that the summed total strength of incoming connections is R:
$N_i$ - number of incoming connections for cell $i$; $w_{ij}$ - connection from cell $j$ to cell $i$; R = 2.0 - the total resource available to cell $i$ for redistribution over its incoming connections.
Network organization is measured using procedures that mimic experimental techniques. Figure 2 shows the temporal response to an applied standard stimulus. Figure 3 shows the stereotypical spatiotemporal response pattern to the same input stimulus. Our model differs from those cited above in that inhibition is determined by network dynamics, not by applying a predefined lateral-inhibition function. Cortical magnification is measured by "mapping" the network, for example, noting which 3 x 3 skin patch most strongly drives each cortical E cell. The number of cortical nodes driven maximally by the same skin site is the cortical magnification for that skin site. Receptive field size for a C (SC) layer E cell is estimated by stimulating all possible 3 x 3 skin patches (169) and noting the peak response. Receptive field size is defined as the number of 3 x 3 skin patches which drive the unit at ≥ 50% of its peak response.
Figure 2: Temporal response of the self-organizing network to a pulsed input at the skin layer site shown at lower right. Temporal dynamics are a result of net architecture and self-organization - no lateral inhibitory function is explicitly applied. (Legend: cortical E cell, cortical I cell, thalamic E cell, thalamic I cell; stimulus site and stimulus duration are marked; horizontal axis: time in normalized units.)
3 Results
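The two mapping measures defined at the end of Section 2 can be sketched as below, assuming the peak response of every cortical E cell to every 3 x 3 skin patch has already been recorded in a table. The function name and the array layout are hypothetical, and responses are assumed to be positive so that the 50% criterion is meaningful.

```python
import numpy as np

def map_network(responses, criterion=0.5):
    """Compute mapping measures from a (n_patches, n_cells) response table.

    responses[p, c] is the peak response of cell c to skin patch p.
    Returns, per cell, the best-driving patch and the receptive field size,
    and, per patch, the cortical magnification.
    """
    responses = np.asarray(responses, dtype=float)
    peak = responses.max(axis=0)           # peak response of each cell
    best_patch = responses.argmax(axis=0)  # patch that drives it most strongly
    # Receptive field size: number of patches driving the cell at or above
    # the criterion fraction of its peak response.
    rf_size = (responses >= criterion * peak).sum(axis=0)
    # Cortical magnification: number of cells maximally driven by each patch.
    magnification = np.bincount(best_patch, minlength=responses.shape[0])
    return best_patch, rf_size, magnification

# A 15 x 15 skin layer has (15 - 3 + 1)**2 distinct 3 x 3 patches.
n_patches = (15 - 3 + 1) ** 2

# Toy table with 3 patches and 3 cells.
toy = [[1.0, 0.9, 0.1],
       [0.6, 0.2, 0.3],
       [0.1, 0.5, 0.8]]
best, rf, mag = map_network(toy)
```

In the toy table two cells are driven best by patch 0, so that patch has magnification 2, while the third cell has the smallest receptive field.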
The platform for simulating the long-term somatotopic consequences of the behavioral and cortical lesion experiments is a refined topographic network. The initial net configuration is such that the location of individual connections is predetermined by topographic projection. Initial connection strengths are drawn from a Gaussian distribution ($\mu = 2.0$, $\sigma^2 = 0.2$). The refinement process is iterative. Standard-sized skin patches are stimulated (pulse ON for 2.5 normalized time steps, then OFF) in random sequence, but lying entirely within the single digit borders "defined" on the skin layer, that is, no double-digit stimulation. Note, however, that during mapping all possible (169) skin patches are stimulated so that double-digit fields (if any) can be detected. For each patch, the network is allowed to reach steady-state while the plasticity rule is ON. Immediately following steady-state, synaptic strengths are renormalized as described above. (Steady-state is defined as the time point at which C and SC layer E cell depolarization changes by less than 1%. Time to reach steady state averages 3.5-4.5 normalized time units.) The refinement procedure continues until two conditions are met: (1) fewer than 5% of all E cells change their receptive field location and (2) receptive field areas (using the 50% criterion) change by no more than ±1 unit area for 95% of E cells. Such convergence typically requires 10-12 complete passes over the skin layer.
Figure 3: Spatiotemporal response of the self-organizing network to same stimulus as in Figure 2. Surround inhibition accompanies early focal excitation, which gives way to late, strong inhibition.
Simulated mapping experiments show that the refined map captures many features of normal somatotopy. First, cortical magnification is proportional to the frequency of stimulation: (1) equal-sized representations for each digit but (2) greater magnification for the surfaces located immediately adjacent to each digit's longitudinal axis of symmetry. Second, topographic order is highly preserved in all directions within and between digit representations. Third, discontinuities occur between representations of adjacent digits. Fourth, receptive fields are continuous, single-peaked, and roughly uniform in size. Fifth, for adjacent (within-digit) sites, receptive fields overlap significantly (up to 70%, depending on connectivity parameters); overlap decreases monotonically with distance. (Overlap is defined as the intersection of receptive field areas as defined above.) Last, the basis for refinement of topographic properties is the emergence of spatial order in the patterning of afferent and efferent synaptic weights. (See Discussion.)
Jenkins et al. (1989) describe a behavioral experiment that leads to cortical somatotopic reorganization. Monkeys are trained to maintain contact with a rotating disk situated such that only the tips of one or two of their longest digits are stimulated. Monkeys are required to maintain this contact for a specified period of time in order to receive food reward.
Comparison of pre- and post-stimulation maps (or the latter with maps obtained after varying periods without disk stimulation) reveal up to nearly %fold differences in cortical magnification and reduction in receptive field size for stimulated skin. We simulate the above experiment by extending the refinement process described above, but with the probability of stimulating a restricted skin region increased 5:l. Figure 4 shows the area selected for stimulation as well as its pre- and poststimulation SC and C layer representations. Histograms indicate the changes in distributions of receptive field sizes. The model reproduces the experimentally observed inverse relationship between cortical magnification and receptive field size (among several other observations). Subcortical results show an increase in area of representation, but with no significant change in receptive field areas. The subcortical results are predictive - no direct subcortical measurements were made for the behavioral study monkeys. The inverse magnification rule predicts that a decrease in cortical magnification is accompanied by an increase in receptive field areas. Jenkins and Merzenich (1987) tested this hypothesis by inducing focal cortical lesions in the representation of restricted hand surfaces, for example, a
Kamil A. Grajski and Michael M. Merzenich
78
SKIN SURFACE COACTIVATION INllALCORllCAL
FINAL CORTICAL
INllAL SUBCORTICAL
ANALSUBMRTICAL
CO-ACTIVATED SKIN
Figure 4: (A) Simulation of behaviorally controlled skin stimulation experiment (Jenkinset al. 1990). The initial and final zones of (sub)corticalrepresentation of coactivated skin are shown. The coactivated skin (shown at far left) is stimulated 5:l over remaining skin sites using a 3 by 3 stimulus (see Figs. 1-3). single digit. A long-term consequence of such a manipulation is profound cortical map reorganization characterized by (1) a reemergence of a representation of the skin formerly represented in the now lesioned zone in the intact surrounding cortex, (2) the new representation is at the expense of cortical magnification of skin originally represented in those regions, so that (3) large regions of the map contain neurons with abnormally large receptive fields, for example, up to several orders of magnitude larger. We simulate this experiment by "lesioning" the regton of the cortical layer shown in Figure 5. The refinement process described above is continued under these new conditions until topographic map and receptive field size measures converge. The reemergence of representation and changes in distributions of receptive field areas are also shown in Figure 5. The model reproduces the experimentally observed inverse relationship between cortical magnification and receptive field size (among many other phenomena). These results depend on randomizing the synaptic strengths of all remaining intracortical and cortical afferent
[Figure 4, continued. Histogram panels; x-axis: area (arbitrary units); legend: Normal Subcortex, BehStim Subcortex.]
Figure 4: Cont’d. Histograms depict changes in receptive field areas for topographically equivalent, contiguous, equal-sized zones in (sub)cortical layer. There is a strong inverse relationship between cortical magnification and cortical receptive field size.
[Figure 5 panel labels: cortical lesion experiment; initial cortical; final cortical; initial subcortical; final subcortical; lesioned skin representation.]
Figure 5: Simulation of a somatosensory cortical lesion experiment (Jenkins and Merzenich 1987). The cortical zone representing the indicated skin regions is "destroyed." Random stimulation of the skin layer using a 3 by 3 patch leads to a reemergence of representation along the border of the lesion. Cortical magnification for all skin sites is decreased.
connections as well as enhancing cortical excitation (afferent or intrinsic) or reducing cortical inhibition by a constant factor, 2.0 and 0.5, respectively, in equation set 1. Otherwise, the intact representations possess an overwhelming competitive advantage. In general, this simulation was more sensitive to the choice of parameters than the preceding ones (see Discussion). Again, the subcortical results are predictive as no direct experimental data exist.
4 Discussion
The inverse magnification rule is one of several "principles" suggested by experimental studies of topographic maps. It is a powerful one as it links global and local map properties. We have shown that a coactivation
[Figure 5, continued. Histogram panels (b) and (c); x-axis: area (arbitrary units); legends: Normal Cortex, Lesioned Cortex (b); Normal Subcortex, Lesioned Subcortex (c).]
Figure 5: Cont'd. Histograms depicting changes in receptive field area for topographically equivalent, contiguous, equal-sized zones in (sub)cortical layers (left-most and right-most 1/3 combined). Note the increase in numbers of large receptive fields.
based synaptic plasticity rule operating under competitive conditions is sufficient to account for this basic organizational principle. What is the basis for these properties? Why does the model behave the way it does? The basis for these effects is the emergence and
reorganization of spatial order in the patterning of synaptic weights. In the unrefined map, connection strengths are distributed N(2.0, 0.2). Following stimulation this distribution alters to a multipeaked, wide-variance weight distribution, with high values concentrated as predicted by coactivation. For instance, some subcortical sites driven by the Center digit alone project to the cortex in such a way that parts of the projection zone cross the cortical digit representational border between Center and Right. The connection strengths onto cortical Center cells are observed to be one to two orders of magnitude greater than those onto cortical cells in the Right representation. Similarly, cortical mutually excitatory connections form local clusters of high magnitude. These are reminiscent of groups (see Pearson et al. 1987). However, in contradistinction to the Pearson study, in this model, receptive field overlap is maintained (among other properties) (Grajski and Merzenich, in preparation).
Synaptic patterning is further evaluated by removing pathways and restricting classes of plastic synapses. Repeating the above simulations using networks with no descending projections or using networks with no descending and no cortical mutual excitation yields largely normal topography and coactivation results. Restricting plasticity to excitatory pathways alone also yields qualitatively similar results. (Studies with a two-layer network yield qualitatively similar results.) Thus, the refinement and coactivation experiments alone are insufficient to discriminate fundamental differences between network variants. On the other hand, modeling the long-term consequences of digit amputation requires plastic cortical mutually excitatory connections (Grajski and Merzenich 1990). Simulations of the cortical lesion experiments are confounded by the necessity to redistribute synaptic resources following removal of 33% of the cortical layer's cells.
We explored randomization of the remaining cortical connections in concert with an enhancement of cortical excitation or reduction of cortical inhibition. Without some additional step such as this, the existing representations obtain a competitive advantage that blocks reemergence of representation. Whenever rerepresentation occurs, however, the inverse magnification relationship holds. The map reorganization produced by the model is never as profound as that observed experimentally, suggesting the presence of other nonlocal, perhaps injury-related and neuromodulatory effects (Jenkins and Merzenich 1987). This model extends related studies (Willshaw and von der Malsburg 1976; Takeuchi and Amari 1979; Ritter and Schulten 1986; Pearson et al. 1987). First, the present model achieves lateral inhibitory dynamics by a combination of connectivity and self-organization of those connections; earlier models either apply a lateral-inhibition function in local neighborhoods, or have other fixed inhibitory relationships. Second, the present model significantly extends a recently proposed model of somatosensory reorganization (Pearson et al. 1987) to include (1) a subcortical layer, (2) simulations of additional experimental data, (3) more accurate simulation
of normal and reorganized somatotopic maps, and (4) a simpler, more direct description of possible underlying mechanisms. The present model captures features of static and dynamic (re)organization observed in the auditory and visual systems. Weinberger and colleagues (1989) have observed auditory neuron receptive field plasticity in adult cats under a variety of behavioral and experimental conditions (see also Robertson and Irvine 1989). In the visual system, Wurtz et al. (1989) have observed that several weeks following induction of a restricted cortical lesion in monkey area MT, surviving neurons' receptive field area increased. Kohonen (1982), among others, has explored the computational properties of self-organized topographic mappings. A better understanding of the nature of real brain maps may support the next generation of topographic and related computational networks (e.g., Moody and Darken 1989).
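The lesion manipulation discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the 0.2 in N(2.0, 0.2) is read as a standard deviation, the lesioned zone is taken to be the first third of a one-dimensional list of cells, and the function names and the `gains` dictionary are hypothetical stand-ins for the constant factors that the text applies in equation set 1. None of the model's dynamics are reproduced.

```python
import random

def initial_weights(n_cells, n_inputs, mean=2.0, sd=0.2, seed=0):
    """Draw unrefined connection strengths from N(mean, sd); sd is an assumption."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, sd) for _ in range(n_inputs)] for _ in range(n_cells)]

def lesion(weights, fraction=0.33, boost="excitation", mean=2.0, sd=0.2, seed=1):
    """Remove a block of cells, re-randomize the survivors, and pick a gain change.

    Returns (survivor_weights, gains). Lesioning the first cells of the list is
    an illustrative choice, not the zone used in the paper's Figure 5.
    """
    rng = random.Random(seed)
    n_removed = round(fraction * len(weights))
    survivors = {
        cell: [rng.gauss(mean, sd) for _ in weights[cell]]  # fresh random strengths
        for cell in range(n_removed, len(weights))
    }
    if boost == "excitation":
        gains = {"excitation": 2.0, "inhibition": 1.0}  # enhance cortical excitation
    else:
        gains = {"excitation": 1.0, "inhibition": 0.5}  # reduce cortical inhibition
    return survivors, gains
```

Without one of the two gain changes, the text reports that the intact representations keep a competitive advantage and block reemergence; the sketch only records which change was chosen.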
Acknowledgments
This research was supported by NIH Grants NS10414 and GM07449, Hearing Research Inc., the Coleman Fund, and the San Diego Supercomputer Center. K.A.G. gratefully acknowledges reviewers' comments and helpful discussions with Terry Allard, Bill Jenkins, Gregg Recanzone, and especially Ken Miller.
References
Grajski, K. A. and Merzenich, M. M. 1990. Neural network simulation of somatosensory representational plasticity. In Neural Information Processing Systems, Vol. 2, D. Touretzky, ed., in press.
Grajski, K. A. and Merzenich, M. M. 1990. Hebbian synaptic plasticity in a multi-layer, distributed neural network model accounts for key features of normal and reorganized somatosensory (topographic) maps. In preparation.
Jenkins, W. M. and Merzenich, M. M. 1987. Reorganization of neocortical representations after brain injury: A neurophysiological model of the bases of recovery from stroke. In Progress in Brain Research, F. J. Seil, E. Herbert, and B. M. Carlson, eds., Vol. 71, pp. 249-266. Elsevier, Amsterdam.
Jenkins, W. M., Merzenich, M. M., Ochs, M. T., Allard, T., and Guic-Robles, E. 1990. Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophys. 63(1).
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43, 59-69.
Moody, J. and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Pearson, J. C., Finkel, L. H., and Edelman, G. M. 1987. Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. Neurosci. 7, 4209-4223.
Ritter, H. and Schulten, K. 1986. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybernet. 54, 99-106.
Robertson, D. and Irvine, D. R. F. 1989. Plasticity of frequency organization in auditory cortex of guinea pigs with partial unilateral deafness. J. Comp. Neurol. 282, 456-471.
Sur, M., Merzenich, M. M., and Kaas, J. H. 1980. Magnification, receptive-field area and "hypercolumn" size in areas 3b and 1 of somatosensory cortex in owl monkeys. J. Neurophys. 44, 295-311.
Takeuchi, A. and Amari, S. 1979. Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybernet. 35, 63-72.
Weinberger, N. M., Ashe, J. H., Metherate, R., McKenna, T. M., Diamond, D. M., and Bakin, J. 1990. Retuning auditory cortex by learning: A preliminary model of receptive field plasticity. Concepts Neurosci., in press.
Willshaw, D. J. and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London B 194, 431-445.
Wurtz, R., Komatsu, H., Dürsteler, M. R., and Yamasaki, D. S. G. 1990. Motion to movement: Cerebral cortical visual processing for pursuit eye movements. In Signal and Sense: Local and Global Order in Perceptual Maps, E. W. Gall, ed. Wiley, New York, in press.
Received 25 September 1989; accepted 20 December 1989.
Communicated by Terrence J. Sejnowski
Optimal Plasticity from Matrix Memories: What Goes Up Must Come Down
David Willshaw
Peter Dayan
Centre for Cognitive Science and Department of Physics, University of Edinburgh, Edinburgh, Scotland
A recent article (Stanton and Sejnowski 1989) on long-term synaptic depression in the hippocampus has reopened the issue of the computational efficiency of particular synaptic learning rules (Hebb 1949; Palm 1988a; Morris and Willshaw 1989): homosynaptic versus heterosynaptic and monotonic versus nonmonotonic changes in synaptic efficacy. We have addressed these questions by calculating and maximizing the signal-to-noise ratio, a measure of the potential fidelity of recall, in a class of associative matrix memories. Up to a multiplicative constant, there are three optimal rules, each providing for synaptic depression such that positive and negative changes in synaptic efficacy balance out. For one rule, which is found to be the Stent-Singer rule (Stent 1973; Rauschecker and Singer 1979), the depression is purely heterosynaptic; for another (Stanton and Sejnowski 1989), the depression is purely homosynaptic; for the third, which is a generalization of the first two, and has a higher signal-to-noise ratio, it is both heterosynaptic and homosynaptic. The third rule takes the form of a covariance rule (Sejnowski 1977a,b) and includes, as a special case, the prescription due to Hopfield (1982) and others (Willshaw 1971; Kohonen 1972).
In principle, the association between the synchronous activities in two neurons could be registered by a mechanism that increases the efficacy of the synapses between them, in the manner first proposed by Hebb (1949); the generalization of this idea to the storage of the associations between activity in two sets of neurons is in terms of a matrix of modifiable synapses (Anderson 1968; Willshaw et al. 1969; Kohonen 1972). This type of architecture is seen in the cerebellum (Eccles et al. 1968) and in the hippocampus (Marr 1971) where associative storage of the Hebbian type (Bliss and Lømo 1973) has been ascribed to the NMDA receptor (Collingridge et al. 1983; Morris et al. 1986). A number of questions concerning the computational power of certain synaptic
Neural Computation 2, 85-93 (1990)
© 1990 Massachusetts Institute of Technology
modification rules in matrix memories have direct biological relevance. For example, is it necessary, or merely desirable, to have a rule for decreasing synaptic efficacy under conditions of asynchronous firing, to complement the increases prescribed by the pure Hebbian rule (Hebb 1949)? The need for a mechanism for decreasing efficacy is pointed to by general considerations, such as those concerned with keeping individual synaptic efficacies within bounds (Sejnowski 1977b), and by more specific considerations, such as providing an explanation for ocular dominance reversal and other phenomena of plasticity in the visual cortex (Bienenstock et al. 1982; Singer 1985). There are two types of asynchrony between the presynaptic and the postsynaptic neurons that could be used to signal a decrease in synaptic efficacy (Sejnowski 1977b; Sejnowski et al. 1988): the presynaptic neuron might be active while the postsynaptic neuron is inactive (homosynaptic depression), or vice versa (heterosynaptic depression). We have explored the theoretical consequences of such issues.
We consider the storage of a number R of pattern pairs [represented as the binary vectors A(w) and B(w) of length m and n, respectively] in a matrix associative memory. The matrix memory has m input lines and n output lines, carrying information about the A-patterns and the B-patterns, respectively, each output line being driven by a linear threshold unit (LTU) with m variable weights (Fig. 1). Pattern components are generated independently and at random. Each component of an A-pattern takes the value 1 (representing the active state) with probability p and the value c (inactive state) with probability 1 - p. Likewise, the probabilities for the two possible states 1 and c for a component of a B-pattern are r and 1 - r. In the storage of the association of the wth pair, the amount $\Delta_{ij}$ by which the weight $W_{ij}$ is changed depends on the values of the pair of numbers $[A_i(w), B_j(w)]$.
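The storage step just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the assignment of the four rule parameters to the four (presynaptic, postsynaptic) activity states is an assumption, since the table defining it has not survived in this copy.

```python
import random

def make_patterns(num_pairs, m, n, p, r, c=0.0, seed=0):
    """Draw random A-patterns (length m) and B-patterns (length n).

    Each component is 1 with probability p (A) or r (B), and c otherwise.
    """
    rng = random.Random(seed)
    A = [[1.0 if rng.random() < p else c for _ in range(m)] for _ in range(num_pairs)]
    B = [[1.0 if rng.random() < r else c for _ in range(n)] for _ in range(num_pairs)]
    return A, B

def store(A, B, rule):
    """Accumulate W[i][j] as the sum over pattern pairs of the rule's change.

    rule = (alpha, beta, gamma, delta), ASSUMED to apply to the states
    (pre, post) = (inactive, inactive), (inactive, active),
    (active, inactive), (active, active), in that order.
    """
    alpha, beta, gamma, delta = rule
    table = {(False, False): alpha, (False, True): beta,
             (True, False): gamma, (True, True): delta}
    m, n = len(A[0]), len(B[0])
    W = [[0.0] * n for _ in range(m)]
    for a, b in zip(A, B):
        for i in range(m):
            for j in range(n):
                W[i][j] += table[(a[i] == 1, b[j] == 1)]
    return W
```

For example, `store(A, B, (0, 0, 0, 1))` counts, for each synapse, the pattern pairs in which both lines were active, which is the pure Hebbian case.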
Once the entire set of patterns has been learned, retrieval of a previously stored B-pattern is effected by the presentation of the corresponding A-pattern. The jth LTU calculates the weighted sum of its inputs,
$$ d_j[A(w)] = \sum_{i=1}^{m} W_{ij} A_i(w). $$
The state of output line j is then set to c or 1, according to whether $d_j[A(w)]$ is less than or greater than the threshold $\theta_j$. The signal-to-noise ratio $\rho$ is a measure of the ability of an LTU to act as a discriminator between those A(w) that are to elicit the output c and those that are to elicit the output 1. It is a function of the parameters of the system, and is calculated by regarding $d_j[A(w)]$ as the sum of two components: the signal, $s_j(w)$, which stems from that portion of the
Figure 1: The matrix memory, which associates A-patterns with B-patterns. Each weight $W_{ij}$ is a linear combination over the patterns:
$$ W_{ij} = \sum_{w=1}^{R} \Delta_{ij}(w), $$
where $\Delta$ is given in the table below.
The matrix shows the steps taken in the retrieval of the pattern B(w) that was previously stored in association with A(w). For good recall, the calculated output B' must resemble the desired output B(w).
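The retrieval steps shown in the figure can be sketched as below, assuming a weight matrix stored as a list of m rows of n values. The function names are illustrative only, and a dendritic sum exactly equal to its threshold is mapped to the inactive state c, a tie-breaking choice the text does not specify.

```python
def dendritic_sums(W, a):
    """Weighted sum of the inputs for each output line j: sum_i W[i][j] * a[i]."""
    m, n = len(W), len(W[0])
    return [sum(W[i][j] * a[i] for i in range(m)) for j in range(n)]

def recall(W, a, thresholds, c=0.0):
    """Set output j to 1 if its sum exceeds its threshold, else to c."""
    return [1.0 if d > theta else c
            for d, theta in zip(dendritic_sums(W, a), thresholds)]
```

Good recall then means that `recall(W, A_w, thresholds, c)` resembles the stored B-pattern for the same pair.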
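weights arising from the storage of pattern w, and the noise, $n_j(w)$, which is due to the contribution from all the other patterns to the weights:
$$ d_j[A(w)] = s_j(w) + n_j(w) = \sum_{i=1}^{m} A_i(w)\,\Delta_{ij}(w) + \sum_{i=1}^{m} A_i(w) \sum_{w' \neq w} \Delta_{ij}(w') \qquad (1) $$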
In most applications of signal-to-noise (S/N) analysis, the noise terms have the same mean and are uncorrelated between different patterns. When these assumptions are applied to the current model, maximizing the signal-to-noise ratio with respect to the learning rule parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ leaves them dependent on the parameter c (Palm 1988b). However, the mean of the noise $n_j(w)$ in equation 1 is biased by the exclusion of the contribution $\Delta_{ij}(w)$, whose value depends on the target output for pattern w; and the noise terms for two different patterns $w_1$ and $w_2$ are in general correlated through the R - 2 contributions to the value of $\Delta_{ij}(w)$, which occur in both terms. Our analysis (Fig. 2) takes account of these factors, and its validity is confirmed by the results of computer simulation (Table 1). Maximizing the expression we obtain for the signal-to-noise ratio in terms of the learning parameters leads to the three c-independent rules, R1, R2, and R3. To within a multiplicative constant they are
Rule R1 is a generalization of the Hebb rule, called the covariance rule (Sejnowski 1977a; Sejnowski et al. 1988; Linsker 1986). In this formulation, the synaptic efficacy between any two cells is changed according to the product of the deviation of each cell's activity from the mean. When pattern components are equally likely to be in the active and the inactive states (p = r = 1/2), R1 takes the form of the "Hopfield" rule (Hopfield 1982), and has the lowest signal-to-noise ratio of all such rules. Rule R1 prescribes changes in efficacy for all of the four possible states of activity seen at an individual synapse, and thus utilizes both heterosynaptic and homosynaptic asynchrony. It also has the biologically undesirable property that changes can occur when neither pre- nor postsynaptic neuron is
[Figure 2 panel labels: Low Mean, High Mean.]
Figure 2: Signal-to-noise ratios. The frequency graph of the linear combinations d(w) for a given LTU. The two classes to be distinguished appear as approximately Gaussian distributions, with high mean $\mu_h$, low mean $\mu_l$, and variances $\sigma_h^2$, $\sigma_l^2$, where $\sigma_h^2 \approx \sigma_l^2$. For good discrimination the two distributions should be narrow and widely separated. Our calculation of the signal-to-noise ratio takes two factors into account. First, the mean of the noise n(w) (equation 1) differs for high and low patterns, and so the expressions for the expected values of $\mu_h$ and $\mu_l$ were calculated separately. Second, the correlations between the noise terms obscuring different patterns add an extra quantity to the variance of the total noise. The entire graph of the frequency distributions for high and low patterns is displaced from the expected location, by a different amount for each unit. This overall displacement does not affect the power of the unit to discriminate between patterns. In calculating the signal-to-noise ratio, it is therefore appropriate to calculate the expected dispersion of the noise about the mean for each unit, rather than using the variance, which would imply measuring deviations from the expected mean. The expected dispersion for high patterns, $s_h^2$, is defined over the H patterns w for which B(w) = 1, and $s_l^2$ is defined similarly as the expected dispersion for low patterns. The signal-to-noise ratio for a discriminator is therefore defined as
$$ \rho = \frac{(E[\mu_h - \mu_l])^2}{\tfrac{1}{2}(s_h^2 + s_l^2)}. $$
It depends on all the parameters of the system, and may be maximized with respect to those that define the learning rule, $\alpha$, $\beta$, $\gamma$, and $\delta$. The maxima are found at the rules R1, R2, and R3 described in the text. The effect of changing c is to shift and compress or expand the distributions. For a given LTU, it is always possible to move the threshold with c in such a way that exactly the same errors are made (Table 1a). The choice of c partly determines the variability of $\mu_h$ and $\mu_l$ across the set of units, and this variability is minimized at c = -p/(1 - p). With this value of c, and in the limit of large m, n, and R, its effect becomes negligible, and hence the thresholds for all the units may be set equal.
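The link between the signal-to-noise ratio and recall errors can be made concrete with a small sketch. It assumes, following the figure, two Gaussian classes of equal variance and equal size with the threshold placed midway between the means, so that a unit errs with probability Phi(-sqrt(rho)/2). The function names are illustrative, and the factor of one half in the denominator is an inference from the tabulated values rather than something this copy states legibly.

```python
import math

def snr(mu_high, mu_low, s2_high, s2_low):
    """Squared separation of the class means over the mean dispersion (assumed form)."""
    return (mu_high - mu_low) ** 2 / (0.5 * (s2_high + s2_low))

def error_rate(rho):
    """Per-unit error probability for equal-variance Gaussians, midpoint threshold."""
    z = math.sqrt(rho) / 2.0          # distance from each mean to the threshold, in SDs
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # standard normal tail beyond z
```

Under these assumptions `error_rate(16)` is about 0.023, in line with the requirement quoted in the Table 1 caption that good recall (< 0.03 errors per unit) needs an S/N ratio of at least 16. For the balanced case p = r = 1/2 with 20 output units, `20 * error_rate(10)` is about 1.1 errors per pattern. The estimate is not expected to hold for sparse patterns, where the two classes differ in size.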
active ($\alpha \neq 0$). However, the change to be applied in the absence of any activity can be regarded as a constant background term of magnitude pr. In rule R2, the so-called Stent-Singer rule (Stent 1973; Rauschecker and Singer 1979), depression is purely heterosynaptic. For a given number of stored associations, the signal-to-noise ratio for R2 is less than that for R1 by a factor of 1/(1 - r). In rule R3, which Stanton and Sejnowski (1989) proposed for the mossy fibers in the hippocampal CA3 region, and which is also used in theoretical schemes (Kanerva 1988), depression is purely
homosynaptic. R3 has a signal-to-noise ratio less than R1 by a factor of 1/(1 - p). If p = r, R2 and R3 have the same signal-to-noise ratio. All the rules have the automatic property that the expected value of each weight is 0; that is, what goes up does indeed come down. One way of implementing this property that avoids the necessity of synapses switching between excitatory and inhibitory states is to assign each synapse a constant positive amount of synaptic efficacy initially. Our results do not apply exactly to this case, but an informal argument suggests that initial synaptic values should be chosen so as to keep the total synaptic efficacy as small as possible, without any value going negative. Given that it is likely that the level of activity in the nervous system is relatively low (< 10%), it is predicted that the amount of (homosynaptic) long-term potentiation (Bliss and Lømo 1973) per nerve cell will be an order of magnitude greater than the amount of either homosynaptic or heterosynaptic depression. Further, under R1, any experimental technique for investigating long-term depression that relies on the aggregate effect on one postsynaptic cell of such sparse activity will find a larger heterosynaptic than homosynaptic effect. As for the Hopfield case (Willshaw 1971; Kohonen 1972; Hopfield 1982), for a given criterion of error (as specified by the signal-to-noise ratio) the number of associations that may be stored is proportional to the size, m, of the network. It is often noted (Willshaw et al. 1969; Amit et al. 1987; Gardner 1987; Tsodyks and Feigel'man 1988) that the sparser the coding of information (i.e., the lower the probability of a unit being active) the more efficient is the storage and retrieval of information.
This is also true for rules R1, R2, and R3, but the information efficiency of the matrix memory, measured as the ratio of the number of bits stored as associations to the number of bits required to represent the weights, is always less than in similar memories incorporating clipped synapses (Willshaw et al. 1969), that is, ones having limited dynamic range. The signal-to-noise ratio measures only the potential of an LTU to recall correctly the associations it has learned. By contrast, the threshold $\theta_j$ determines the actual frequency of occurrence of the two possible types of misclassification. The threshold may be set according to some further optimality criterion, such as minimizing the expected number of recall errors for a pattern. For a given LTU, the optimal value of $\theta_j$ will depend directly on the actual associations it has learned rather than just on the parameters generating the patterns, which means that each LTU should have a different threshold. It can be shown that, as m, n, and R grow large, setting c at the value -p/(1 - p) enables the thresholds of all the LTUs to be set equal (and dependent only on the parameters, not the actual patterns) without introducing additional error. Although natural processing is by no means constrained to follow an optimal path, it is important to understand the computational consequences of suggested synaptic mechanisms. The signal-to-noise ratio
1a
p, r   c      Expected S/N   Actual S/N ± σ   Previous S/N   Expected errors   Actual errors
0.5    -1     10             11 ± 1.3         10             1.1               1.1
0.4    -1     7.5            8.3 ± 1.5        10             1.6               1.7
0.3    -1     1.4            1.3 ± 0.40       11             4.5               4.6
0.2    -1     0.25           0.32 ± 0.22      12             4.2               4.0
0.5    -1     10             11 ± 1.3                        1.1               1.1
0.5    -0.5   10             11 ± 1.3                        1.1               1.1
0.5    0      10             11 ± 1.3                        1.1               1.1
0.5    0.5    10             11 ± 1.3                        1.1               1.1

1b
p, r   c      Expected S/N   Actual S/N ± σ   Previous S/N   Expected errors   Actual errors
0.5    0      0.05           0.10 ± 0.11      6.8            9.1               8.7
0.4    0      0.11           0.11 ± 0.09      7.6            7.8               7.6
0.3    0      0.31           0.34 ± 0.15      9.4            5.8               5.9
0.2    0      1.1            1.2 ± 0.47       13             3.6               3.4
0.1    0      5.9            5.3 ± 1.8        26             0.92              1.2
0.05   0      16             28 ± 18          51             0.16              0.15

1c
p, r   R1     R2, R3   Hebb    Hopfield
0.5    10     5.1      0.050   10
0.4    11     6.4      0.11    7.5
0.3    12     8.5      0.31    1.4
0.2    16     13       1.1     0.25
0.1    28     26       5.9     0.045
0.05   54     51       16      0.015
Table 1: Simulations. The object of the simulations was to check the formulae developed in our analysis and compare them with a previous derivation (Palm 1988b). The matrix memory has m = 512 input lines and n = 20 output lines. To ensure noticeable error rates, the number of pattern pairs was set at R = 200. In all cases p = r.
1a: The Hopfield (1982) rule $(\alpha, \beta, \gamma, \delta) = (1, -1, -1, 1)$. Columns 3 and 4 compare the S/N ratio expected from our analysis and that measured in the simulation, the latter also showing the standard error measured over the output units; column 5 gives the S/N ratio calculated on the basis of previous analysis (Palm 1988b). Columns 6 and 7 compare the expected and measured numbers of errors per pattern, the threshold being set so that the two possible types of error occurred with equal frequency. For good recall (< 0.03 errors per unit) the S/N ratio must be at least 16. The lack of dependence on the value of c is demonstrated in rows 5-8. The same patterns were used in each case.
1b: Similar results for the nonoptimal Hebb (1949) rule $(\alpha, \beta, \gamma, \delta) = (0, 0, 0, 1)$.
1c: Values of the signal-to-noise ratio for the rules R1, R2, and R3 and the Hebb and the Hopfield rules. R1 has higher signal-to-noise ratio than R2 and R3, but for the latter two it is the same since p = r here. The Hebb rule approaches optimality in the limit of sparse coding; conversely, the Hopfield rule is optimal at p = r = 1/2.
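The balancing property of the three rules can be checked numerically under one reading of them. The parameter tuples below are assumptions inferred from the prose (a covariance form for R1, purely heterosynaptic depression for R2, purely homosynaptic depression for R3, each fixed up to a constant by requiring zero expected weight change); they are not transcribed from the paper, and the ordering of the two middle parameters in particular is hypothetical.

```python
def rules(p, r):
    """Candidate (alpha, beta, gamma, delta) for R1-R3, up to a constant.

    ASSUMED state order: (pre, post) = (off, off), (off, on), (on, off), (on, on).
    """
    return {
        "R1": (p * r, -p * (1 - r), -(1 - p) * r, (1 - p) * (1 - r)),
        "R2": (0.0, -p, 0.0, 1 - p),   # depression only when post is on, pre is off
        "R3": (0.0, 0.0, -r, 1 - r),   # depression only when pre is on, post is off
    }

def expected_change(rule, p, r):
    """Mean weight change per stored pair, weighting each state by its probability."""
    probs = ((1 - p) * (1 - r), (1 - p) * r, p * (1 - r), p * r)
    return sum(q * x for q, x in zip(probs, rule))
```

With these tuples `expected_change` is zero for all three rules at any p and r, and at p = r = 1/2 the R1 tuple is proportional to the Hopfield rule (1, -1, -1, 1). As a rough cross-check on Table 1c, multiplying the R1 column by (1 - r) gives 5.0, 6.6, 8.4, 12.8, 25.2, and 51.3, close to the tabulated R2, R3 values.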
indicates how good a linear threshold unit may be at its discrimination task, and consequently how much information can be stored by a network of a number of such units. Synaptic depression is important for computational reasons, independent of any role it might play in preventing saturation of synaptic strengths. Up to a multiplicative constant, only three learning rules maximize the signal-to-noise ratio. Each rule involves both decreases and increases in the values of the weights. One rule involves heterosynaptic depression, another involves homosynaptic depression, and in the third rule there is both homosynaptic and heterosynaptic depression. All rules work most efficiently when the patterns of neural activity are sparsely coded.
Acknowledgments
We thank colleagues, particularly R. Morris, T. Bliss, P. Hancock, A. Gardner-Medwin, and M. Evans, for their helpful comments and criticisms on an earlier draft. This research was supported by grants from the MRC and the SERC.
References
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1987. Information storage in neural networks with low levels of activity. Phys. Rev. A 35, 2293-2303.
Anderson, J. A. 1968. A memory storage model utilizing spatial correlation functions. Kybernetik 5, 113-119.
Bienenstock, E., Cooper, L. N., and Munro, P. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Bliss, T. V. P. and Lømo, T. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. (London) 232, 331-356.
Collingridge, G. L., Kehl, S. J., and McLennan, H. J. 1983. Excitatory amino acids in synaptic transmission in the Schaffer collateral-commissural pathway of the rat hippocampus. J. Physiol. (London) 334, 33-46.
Eccles, J. C., Ito, M., and Szentágothai, J. 1968. The Cerebellum as a Neuronal Machine. Springer Verlag, Berlin.
Gardner, E. 1987. Maximum storage capacity of neural networks. Europhys. Lett. 4, 481-485.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kanerva, P. 1988. Sparse Distributed Memory. MIT Press/Bradford Books, Cambridge, MA.
Kohonen, T. 1972. Correlation matrix memories. IEEE Trans. Comput. C-21, 353-359.
Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.
Marr, D. 1971. Simple memory: A theory for archicortex. Phil. Trans. R. Soc. London B 262, 23-81.
Morris, R. G. M., Anderson, E., Baudry, M., and Lynch, G. S. 1986. Selective impairment of learning and blockade of long term potentiation in vivo by AP5, an NMDA antagonist. Nature (London) 319, 774-776.
Morris, R. G. M. and Willshaw, D. J. 1989. Must what goes up come down? Nature (London) 339, 175-176.
Palm, G. 1988a. On the asymptotic information storage capacity of neural networks. In Neural Computers, R. Eckmiller and C. von der Malsburg, eds. NATO ASI Series F41, pp. 271-280. Springer Verlag, Berlin.
Palm, G. 1988b. Local synaptic rules with maximal information storage capacity. In Neural and Synergetic Computers, Springer Series in Synergetics, H. Haken, ed., Vol. 42, pp. 100-110. Springer-Verlag, Berlin.
Rauschecker, J. P. and Singer, W. 1979. Changes in the circuitry of the kitten's visual cortex are gated by postsynaptic activity. Nature (London) 280, 58-60.
Sejnowski, T. J. 1977a. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Sejnowski, T. J. 1977b. Statistical constraints on synaptic plasticity. J. Theor. Biol. 69, 385-389.
Sejnowski, T. J., Chattarji, S., and Stanton, P. 1988. Induction of synaptic plasticity by Hebbian covariance. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 105-124. Addison-Wesley, Wokingham, England.
Singer, W. 1985. Activity-dependent self-organization of synaptic connections as a substrate of learning. In The Neural and Molecular Bases of Learning, J. P. Changeux and M. Konishi, eds., pp. 301-335. Wiley, New York.
Stanton, P. and Sejnowski, T. J. 1989. Associative long-term depression in the hippocampus: Induction of synaptic plasticity by Hebbian covariance. Nature (London) 339, 215-218.
Stent, G. S. 1973. A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. U.S.A. 70, 997-1001.
Tsodyks, M. V. and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Willshaw, D. J. 1971. Models of Distributed Associative Memory. Ph.D. Thesis, University of Edinburgh.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Nonholographic associative memory. Nature (London) 222, 960-962.
Received 28 November 1989; accepted 19 December 1989.
Communicated by Terrence J. Sejnowski
Pattern Segmentation in Associative Memory
DeLiang Wang
Joachim Buhmann
Computer Science Department, University of Southern California, University Park, Los Angeles, CA 90089-0782 USA
Christoph von der Malsburg
Program in Neural, Informational, and Behavioral Sciences, University of Southern California, University Park, Los Angeles, CA 90089-0782 USA
The goal of this paper is to show how to modify associative memory such that it can discriminate several stored patterns in a composite input and represent them simultaneously. Segmentation of patterns takes place in the temporal domain, components of one pattern becoming temporally correlated with each other and anticorrelated with the components of all other patterns. Correlations are created naturally by the usual associative connections. In our simulations, temporal patterns take the form of oscillatory bursts of activity. Model oscillators consist of pairs of local cell populations connected appropriately. Transition of activity from one pattern to another is induced by delayed self-inhibition or simply by noise.
1 Introduction
Associative memory (Steinbuch 1961; Willshaw et al. 1969; Hopfield 1982) is an attractive model for long-term as well as short-term memory. The Hopfield formulation (Hopfield 1982) in particular provides, for both levels, a clear definition of data structure and mechanism of organization. The data structure of long-term memory has the form of synaptic weights for the connections between neurons, and memory traces are laid down with the help of Hebbian plasticity. On the short-term memory level the data structure has the form of stationary patterns of neural activity, and these patterns are organized and stabilized by the exchange of excitation and inhibition. Since in this formulation short-term memory states are dynamic attractor states, one speaks of attractor neural networks. Neurons are interpreted as elementary symbols, and attractor states acquire their symbolic meaning as an unstructured sum of individual symbolic contributions of active neurons. The great virtue of associative memory
Neural Computation 2, 94-106 (1990)
© 1990 Massachusetts Institute of Technology
Pattern Segmentation in Associative Memory
95
is its ability to restore incomplete or corrupted input patterns, that is, its ability to generalize over Hamming distance (the number of bits missing or added). Let us just mention here, since it becomes relevant later, that associative memory can be formulated such that attractors correspond to oscillatory activity vectors instead of stationary ones (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988). Associative memory, taken as a model for functions of the brain, is severely limited in its applicability by a particular weakness - its low power of generalization. This is a direct consequence of the fact that associative memory treats memory traces essentially as monolithic entities. An obvious and indispensable tool for generalization in any system must be the decomposition of complex patterns into functional components and their later use in new combinations. A visual scene is almost always composed of a number of subpatterns, corresponding to coherent objects that are very likely to reappear in different combinations in other scenes (or the same scene under a different perspective and thus in different spatial relations to each other). Associative memory is not equipped for this type of generalization, as has been pointed out before (von der Malsburg 1981, 1983, 1987). It treats any complex pattern as a synthetic whole, glues all pairs of features together, and recovers either the whole pattern or nothing of it. Two different arrangements of the same components cannot be recognized as related and have to be stored separately. There is no generalization from one scene to another, even if they are composed of the same objects. Since complex scenes never recur, a nervous system based on the associative memory mechanism alone possesses little ability to learn from experience. This situation is not specific to vision. Our auditory world is typified by complex sound fields that are composed of sound streams corresponding to independent sources. 
Take as an example the cocktail party phenomenon where we are exposed to several voices of people who talk at the same time. It would be useless to try to store and retrieve the combinations of sounds heard simultaneously from different speakers. Instead, it is necessary to separate the sound streams from each other and store and access them separately. Similar situations characterize other modalities and especially all higher levels of cognitive processing. The basis for the type of generalization discussed here is the specific and all-pervasive property of our world of being causally segmented into strongly cohesive chunks of structure that are associated with each other into more loose and varying combinations. There are two attitudes which an advocate of associative memory could take in response to this evident weakness. One is to see it as a component in a more complex system. The system has other mechanisms and subsystems to analyze and create complex scenes composed of rigid subpatterns that can individually be stored and retrieved in associative memory. The other attitude tries to build on the strengths of associative memory as a candidate cognitive architecture and tries to modify the
model so as to incorporate the ability to segment complex input patterns into subobjects and to compose synthetic scenes from stored objects. We subscribe to this second attitude in this paper. There are three issues that we have to address. The first concerns the type of information on the basis of which pattern segmentation can be performed; second, the data structure of associative memory and attractor neural networks has to be modified by the introduction of variables that express syntactical¹ binding; and third, mechanisms have to be found to organize these variables into useful patterns.
There are various potential sources of information relevant to segmentation. In highly structured sensory spaces, especially vision and audition, there are general laws of perceptual grouping, based on "common fate" (same pattern of movement, same temporal history), continuity of perceptual quality (texture, depth, harmonic structure), spatial contiguity, and the like. These laws of grouping have been particularly developed in the Gestalt tradition. At the other end of the spectrum, segmentation of complex patterns can be performed by just finding subpatterns that have previously been stored in memory. Our paper here will be based on this memory-dominated type of segmentation.
Regarding an appropriate data structure to encode syntactical binding, the old proposal of introducing more neurons (e.g., a grandmother-cell to express the binding of all features that make up a complex pattern) is not a solution (von der Malsburg 1987) and produces many problems of its own. It certainly is useful to have cells that encode high-level objects, but the existence of these cells just creates more binding problems, and their development is difficult and time-consuming. We work here on the assumption (von der Malsburg 1981, 1983, 1987; von der Malsburg and Schneider 1986; Gray et al. 1989; Eckhorn et al.
1988; Damasio 1989; Strong and Whitehead 1989; Schneider 1986) that syntactical binding is expressed by temporal correlations between neural signals. The scheme requires temporally structured neural signals. A set of neurons is syntactically linked by correlating their signals in time. Two neurons whose signals are not correlated or are even anticorrelated express thereby the fact that they are not syntactically bound. There are first experimental observations to support this idea (Gray et al. 1989; Eckhorn et al. 1988). It may be worth noting that in general the temporal correlations relevant here are spontaneously created within the network and correspondingly are not stimulus-locked.
As to the issue of how to organize the correlations necessary to express syntactical relationships, the natural mechanism for creating correlations and anticorrelations in attractor neural networks is the exchange of excitation and inhibition. A pair of neurons that is likely to be part of one segment is coupled with an excitatory link. Two neurons that do not belong to the same segment inhibit each other. The neural dynamics will produce activity patterns that minimize contradictions between conflicting constraints. This capability of sensory segmentation has been demonstrated by a network that expresses general grouping information (von der Malsburg and Schneider 1986; Schneider 1986).
¹We use the word syntactical structure in its original sense of arranging together, that is, grouping or binding together, and do not intend to refer to any specific grammatical or logical rule system.
The system we are proposing here is based on associative memory, and performs segmentation exclusively with the help of the memory-dominated mechanism. Our version of associative memory is formulated in a way to support attractor limit cycles (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988): If a stationary pattern is presented in the input that resembles one of the stored patterns, then the network settles after some transients into an oscillatory mode. Those neurons that have to be active in the pattern oscillate in phase with each other, whereas all other neurons are silent. In this mode of operation the network has all the traditional capabilities of associative memory, especially pattern completion. When a composite input is presented that consists of the superposition of a few patterns, the network settles into an oscillatory mode such that time is divided into periods in which just a single stored state is active. Each period corresponds to one of the patterns contained in the input. Thus, the activity of the network expresses the separate recognition of the individual components of the input and represents those patterns in a way avoiding confusion. This latter capability was not present in previous formulations of associative memory. The necessary couplings between neurons to induce correlations and anticorrelations are precisely those created by Hebbian plasticity.
Several types of temporal structure are conceivable as a basis for this mode of syntactical binding. At one end of a spectrum there are regular oscillations, in which case states would be distinguished by different phase or frequency. At the other end of the spectrum there are chaotic activity patterns (Buhmann 1989). The type of activity we have chosen to simulate here is intermediate between those extremes, being composed of intermittent bursts of oscillations (see Fig. 2), a common phenomenon in the nervous system at all levels.
2 Two Coupled Oscillators
A single oscillator i, the building block of the proposed associative memory, is modeled as a feedback loop between a group of excitatory neurons and a group of inhibitory neurons. The average activity $x_i$ of excitatory group i and the activity $y_i$ of inhibitory group i evolve according to
[Equations 2.1-2.3 are not legible in this copy.]
where $\tau_x$ and $\tau_y$ are the time constants of the excitatory and inhibitory components of the oscillator. An appropriate choice of $\tau_x$, $\tau_y$ allows us to relate the oscillator time to a physiological time scale. $G_x$ and $G_y$ are sigmoid gain functions, which in our simulations have the form
[Equation not legible in this copy.]
with thresholds $\theta_x$ or $\theta_y$ and gain parameters $1/\lambda_x$ and $1/\lambda_y$. For the reaction of inhibitory groups on excitatory groups we have introduced the nonlinear function $F(x) = (1 - \eta)x + \eta x^2$, $(0 \le \eta \le 1)$, where $\eta$ parameterizes the degree of quadratic nonlinearity. This nonlinearity proved to be useful in making oscillatory behavior a more robust phenomenon in the network, so that in spite of changes in excitatory gain (with varying numbers of groups in a pattern) the qualitative character of the phase portrait of the oscillators is invariant. $H_i$ in equation 2.3 describes delayed self-inhibition of strength $\alpha$ and decay constant $\beta$. This is important to generate intermittent bursting. The synaptic strengths of the oscillators' feedback loop are $T_{rs}$, $r, s \in \{x, y\}$. Equations 2.1 and 2.2 can be interpreted as a mean field approximation to a network of excitatory and inhibitory logical neurons (Buhmann 1989). Notice that $x_i$, $y_i$ are restricted to $[0, \tau_x]$ and $[0, \tau_y]$, respectively. The parameters $\bar{x}$ and $\bar{y}$ may be used to control the average values of x and y. In addition to the interaction between $x_i$ and $y_i$, an excitatory unit $x_i$ receives time-dependent external input $I_i(t)$ from a sensory area or from other networks, and internal input $S_i(t)$ from other oscillators.
Let us examine two oscillators of type 2.1-2.3, coupled by associative connections $W_{12}$, $W_{21}$ as shown schematically in Figure 1. The associative interaction is given by
$$S_1(t) = W_{12}\, x_2(t); \qquad S_2(t) = W_{21}\, x_1(t)$$
Two cases can be distinguished by the sign of the associative synapses. If both synapses are excitatory ($W_{12} > 0$, $W_{21} > 0$) the two oscillators try to oscillate in step, interrupted by short periods of silence due to delayed self-inhibition. A simulation of this case is shown in Figure 2a. The degree of synchronization can be quantified by measuring the correlation
$$C(1,2) = \frac{\langle x_1 x_2 \rangle - \langle x_1 \rangle \langle x_2 \rangle}{\Delta x_1 \, \Delta x_2}$$
between the two oscillators, $\Delta x_i$ being the variance of $x_i$. For the simulation shown in Figure 2a we measured C(1,2) = 0.99, which indicates almost complete phase locking. The second case, mutual inhibition between the oscillators ($W_{12} < 0$, $W_{21} < 0$), is shown in Figure 2b. The two oscillators now avoid each other, which is reflected by C(1,2) = -0.57.
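The two-oscillator experiment and the correlation measure can be illustrated with a short Python sketch. It is not the authors' implementation: the displayed forms of equations 2.1-2.3 and of the gain function are not available in this copy, so the sketch assumes a Wilson-Cowan-type form, a logistic gain, and a self-inhibition variable obeying dH/dt = αx - βH, chosen only to agree with the properties stated in the text (activities confined to [0, τ], strength α, decay constant β). The parameters $\bar{x}$ and $\bar{y}$ are not used. Parameter values follow the caption of Figure 2a, with the assignment of the four T values to subscripts assumed; the correlation is normalized by standard deviations so that it lies in [-1, 1]. All function names are hypothetical.

```python
import math

# Values from the Figure 2a caption; which T value carries which
# subscript is an assumption.
P = dict(tau_x=0.9, tau_y=1.0, T_xx=1.0, T_xy=1.9, T_yx=1.3, T_yy=1.2,
         eta=0.4, lam_x=0.05, lam_y=0.05, theta_x=0.4, theta_y=0.6,
         alpha=0.2, beta=0.14, I=0.2, W=2.5)

def F(v, eta):
    """Nonlinearity F(x) = (1 - eta) x + eta x^2 given in the text."""
    return (1.0 - eta) * v + eta * v * v

def G(v, theta, lam):
    """Assumed logistic gain with threshold theta and gain 1/lam."""
    z = max(-60.0, min(60.0, (v - theta) / lam))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def simulate(p, steps=14000, dt=0.01, x0=(0.0, 0.2)):
    """Euler integration of two coupled oscillators (assumed equations)."""
    x, y, h = list(x0), [0.0, 0.0], [0.0, 0.0]
    traces = ([], [])
    for _ in range(steps):
        s = (p["W"] * x[1], p["W"] * x[0])  # S_1 = W_12 x_2, S_2 = W_21 x_1
        nxt = []
        for i in range(2):
            drive_x = (p["T_xx"] * x[i] - p["T_xy"] * F(y[i], p["eta"])
                       + s[i] + p["I"] - h[i])
            drive_y = p["T_yx"] * x[i] - p["T_yy"] * y[i]
            dx = -x[i] / p["tau_x"] + G(drive_x, p["theta_x"], p["lam_x"])
            dy = -y[i] / p["tau_y"] + G(drive_y, p["theta_y"], p["lam_y"])
            dh = p["alpha"] * x[i] - p["beta"] * h[i]  # delayed self-inhibition
            nxt.append((x[i] + dt * dx, y[i] + dt * dy, h[i] + dt * dh))
        for i in range(2):
            x[i], y[i], h[i] = nxt[i]
            traces[i].append(x[i])
    return traces

def correlation(a, b):
    """(<ab> - <a><b>) / (sd_a * sd_b) for two equally long time series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum(u * v for u, v in zip(a, b)) / n - ma * mb
    sd_a = math.sqrt(sum((u - ma) ** 2 for u in a) / n)
    sd_b = math.sqrt(sum((v - mb) ** 2 for v in b) / n)
    return cov / (sd_a * sd_b)
```

Calling `correlation(*simulate(P))` gives the analogue of C(1,2) for the assumed dynamics. Because the equations are assumed, no agreement with the reported values 0.99 and -0.57 is claimed.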
Alternatively, both oscillators could be continuously active but oscillate out of phase, with a 180° phase shift. That mode has been simulated successfully for the case of two oscillators and might be applied to segmentation of an object from its background; for more than two oscillators with mutual inhibition, phase avoidance behavior turns out to be difficult to achieve.
Figure 1: Diagram of two mutually connected oscillators. (Legend: excitatory, inhibitory, and associative connections.)
3 Segmentation in Associative Memory
Figure 2: (a) Simulated output pattern of two mutually excitatory oscillators. The parameter values for the two oscillators are the same, $\tau_x = 0.9$, $\tau_y = 1.0$, $T_{xx} = 1.0$, $T_{xy} = 1.9$, $T_{yx} = 1.3$, $T_{yy} = 1.2$, $\eta = 0.4$, $\lambda_x = \lambda_y = 0.05$, $\theta_x = 0.4$, $\theta_y = 0.6$, $I_1 = I_2 = 0.2$, $\alpha = 0.2$, $\beta = 0.14$, $\bar{x} = \bar{y} = 0.2$, $W_{12} = W_{21} = 2.5$. Initial values: $x_1(0) = 0.0$, $x_2(0) = 0.2$, $y_1(0) = y_2(0) = 0.0$. The equations have been integrated with the Euler method, $\Delta t = 0.01$, 14,000 integration steps. (b) Simulated output pattern of two mutually inhibitory oscillators. All parameters are the same as in (a), except that $W_{12} = W_{21} = -0.84$, $\alpha = 0.1$, $\beta = 0.26$.
After this demonstration of principle we will now test the associative capabilities of a network of N oscillators connected by Hebbian rules. We store p sparsely coded, random N-bit words $\xi^\nu = \{\xi_i^\nu\}_{i=1}^{N}$, with pattern index $\nu = 1, \ldots, p$. The probability that a bit equals 1 is a, that is, $P(\xi_i^\nu) = a\,\delta(\xi_i^\nu - 1) + (1 - a)\,\delta(\xi_i^\nu)$ with typically a < 0.2. The synapses are chosen according to the Hebbian rule
[Equation 3.1 is not legible in this copy.]
With connectivity 3.1, oscillator i receives input $S_i(t) = \sum_{k \ne i} W_{ik}\, x_k(t)$ from other oscillators. In the following simulation, the network consisted of 50 oscillators, and 8 patterns were stored in the memory. For simplicity we have chosen patterns of equal size (8 active units). The first three patterns, which will be presented to the network in the following simulation, have the form
$$
\begin{aligned}
\xi^1 &= (1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,\ldots,0) \\
\xi^2 &= (0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,\ldots,0) \\
\xi^3 &= (1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,\ldots,0)
\end{aligned}
\qquad (3.2)
$$
Notice the 25% mutual overlap among these 3 patterns and bits $\xi_{19}^\nu = 1$ for patterns $\nu = 1, 2, 3$. With this choice of stored patterns we have tested pattern recall and pattern completion after presentation of just one incomplete pattern, the fundamental capability of associative memory. The network restored the information missing from the fragment within one or two cycles. The same behavior had been demonstrated in Freeman et al. (1988).
A more intriguing dynamic behavior is shown by the network if we present all three patterns $\xi^1$, $\xi^2$, $\xi^3$ or parts of them simultaneously. In all simulations external input was time-independent, but similar results can be expected for time-dependent input as used in Li and Hopfield (1989). The result of a simulation is shown in Figure 3, where the input is a superposition of patterns $\xi^1$, $\xi^2$, $\xi^3$ with one bit missing in each pattern (see caption of Fig. 3). In this figure only the first 19 oscillators are monitored; the others stay silent due to lack of input and mutual inhibition among oscillators representing different patterns. All three patterns are recognized, completed, and recalled by the network. In addition to the capabilities of conventional associative memory, the network is able to segment patterns in time. The assembly of oscillators representing a single input pattern oscillates in a phase-locked way for several cycles. This period is followed by a state of very low activity, during which another assembly takes its turn to oscillate. In Figure 4 we have plotted the correlations between the first 19 oscillators. The oscillators in one pattern are highly correlated, that is, coactive and phase-locked, whereas oscillators representing different patterns are anticorrelated. Oscillators 1, 7, and 13, which belong to two patterns each, stay on for two periods. Oscillator 19, which belongs to all three active patterns, stays on all the time.
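The structure of the three test patterns and of the composite input can be checked with a few lines of Python. The sketch below builds the patterns of equation 3.2 for N = 50 and the input support listed in the caption of Figure 3; the helper names (`pattern`, `overlap`, `missing_bits`) are hypothetical and serve only this illustration.

```python
N = 50  # number of oscillators in the simulation

def pattern(active):
    """N-bit word with ones at the given 1-based positions."""
    return [1 if i + 1 in active else 0 for i in range(N)]

# Patterns of equation 3.2 (8 active units each).
xi1 = pattern(set(range(1, 8)) | {19})
xi2 = pattern(set(range(7, 14)) | {19})
xi3 = pattern(set(range(13, 20)) | {1})
patterns = [xi1, xi2, xi3]

# Support of the external input from the Figure 3 caption (I = 0.2 * support).
support = [1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1] + [0] * (N - 19)
external_input = [0.2 * b for b in support]

def overlap(a, b):
    """Number of units active in both patterns."""
    return sum(u & v for u, v in zip(a, b))

def missing_bits(xi):
    """1-based positions that are active in xi but absent from the input."""
    return [i + 1 for i in range(N) if xi[i] and not support[i]]

# Number of presented patterns each oscillator belongs to (1-based index).
membership = {i + 1: sum(xi[i] for xi in patterns) for i in range(N)}

for name, xi in zip(("xi1", "xi2", "xi3"), patterns):
    print(name, "active units:", sum(xi), "missing from input:", missing_bits(xi))
```

Each pair of patterns shares 2 of 8 active units, which is the 25% overlap mentioned in the text, and `membership` reproduces the statement that oscillators 1, 7, and 13 belong to two patterns each while oscillator 19 belongs to all three.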
According to a number of simulation experiments, results are rather stable with respect to variation of parameters. Switching between one pattern and another can be produced either by noise, or by delayed self-inhibition (the case shown here), or by a modulation of external input. A mixture of all three is likely to be biologically relevant. The case shown here is dominated by delayed self-inhibition and has a small admixture of noise. The noise-dominated case, which we have also simulated, has an irregular succession of states and takes longer to give each input state a chance. Delayed self-inhibition might also be used in a nonoscillatory associative memory to generate switching between several input patterns. Our simulations, however, indicate that limit cycles facilitate transitions and make them more reliable.
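The storage step described at the beginning of this section can be sketched as follows. The sparse random coding (each bit equal to 1 with probability a) and the internal input $S_i(t) = \sum_{k \ne i} W_{ik} x_k(t)$ are taken from the text. The weight rule is an assumption: equation 3.1 is not available in this copy, so a covariance-type Hebbian rule commonly used for sparse patterns stands in for it. Function names are hypothetical.

```python
import random

def random_patterns(p, n, a, rng):
    """p random n-bit words; each bit is 1 with probability a."""
    return [[1 if rng.random() < a else 0 for _ in range(n)] for _ in range(p)]

def hebbian_weights(patterns, a):
    """Assumed covariance rule (not equation 3.1 of the paper):
    W_ik = (1 / (a n)) * sum_nu (xi_i - a)(xi_k - a), with W_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if i != k:
                w[i][k] = sum((xi[i] - a) * (xi[k] - a) for xi in patterns) / (a * n)
    return w

def internal_input(w, x):
    """S_i = sum over k != i of W_ik x_k for the current activities x."""
    n = len(x)
    return [sum(w[i][k] * x[k] for k in range(n) if k != i) for i in range(n)]

# 8 patterns over 50 oscillators, as in the simulation; a = 8/50 matches
# the 8 active units per pattern only on average.
rng = random.Random(0)
stored = random_patterns(8, 50, 0.16, rng)
weights = hebbian_weights(stored, 0.16)
```

Under this assumed rule, units that are active together in a stored pattern excite each other and units belonging to different patterns inhibit each other, which is the qualitative coupling the text attributes to Hebbian plasticity.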
Figure 3: Simulation of an associative memory of 50 oscillators. Eight patterns have been stored in the memory and three of them, $\xi^1$, $\xi^2$, $\xi^3$ (equation 3.2), are presented in this simulation simultaneously with one bit missing in each pattern. Only the output of the first 19 oscillators is shown. The others stay silent due to lack of input. The vertical dashed lines identify three consecutive time intervals with exactly one pattern active in each interval. From the result we see that at any time instant only one pattern is dominant, while in the long run all patterns have an equal chance to be recalled due to switching among the patterns. The parameter values differing from Figure 2 are $T_{yy} = 1.0$, $\alpha = 0.17$, $\beta = 0.1$. We added uncorrelated white noise of amplitude 0.003 to the input to the excitatory groups. Initial value: $x = 0.2\,(1, \ldots, 1)$, $y = (0, \ldots, 0)$. Input: $I = 0.2 \times (1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,\ldots,0)$.
Figure 4: Correlation matrix between the first 19 oscillators (cf. Fig. 3). Filled and open circles stand for positive and negative correlations, respectively. The diameter of each circle is proportional to the absolute value of the correlation.
For conceptual reasons, only a limited number of states can be represented in response to a static input. A superposition of too many (more than perhaps 10) input states leads to ambiguity and the system responds with an irregular oscillation pattern. The exact number of entities that can be represented simultaneously depends on details of implementation, but a reasonable estimate seems to be the seven plus or minus two, often cited in the psychophysical literature as the number of objects that can be held in the human attention span.
4 Discussion
The point of this paper is the demonstration of a concept that allows us to compute and represent syntactical structure in a version of associative memory. Whereas in the attractor neural network view a valid state of short-term memory is a static activity distribution, we argue for a data structure based on the history of fluctuating neural signals observed over a brief time span (the time span often called "psychological moment") (Pöppel and Logothetis 1986). There is ample evidence for the existence of temporal signal structure in the brain on the relevant time scale (10-50 msec). Collective oscillations are of special relevance for our study here. They have been observed as local field potentials in several cortices (Gray et al. 1989; Eckhorn et al. 1988; Freeman 1978). The way we have modeled temporal signal structure, as bursts of collective oscillations, is just one possibility of many. Among the alternatives are continuous oscillations, which differ in phase or frequency between substates, and stochastic signal structure.
Is the model biologically relevant? Several reasons speak for its application to sensory segmentation in olfaction. A major difficulty in applying associative memory, whether in our version or the standard one, is its inability to deal with perceptual invariances (e.g., visual position invariance). This is due to the fact that the natural topology of associative memory is the Hamming distance, and not any structurally invariant relationship. In olfaction, Hamming distance seems to be the natural topology, and for this reason associative memory has been applied to this modality before (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988; Haberly and Bower 1989). Furthermore, in the simple model for segmentation we have presented here, this ability relies completely on previous knowledge of possible input patterns.
In most sensory modalities general structure of the perceptual space plays an additional important role for segmentation, except in olfaction, as far as we know. Finally, due to a tradition probably started by Walter Freeman, temporal signal structure has been well studied experimentally (Freeman 1978; Haberly and Bower 1989), and has been modeled with the help of nonlinear differential equations (Baird 1986; Freeman et al. 1988; Haberly and Bower 1989). There are also solid psychophysical data on pattern segmentation in olfaction (Laing et al. 1984; Laing and Frances 1989). It is widely recognized that any new mixture of odors is perceived as a unit; but if components of a complex (approximately balanced) odor mixture are known in advance, they can be discriminated, in agreement with the model presented here. When one of the two odors dominates the other in a binary mixture, only the stronger of the two is perceived (Laing et al. 1984), a behavior we also observed in our model. How can associative memory, of the conventional kind or ours, be identified in the anatomy (Shepherd 1979; Luskin and Price 1983) of the
olfactory system of mammals? In piriform cortex, pyramidal cells on the one hand and inhibitory interneurons on the other would be natural candidates for forming our excitatory and inhibitory groups of cells. They would be coupled by associative fibers within piriform cortex. Signals in stimulated olfactory cortex are oscillatory in nature (in a frequency range of 40-60 Hz) (Freeman 1978) and therefore lend themselves to this interpretation. On the other hand, also the olfactory bulb has appropriately connected populations of excitatory (mitral cells) and inhibitory (granule cells) neurons, which also undergo oscillations in the same frequency range and possibly in phase with cortical oscillations. The two populations are coupled by the lateral and medial olfactory tract in a diffuse, nontopographically ordered way. Thus a more involved implementation of associative memory in the coupled olfactory bulb-piriform cortex system is also conceivable. Our model makes the following theoretical prediction. If the animal is stimulated with a mixture of a few odors known to the animal, then it should be possible to decompose local field potentials from piriform cortex into several coherent components with zero or negative mutual correlation.
Acknowledgments
This work was supported by the Air Force Office of Scientific Research (88-0274). J. B. was a recipient of a NATO Postdoctoral Fellowship (DAAD 300/402/513/9). D. L. W. acknowledges support from an NIH grant (1R01 NS 24926, M.A. Arbib, PI).
References
Baird, B. 1986. Nonlinear dynamics of pattern formation and pattern recognition in the rabbit olfactory bulb. Physica D 22, 150-175.
Buhmann, J. 1989. Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A 40, 4145-4148.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybernet. 60, 121-130.
Freeman, W. J. 1978. Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol. 44, 586-605.
Freeman, W. J., Yao, Y., and Burke, B. 1988. Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks 1, 277-288.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Haberly, L. B. and Bower, J. M. 1989. Olfactory cortex: Model circuit for study of associative memory? Trends Neurosci. 12, 258-264.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Laing, D. G. and Frances, G. W. 1989. The capacity of humans to identify odors in mixtures. Physiol. Behav. 46, 809-814.
Laing, D. G., Panhuber, H., Willcox, M. E., and Pittman, E. A. 1984. Quality and intensity of binary odor mixtures. Physiol. Behav. 33, 309-319.
Li, Z. and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybernet. 61, 379-392.
Luskin, M. B. and Price, J. L. 1983. The topographical organization of associational fibres of the olfactory system in the rat, including centrifugal fibres to the olfactory bulb. J. Comp. Neurol. 216, 264-291.
Pöppel, E., and Logothetis, N. 1986. Neuronal oscillations in the human brain. Naturwissenschaften 73, 267-268.
Schneider, W. 1986. Anwendung der Korrelationstheorie der Hirnfunktion auf das akustische Figur-Hintergrund-Problem (Cocktailparty-Effekt). Dissertation, University of Göttingen.
Shepherd, G. M. 1979. The Synaptic Organization of the Brain. Oxford University Press, New York.
Steinbuch, K. 1961. Die Lernmatrix. Kybernetik 1, 36-45.
Strong, G. W. and Whitehead, B. A. 1989. A solution to the tag-assignment problem for neural networks. Behav. Brain Sci. 12, 381-433.
von der Malsburg, C. 1981. The Correlation Theory of Brain Function. Internal Report 81-2, Abteilung für Neurobiologie, MPI für Biophysikalische Chemie, Göttingen.
von der Malsburg, C. 1983. How are nervous structures organized? In Synergetics of the Brain. Proceedings of the International Symposium on Synergetics, May 1983, E. Başar, H. Flohr, H. Haken, and A. J. Mandell, eds. Springer, Berlin, Heidelberg, pp. 238-249.
von der Malsburg, C. 1987. Synaptic plasticity as basis of brain organization. In The Neural and Molecular Bases of Learning, Dahlem Konferenzen, J.-P. Changeux and M. Konishi, eds. Wiley, Chichester, pp. 411-431.
von der Malsburg, C. and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Non-holographic associative memory. Nature (London) 222, 960-962.
Received 9 August 1989; accepted 9 January 1990.
Communicated by David S. Touretzky
A Neural Net Associative Memory for Real-Time Applications
Gregory L. Heileman
Department of Computer Engineering, University of Central Florida, Orlando, FL 32816 USA
George M. Papadourakis
Department of Computer Science, University of Crete, Iraklion, Crete, Greece
Michael Georgiopoulos
Department of Electrical Engineering, University of Central Florida, Orlando, FL 32816 USA
A parallel hardware implementation of the associative memory neural network introduced by Hopfield is described. The design utilizes the Geometric Arithmetic Parallel Processor (GAPP), a commercially available single-chip VLSI general-purpose array processor consisting of 72 processing elements. The ability to cascade these chips allows large arrays of processors to be easily constructed and used to implement the Hopfield network. The memory requirements and processing times of such arrays are analyzed based on the number of nodes in the network and the number of exemplar patterns. Compared with other digital implementations, this design yields significant improvements in runtime performance and offers the capability of using large neural network associative memories in real-time applications.
Neural Computation 2, 107-115 (1990) © 1990 Massachusetts Institute of Technology
1 Introduction
Data stored in an associative memory are accessed by their contents. This is in contrast to random-access memory (RAM), in which data items are accessed according to their address. The ability to retrieve data by association is a very powerful technique required in many high-volume information processing applications. For example, associative memory has been used to perform real-time radar tracking in an antiballistic missile environment. Associative memories have also been proposed for use in database applications, image processing, and computer vision. A major advantage that associative memory offers over RAM is the capability of rapidly retrieving data through the use of parallel search and comparison operations; however, this is achieved at some cost. The ability to search the contents
of a traditional associative memory in a fully parallel fashion requires the use of a substantial amount of hardware for control logic. Until recently, the high cost of implementing associative processors has mainly limited their use to special purpose military applications (Hwang and Briggs 1984). However, advances in VLSI technology have improved the feasibility of associative memory systems. The Hopfield neural network has demonstrated its potential as an associative memory (Hopfield 1982). The error correction capabilities of this network are quite powerful in that it is able to retrieve patterns from memory using noisy or partially complete input patterns. Komlós and Paturi (1988), among others, have recently performed an extensive analysis of this behavior as well as the convergence properties and memory capacity of the Hopfield network.
Due to the massive number of nodes and interconnections in large neural networks, real-time systems will require computational facilities capable of exploiting the inherent parallelism of neural network models. Two approaches to the parallel hardware implementation of neural networks have been utilized. The first involves the development of special-purpose hardware designed to specifically implement neural network models or certain classes of neural network models (Alspector et al. 1989; Kung and Hwang 1988). Although this approach has been shown to yield tremendous speedups when compared to sequential implementations, the specialized design limits the use of such computers to neural network applications and consequently limits their commercial availability. This is in contrast to the second approach to parallel hardware implementation, general-purpose parallel computers, which are designed to execute a variety of different applications. The fact that these computers are viable for solving a wide range of problems tends to increase their availability while decreasing their cost.
In this paper a direct, parallel, digital implementation of a Hopfield associative memory neural network is presented. The design utilizes the first general-purpose commercially produced array processor chip, the Geometric Arithmetic Parallel Processor (GAPP) developed by the NCR Corporation in conjunction with Martin Marietta Aerospace. Using these low-cost VLSI components, it is possible to build arbitrarily sized Hopfield networks with the capability of operating in real-time.
2 The GAPP Architecture
The GAPP chip is an inexpensive two-dimensional VLSI array processor that has been utilized in such applications as pattern recognition, image processing, and database management. Current versions of the GAPP operate at a 10-MHz clock cycle; however, future versions will utilize a 20-MHz clock cycle (Brown and Tomassi 1989). A single GAPP chip contains a mesh-connected 6 by 12 arrangement of processing elements
(PEs). Each PE contains a bit-serial ALU, 128 x 1 bits of RAM, and 4 single-bit latches, and is able to communicate with each of its four neighbors. GAPP chips can be cascaded to implement arbitrarily sized arrays of PEs (in multiples of 6 x 12). This capability can be used to eliminate bandwidth limitations inherent in von Neumann machines. For example, a 48 x 48 PE array (32 GAPP chips) can read a 48-bit-wide word every 100 nsec, yielding an effective array bandwidth of 480 Mbits/sec (Davis and Thomas 1988; NCR Corp. 1984). Information can be shifted into the GAPP chip from any edge. Therefore, the ability to shift external data into large GAPP arrays is limited only by the number of data bus lines available from the host processor. For example, Martin Marietta Aerospace is currently utilizing a 126,720 PE array (1760 GAPP chips) in image processing applications. This system is connected to a Motorola MC68020 host system via a standard 32-bit Multibus (Brown and Tomassi 1989).
3 The Hopfield Neural Network
The Hopfield neural network implemented here utilizes binary input patterns - example inputs are black and white images (where the input elements are pixel values), or ASCII text (where the input patterns are bits in the 8-bit ASCII representation). This network is capable of recalling one of M exemplar patterns when presented with an unknown N element binary input pattern. Typically, the unknown input pattern is one of the M exemplar patterns corrupted with noise (Lippmann 1987). The recollection process, presented in Figure 1, can be separated into two distinct phases. In the initialization phase, the M exemplar patterns are used to establish the $N^2$ deterministic connection weights, $t_{ij}$. In the search phase, an unknown N element input pattern is presented to the N nodes of the network. The node values are then multiplied by the connection weights to produce the new node values. These node values are then considered as the new input and altered again. This process continues to iterate until the input pattern converges.
4 Hopfield Network Implementation on the GAPP
Our design maps each node in the Hopfield network to a single PE on GAPP chips. Thus, an additional GAPP chip must be incorporated into the design for every 72 nodes in the Hopfield network. The ease with which these chips are cascaded allows such an approach to be used. When implementing the Hopfield network, the assumption is made that all M exemplar patterns are known a priori. Therefore, the initialization phase of the recollection process is performed off-line on the host computer. The resulting connection weights are downloaded, in signed magnitude format, to the PEs’ local memory as bit planes. The local memory of the PEs is used to store the operands of the sum of products operations required in the search phase. The memory organization of a PE (node j) is illustrated in Figure 2. For practical applications, the GAPP memory is insufficient for storing all weights concurrently, thus segmentation is required. The Hopfield network is implemented in parallel with each PE performing N multiplications and (N - 1) additions per iteration. However, in practice no actual multiplications need occur since the node values are either +1 or -1. Therefore, multiplications are implemented by performing an exclusive-OR operation on the node bit plane and the sign bit plane of the weights. The result replaces the weights' sign bit plane. These results are then summed and stored in the GAPP memory. The sign bit plane of the summations represents the new node values.
Let
$M$ = number of exemplar patterns
$N$ = number of elements in each exemplar pattern
$x_i^s$ = element $i$ of exemplar for pattern $s$ = $\pm 1$
$y_i$ = element $i$ of unknown input pattern = $\pm 1$
$u_i(k)$ = output of node $i$ after $k$ iterations
$t_{ij}$ = interconnection weight from node $i$ to node $j$
Initialization:
$t_{ij} = \sum_{s=1}^{M} x_i^s x_j^s$ if $i \neq j$, and $t_{ij} = 0$ if $i = j$, for $1 \le i, j \le N$
$u_i(0) = y_i$, $1 \le i \le N$
Search:
$u_j(k+1) = f_h\left(\sum_{i=1}^{N} t_{ij} u_i(k)\right)$, $1 \le j \le N$
where $f_h$ is the symmetric hard limiting function; iterate until $u_j(k+1) = u_j(k)$, $1 \le j \le N$
Figure 1: The recollection process in a Hopfield neural network.
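As an informal illustration of the recollection process and of the exclusive-OR substitution for multiplication, the following Python sketch runs the initialization and search phases on ordinary lists. It is not the GAPP implementation: the function names are hypothetical, the PEs are simulated sequentially, and a zero sum is assumed to map to +1 (sign bit 0).

```python
def train_weights(exemplars):
    """Initialization: t[i][j] = sum_s x_i^s x_j^s for i != j, 0 on the diagonal."""
    n = len(exemplars[0])
    return [[0 if i == j else sum(x[i] * x[j] for x in exemplars)
             for j in range(n)] for i in range(n)]


def xor_sum(weights_col, nodes):
    """Sum of t_ij * u_i using only sign flips, in the spirit of the bit-serial PEs.

    A node value of -1 is encoded as sign bit 1. XOR-ing it with the sign bit
    of the weight gives the sign of the product; the magnitude is unchanged.
    """
    total = 0
    for t, u in zip(weights_col, nodes):
        sign = (1 if t < 0 else 0) ^ (1 if u < 0 else 0)
        total += -abs(t) if sign else abs(t)
    return total


def recall(weights, pattern, max_iters=100):
    """Search: u_j(k+1) = f_h(sum_i t_ij u_i(k)), repeated until nothing changes."""
    n = len(pattern)
    nodes = list(pattern)
    for _ in range(max_iters):
        sums = [xor_sum([weights[i][j] for i in range(n)], nodes)
                for j in range(n)]
        # Assumed tie rule: a sum of zero has sign bit 0 and maps to +1.
        new_nodes = [1 if s >= 0 else -1 for s in sums]
        if new_nodes == nodes:      # plays the role of the global-OR test
            break
        nodes = new_nodes
    return nodes


# Two orthogonal 8-element exemplars and a copy of the first with one bit flipped.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, -1, 1, -1, 1, -1, 1, -1]
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]
print(recall(train_weights([a, b]), noisy) == a)
```

In this small example the corrupted input settles on the first exemplar after one update and is confirmed as stable by the next.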
After an iteration has been completed, the input pattern is tested for convergence utilizing the global OR function of the GAPP chips. If the result of the global OR is 1, another iteration is required; thus, it is necessary to transfer the new node values (i.e., the sign bit of the summation) to the host machine. These node values are then downloaded, along with the connection weights, to the GAPP chips in the manner described previously and another iteration is performed.
5 Memory Requirements and Processing Time
The number of bits required to store each weight value and the summation in the search phase are $w = \lceil \log_2(M+1) \rceil + 1$ and $p = \lceil \log_2(NM+1) \rceil + 1$, respectively, where N is the number of nodes in the network and M represents the number of exemplar patterns. Therefore, each PE in the GAPP array has a total memory requirement of $N(w + 1) + p$ bits (see Fig. 2).
Figure 2: Organization of a single PE's memory in the Hopfield neural network implementation on the GAPP. [Memory layout: entries $(t_{1j}, u_1), (t_{2j}, u_2), \ldots, (t_{kj}, u_k)$, each a $w$-bit weight with a 1-bit node value, followed by the $p$-bit sum $\sum_i t_{ij} u_i$.]
If we let $B$ denote the size of a single PE's memory, then each PE has $(B - p)$ bits available for storing weight and node values. If $N(w + 1) > B - p$, there is not enough GAPP memory to store all of the weights at one time, and weights must be shifted into the GAPP memory in segments. The number of weights in each of these segments is given by
$D = \lfloor (B - p)/(w + 1) \rfloor$
while the total number of segments is given by
$S = \lceil N/D \rceil$
Letting $C$ represent the number of clock cycles needed to shift a bit plane into GAPP memory, then the number of clock cycles required to download weight and node values to GAPP memory, and to upload new node values to the host is
$L = [SD(w + 1) + 2]C - 1$
Furthermore, $C$ depends on the number of data bus lines available from the host and the number of GAPP chips, $n$. In particular, $C$ can be expressed as
$C = 12 \lceil 6n / (\text{number of data lines}) \rceil + 1$
The processing time required to implement the search phase of the Hopfield network on the GAPP chips is formulated below. The implementation involves four separate steps. First, the $D$ weights stored in GAPP memory are multiplied by the appropriate node values. As discussed previously, this is performed using an exclusive-OR operation; such an operation requires $3D$ GAPP clock cycles. The second step involves converting the modified weight values into two's complement format; this processing requires $D(4w - 1)$ clock cycles. Next, the $D$ summations required by the search phase are implemented; this can be accomplished in $3Dp$ clock cycles. Finally, 4 clock cycles are required to test for input convergence. The total processing time can now be expressed as
$P = S[3D + D(4w - 1) + 3Dp + 4]$ clock cycles
and the total time required to perform a single iteration of the search phase of the Hopfield network is
$T = L + P = SD[C(w + 1) + 3(p + 1) + (4w - 1)] + 4S + 2C - 1$ clock cycles
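The timing model of this section can be evaluated numerically. The Python sketch below is an illustration only: the function name and arguments are hypothetical, and it assumes a segment size of D = floor((B - p)/(w + 1)), a segment count of S = ceil(N/D), a bit-plane shift time of C = 12 ceil(6n / data lines) + 1 with n = ceil(N/72) chips, and a network large enough that segmentation is actually needed.

```python
def iteration_cycles(n_nodes, n_exemplars, pe_bits=128, data_lines=32):
    """Clock cycles for one search-phase iteration, T = L + P (a sketch)."""
    # ceil(log2(x)) for an integer x >= 1 equals (x - 1).bit_length().
    w = n_exemplars.bit_length() + 1                # bits per weight
    p = (n_nodes * n_exemplars).bit_length() + 1    # bits per summation
    d = (pe_bits - p) // (w + 1)                    # weights per segment (assumed form)
    s = -(-n_nodes // d)                            # number of segments (assumed form)
    chips = -(-n_nodes // 72)                       # one GAPP chip per 72 nodes
    c = 12 * -(-6 * chips // data_lines) + 1        # cycles per bit plane (assumed form)
    load = (s * d * (w + 1) + 2) * c - 1            # L: download and upload
    process = s * (3 * d + d * (4 * w - 1) + 3 * d * p + 4)   # P: four processing steps
    return load + process


cycles = iteration_cycles(360, 54)      # N = 360, M = 0.15 N = 54
print(cycles, cycles / 10e6)            # cycles and seconds at a 10-MHz clock
```

Under these assumptions, a 360 node network with 54 exemplars, 128 bits of PE memory, and a 32-bit bus needs 66,377 clock cycles per iteration, or about 6.6 msec at 10 MHz.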
6 Comparisons and Experimental Results
A comparison of the results obtained in the previous section with other digital implementations of the Hopfield network (Na and Glinski 1988) is illustrated in Figure 3. The curve for the DEC PDP-11/70 can be considered a close approximation for the number of clock cycles required by other sequential processing (von Neumann) architectures. Also, the curve for the GAPP PEs assumes the use of a standard 32-bit bus. All of the curves in the figure are plotted with the assumption that $M = \lfloor 0.15N \rfloor$. As more nodes are added, the number of clock cycles required to process the data on the PDP-11/70 and Graph Search Machine (GSM) increases much more rapidly than it does on the GAPP PEs; this can be attributed to the high degree of fine-grained parallelism employed by the GAPP processors when executing the Hopfield algorithm. For example, when implementing a 360 node network, this design requires 7 msec to perform a single iteration. Extrapolation of the curves in Figure 3 also indicates that for large networks, the ability to implement the network in parallel will easily outstrip any gains achieved by using a faster clock cycle on a sequential processing computer. For example, executing Hopfield networks on the order of 100,000 nodes yields an approximate 132-fold speedup over a sequential implementation. Therefore, a sequential computer with a clock frequency twice as fast as that of the GAPP will still be 66 times slower than the Hopfield network implementation on GAPP processors. In terms of connections per second (CPS), the 126,720 PE GAPP array discussed earlier can deliver approximately 19 million CPS while running at 10 MHz. The same array running at 20 MHz would yield nearly 38 million CPS, where CPS is defined as the number of multiply-and-accumulate operations that can be performed in a second.
In this case, the CPS is determined by dividing the total number of connections by the time required to perform a single iteration of the Hopfield algorithm (the time required to shift in weight values from the host, and the time required to perform the symmetric hard limiting function, $f_h$, are also included). These results compare favorably to other more costly general-purpose parallel processing computers such as a Connection Machine, CM-2, with 64 thousand processors (13 million CPS), a 10-processor WARP systolic array (17 million CPS), and a 64-processor Butterfly computer (8 million CPS). It should be noted, however, that the CPS measure is dependent on the neural network algorithm being executed. Therefore, in terms of comparison, these figures should be considered only as rough estimates of performance (DARPA study 1988). To verify the implementation of the Hopfield network presented in Section 4, and the analysis presented in Section 5, a 12 x 10 node Hopfield network was successfully implemented on a GAPP PC development system using the GAL (GAPP algorithm language) compiler. The exemplar patterns chosen were those used by Lippmann et al. (1987) in their character recognition experiments. The implementation of these experiments in fact corroborated the predicted results.
Figure 3: Number of clock cycles required to implement a single iteration of the Hopfield network (search phase) on a PDP-11/70, the Graph Search Machine and GAPP processors. Because of the explosive growth rates of the PDP-11/70 and GSM curves, this graph displays GAPP results for only a relatively small number of nodes. However, the analysis presented here is valid for arbitrarily large networks.
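The CPS figure quoted for the 126,720 PE array can be roughly cross-checked against the single-iteration time T of Section 5. The Python sketch below is illustrative only: the function name is hypothetical, the number of connections is taken to be N squared, M is taken as floor(0.15N), and the segment size, segment count, and bit-plane shift time are assumed to be D = floor((B - p)/(w + 1)), S = ceil(N/D), and C = 12 ceil(6n / data lines) + 1.

```python
def connections_per_second(n_nodes, clock_hz, pe_bits=128, data_lines=32):
    """Connections divided by the time of one search-phase iteration (a sketch)."""
    m = 15 * n_nodes // 100                         # M = floor(0.15 N)
    w = m.bit_length() + 1                          # ceil(log2(M + 1)) + 1
    p = (n_nodes * m).bit_length() + 1              # ceil(log2(N M + 1)) + 1
    d = (pe_bits - p) // (w + 1)                    # weights per segment (assumed form)
    s = -(-n_nodes // d)                            # number of segments (assumed form)
    chips = -(-n_nodes // 72)                       # 72 PEs per GAPP chip
    c = 12 * -(-6 * chips // data_lines) + 1        # cycles per bit plane (assumed form)
    # T = SD[C(w + 1) + 3(p + 1) + (4w - 1)] + 4S + 2C - 1
    cycles = s * d * (c * (w + 1) + 3 * (p + 1) + (4 * w - 1)) + 4 * s + 2 * c - 1
    return n_nodes * n_nodes * clock_hz / cycles    # N^2 connections assumed


print(connections_per_second(126720, 10e6) / 1e6)   # millions of CPS at 10 MHz
print(connections_per_second(126720, 20e6) / 1e6)   # millions of CPS at 20 MHz
```

With these assumptions the sketch gives about 18.8 million CPS at 10 MHz and about 37.5 million CPS at 20 MHz, in line with the approximately 19 million and nearly 38 million quoted in the text.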
Acknowledgments
This research was supported by a grant from the Division of Sponsored Research at the University of Central Florida.
References
Alspector, J., Gupta, B., and Allen, R. B. 1989. Performance of a stochastic learning microchip. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Brown, J. R. and Tommasi, M. 1989. Martin Marietta Electronic Systems, Orlando, FL. Personal communication.
DARPA neural network study. 1988. B. Widrow, Study Director. AFCEA International Press.
Davis, R. and Thomas, D. 1988. Systolic array chip matches the pace of high-speed processing. Electronic Design, October.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hwang, K. and Briggs, F. 1984. Computer Architecture and Parallel Processing. McGraw-Hill, New York.
Komlos, J. and Paturi, R. 1988. Convergence results in an associative memory model. Neural Networks 1, 239-250.
Kung, S. Y. and Hwang, J. N. 1988. Parallel architectures for artificial neural nets. In Proceedings of the IEEE International Conference on Neural Networks, Vol. II, San Diego, CA, pp. 165-172.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE Acoustics Speech Signal Proc. Mag. 4(2), 4-22.
Lippmann, R. P., Gold, B., and Malpass, M. L. 1987. A Comparison of Hamming and Hopfield Neural Nets for Pattern Classification. Tech. Rep. 769, M.I.T., Lincoln Laboratory, Lexington, MA.
Na, H. and Glinski, S. 1988. Neural net based pattern recognition on the graph search machine. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing, New York.
NCR Corp., Dayton, Ohio. 1984. Geometric arithmetic parallel processor (GAPP) data sheet.
Received 5 June 1989; accepted 2 October 1989.
Communicated by John Moody
Gram-Schmidt Neural Nets Sophocles J. Orfanidis Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ 08855 USA
A new type of feedforward multilayer neural net is proposed that exhibits fast convergence properties. It is defined by inserting a fast adaptive Gram-Schmidt preprocessor at each layer, followed by a conventional linear combiner-sigmoid part which is adapted by a fast version of the backpropagation rule. The resulting network structure is the multilayer generalization of the gradient adaptive lattice filter and the Gram-Schmidt adaptive array.
1 Introduction
In signal processing language, a feedforward multilayer neural net adapted by the backpropagation rule (Rumelhart et al. 1986; Werbos 1988; Parker 1987) is the multilayer generalization of the adaptive linear combiner adapted by the Widrow-Hoff LMS algorithm (Widrow and Stearns 1985). The backpropagation rule inherits the computational simplicity of the LMS algorithm. But, like the latter, it often exhibits slow speed of convergence. The convergence properties of the LMS algorithm are well known (Widrow and Stearns 1985). Its learning speed depends on the correlations that exist among the components of the input vectors - the stronger the correlations, the slower the speed. This can be understood intuitively by noting that, if the inputs are strongly correlated, the combiner has to linearly combine a lot of redundant information, and thus, will be slow in learning the statistics of the input data. On the other hand, if the input vector is decorrelated by a preprocessor prior to the linear combiner, the combiner will linearly combine only the nonredundant part of the same information, thus, adapting faster to the input. Such preprocessor realizations of the adaptive linear combiner lead naturally to the fast Gram-Schmidt preprocessors of adaptive antenna systems and to the adaptive lattice filters of time-series problems (Widrow and Stearns 1985; Monzingo and Miller 1980; Compton 1988; Orfanidis 1988). In this paper, we consider the generalization of such preprocessor structures to multilayer neural nets and discuss their convergence properties. The proposed network structure is defined by inserting, at each layer of the net, a Gram-Schmidt preprocessor followed by the conventional linear combiner and sigmoid parts. The purpose of each preprocessor is to decorrelate its inputs and provide decorrelated inputs to the linear combiner that follows. Each preprocessor is itself realized by a linear transformation, but of a special kind, namely, a unit lower triangular matrix. The weights of the preprocessors are adapted locally at each layer, but the weights of the linear combiners must be adapted by the backpropagation rule. We discuss a variety of adaptation schemes for the weights, both LMS-like versions and fast versions. The latter are, in some sense, implementations of Newton-type methods for minimizing the performance index of the network. Newton methods for neural nets have been considered previously (Sutton 1986; Dahl 1987; Watrous 1987; Kollias and Anastassiou 1988; Hush and Salas 1988; Jacobs 1988). These methods do not change the structure of the network - only the way the weights are adapted. They operate on the correlated signals at each layer, whereas the proposed methods operate on the decorrelated ones.
Neural Computation 2, 116-126 (1990) © 1990 Massachusetts Institute of Technology
2 Gram-Schmidt Preprocessors
In this section, we summarize the properties of Gram-Schmidt preprocessors for adaptive linear combiners. Our discussion is based on Orfanidis (1988). The correlations among the components of an $(M + 1)$-dimensional input vector $x = [x_0, x_1, \ldots, x_M]^T$ are contained in its covariance matrix $R = E[xx^T]$, where $E[\,]$ denotes expectation and the superscript $T$ transposition. The Gram-Schmidt orthogonalization procedure generates a new basis $z = [z_0, z_1, \ldots, z_M]^T$ with mutually uncorrelated components, that is, $E[z_i z_j] = 0$ for $i \neq j$. It is defined by starting at $z_0 = x_0$ and proceeding recursively for $i = 1, 2, \ldots, M$
$z_i = x_i - \sum_{j=0}^{i-1} b_{ij} z_j$   (2.1)
where the coefficients $b_{ij}$ are determined by the requirement that $z_i$ be decorrelated from all the previous $z$s $\{z_0, z_1, \ldots, z_{i-1}\}$. These coefficients define a unit lower triangular matrix $B$ such that
$x = Bz$   (2.2)
known as the innovations representation of $x$. For example, if $M = 3$,
$B = \begin{bmatrix} 1 & 0 & 0 & 0 \\ b_{10} & 1 & 0 & 0 \\ b_{20} & b_{21} & 1 & 0 \\ b_{30} & b_{31} & b_{32} & 1 \end{bmatrix}$
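The decorrelation requirement fixes the coefficients: setting $E[z_i z_j] = 0$ for $j < i$ gives $b_{ij} = E[x_i z_j]/E[z_j^2]$, which can be evaluated from the covariance matrix alone. The following Python sketch is illustrative only (the function names and the example matrix are hypothetical). It computes the unit lower triangular matrix and the variances of the decorrelated components from a given R, and applies the forward-substitution recursion to one input vector.

```python
def gram_schmidt_factor(R):
    """Return (B, E) with B unit lower triangular and E[k] the variance of z_k,
    so that R[i][j] = sum_k B[i][k] * E[k] * B[j][k]."""
    m = len(R)
    B = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    E = [0.0] * m
    for i in range(m):
        for j in range(i):
            # E[x_i z_j] = R_ij minus the part already explained by z_0 .. z_{j-1}
            acc = sum(B[i][k] * B[j][k] * E[k] for k in range(j))
            B[i][j] = (R[i][j] - acc) / E[j]
        E[i] = R[i][i] - sum(B[i][k] ** 2 * E[k] for k in range(i))
    return B, E


def decorrelate(B, x):
    """Solve x = B z for z by forward substitution."""
    z = []
    for i, xi in enumerate(x):
        z.append(xi - sum(B[i][j] * z[j] for j in range(i)))
    return z


# Hypothetical 3 x 3 covariance matrix with equally correlated components.
R = [[4.0, 2.0, 2.0], [2.0, 4.0, 2.0], [2.0, 2.0, 4.0]]
B, E = gram_schmidt_factor(R)
print(B)
print(E)
print(decorrelate(B, [1.0, 1.0, 1.0]))
```

For this example the variances of the decorrelated components come out as 4, 3, and 8/3, and multiplying the factors back together reproduces R.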
Figure 1: (a) Gram-Schmidt preprocessor. (b) Elementary building block.
Equation 2.1 is shown in Figure 1. It represents the solution of the lower triangular linear system 2.2 by forward substitution. The covariance matrix of $z$ is diagonal,
$\mathcal{D} = E[zz^T] = \mathrm{diag}\{\mathcal{E}_0, \mathcal{E}_1, \ldots, \mathcal{E}_M\}$
where $\mathcal{E}_i = E[z_i^2]$. It is related to $R$ by $R = B\mathcal{D}B^T$, which is recognized as the Cholesky factorization of $R$. Thus, all the correlation properties of $x$ are contained in $B$, whereas the essential, nonredundant, information in $x$ is contained in the uncorrelated innovations vector $z$. For the adaptive implementations, it proves convenient to cast the Gram-Schmidt preprocessor as a prediction problem with a quadratic performance index, iteratively minimized using a gradient descent scheme. Indeed, an equivalent computation of the optimal weights $b_{ij}$ is based on the sequence of minimization problems
$\mathcal{E}_i = E[z_i^2] = \min, \quad i = 1, 2, \ldots, M$   (2.3)
where, for each $i$, the minimization is with respect to the coefficients $b_{ij}$, $j = 0, 1, \ldots, i - 1$. Each $z_i$ may be thought of as the prediction error in predicting $x_i$ from the previous $z$s, or equivalently, the previous $x$s. The decorrelation conditions $E[z_i z_j] = 0$ are precisely the orthogonality conditions for the minimization problems (2.3). The gradient of the performance index $\mathcal{E}_i$ is $\partial\mathcal{E}_i/\partial b_{ij} = -2E[z_i z_j]$. Dropping the expectation value (and a factor of two), we obtain the LMS-like gradient-descent delta rule for updating $b_{ij}$
$\Delta b_{ij} = \mu z_i z_j$   (2.4)
where $\mu$ is a learning rate parameter. A faster version, obtained by applying Newton's method to the decorrelated basis, is as follows (Orfanidis 1988):
$\Delta b_{ij} = \frac{\beta}{E_j} z_i z_j$   (2.5)
where $\beta$ is usually set to 1 and $E_j$ is a time-average approximation to $\mathcal{E}_j = E[z_j^2]$ updated from one iteration to the next by
$E_j = \lambda E_j + z_j^2$   (2.6)
where $\lambda$ is a "forgetting" factor with a typical value of 0.9. Next, we consider the Gram-Schmidt formulation of the adaptive linear combiner. Its purpose is to generate an optimum estimate of a desired output vector $d$ by the linear combination $y = Wx$, by minimizing the mean square error
$\mathcal{E} = E[e^T e] = \min$   (2.7)
where $e = d - y$ is the estimation error. The output $y$ may also be computed in the decorrelated basis $z$ by
$y = Wx = Gz$   (2.8)
where $G$ is the combiner's weight matrix in the new basis, defined by
$WB = G \iff W = GB^{-1}$   (2.9)
The conventional LMS algorithm is obtained by considering the performance index (2.7) to be a function of the weight matrix $W$. In this case, the matrix elements of $W$ are adapted by $\Delta w_{ij} = \mu e_i x_j$. Similarly, viewing the performance index as a function of $G$ and carrying out gradient descent with respect to $G$, we obtain the LMS algorithm for adapting the matrix elements of $G$
$\Delta g_{ij} = \mu e_i z_j$   (2.10)
A faster version is
$\Delta g_{ij} = \frac{\mu\beta}{E_j} e_i z_j$   (2.11)
with $E_j$ adapted by 2.6. Like 2.5, it is equivalent to applying Newton's method with respect to the decorrelated basis. Conceptually, the adaptation of $B$ has nothing to do with the adaptation of $G$, each being the solution to a different optimization problem. However, in practice, $B$ and $G$ are simultaneously adapted using equations 2.4 and 2.10, or their fast versions, equations 2.5 and 2.11. In 2.11, we used the scale factor $\mu\beta$ instead of $\mu$ to allow us greater flexibility when adapting both $b_{ij}$ and $g_{ij}$.
3 Gram-Schmidt Neural Nets
In this section, we incorporate the Gram-Schmidt preprocessor structures into multilayer neural nets and discuss various adaptation schemes. Consider a conventional multilayer net with $N$ layers and let $u^n$, $x^n$ denote the input and output vectors at the $n$th layer and $W^n$ the weight matrix connecting the $n$th and $(n + 1)$th layers, as shown in Figure 2. The overall input and output vectors are $x^0$, $x^N$. The operation of the network is described by the forward equations: For $n = 0, 1, \ldots, N - 1$
$u^{n+1} = W^n x^n$   (3.1)
$x^{n+1} = f(u^{n+1})$   (3.2)
where we denote $f(u) = [f(u_0), f(u_1), \ldots]^T$ if $u = [u_0, u_1, \ldots]^T$. The sigmoidal function is defined by
$f(u) = \frac{1}{1 + e^{-u}}$
Figure 2: (a) Conventional net. (b) Gram-Schmidt net.
The performance index of the network is
$\mathcal{E} = \frac{1}{2} \sum_{\text{patterns}} (d - x^N)^T (d - x^N)$   (3.3)
For each presentation of a desired input/output pattern $\{x^0, d\}$, the backpropagation rule (Rumelhart et al. 1986; Werbos 1988; Parker 1987) computes the gradients $e^n = -\partial\mathcal{E}/\partial u^n$ by starting at the output layer
$e^N = D^N (d - x^N)$   (3.4)
and proceeding backward to the hidden layers, for $n = N - 1, N - 2, \ldots, 1$
$e^n = D^n W^{nT} e^{n+1}$   (3.5)
where $D^n = \mathrm{diag}\{f'(u^n)\}$ is the diagonal matrix of derivatives of the sigmoidal function, and $f' = f(1 - f)$. The weights $W^n$ are adapted by
$\Delta w_{ij}^n = \mu e_i^{n+1} x_j^n$   (3.6)
or, by the "momentum" method (Rumelhart et al. 1986)
$\Delta w_{ij}^n = \alpha \Delta w_{ij}^n + \mu e_i^{n+1} x_j^n$   (3.7)
where $\alpha$ plays a role analogous to the forgetting factor $\lambda$ of the previous section. The proposed Gram-Schmidt network structure is defined by inserting an adaptive Gram-Schmidt preprocessor at each layer of the network. Let $z^n$ be the decorrelated outputs at the $n$th layer and $B^n$ the corresponding Gram-Schmidt matrix, such that by equation 2.2, $x^n = B^n z^n$, and let $G^n$ be the linear combiner matrix, as shown in Figure 2. It is related to $W^n$ by equation 2.9, $W^n B^n = G^n$, which implies that $W^n x^n = G^n z^n$. The forward equations 3.1-3.2 are replaced now by
$B^n z^n = x^n$   (3.8)
$u^{n+1} = G^n z^n$   (3.9)
$x^{n+1} = f(u^{n+1})$   (3.10)
where 3.8 is solved for $z^n$ by forward substitution, as in 2.1. Inserting $W^n = G^n (B^n)^{-1}$ in the backpropagation equation 3.5, we obtain $e^n = D^n W^{nT} e^{n+1} = D^n (B^{nT})^{-1} G^{nT} e^{n+1}$. To facilitate this computation, define the intermediate vector $f^n = (B^{nT})^{-1} G^{nT} e^{n+1}$ or, $B^{nT} f^n = G^{nT} e^{n+1}$. Then, equation 3.5 can be replaced by the pair
$B^{nT} f^n = G^{nT} e^{n+1}$   (3.11)
$e^n = D^n f^n$   (3.12)
where, noting that $B^{nT}$ is an upper triangular matrix, equation 3.11 may be solved efficiently for $f^n$ using backward substitution. Using 2.4, the adaptation equations for the b-weights are given by
$\Delta b_{ij}^n = \mu z_i^n z_j^n$   (3.13)
or, by the fast version based on 2.5
$\Delta b_{ij}^n = \frac{\beta}{E_j^n} z_i^n z_j^n$   (3.14)
with $E_j^n$ updated from one iteration to the next by
$E_j^n = \lambda E_j^n + (z_j^n)^2$   (3.15)
Similarly, the adaptation of the g-weights is given by
$\Delta g_{ij}^n = \mu e_i^{n+1} z_j^n$   (3.16)
and its faster version based on 2.11
$\Delta g_{ij}^n = \frac{\mu\beta}{E_j^n} e_i^{n+1} z_j^n$   (3.17)
Momentum updating may also be used, leading to the alternative adaptation
$\Delta g_{ij}^n = \alpha \Delta g_{ij}^n + \mu e_i^{n+1} z_j^n$   (3.18)
and its faster version
$\Delta g_{ij}^n = \alpha \Delta g_{ij}^n + \frac{\mu\beta}{E_j^n} e_i^{n+1} z_j^n$   (3.19)
The complete algorithm consists of the forward equations 3.8-3.10, the backward equations 3.4 and 3.11,3.12, and the adaptation equations 3.13 and 3.16, or the faster versions, equations 3.14, 3.15, and 3.17.
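One presentation of a training pattern under the fast versions of the algorithm can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function names are hypothetical, bias units are omitted, momentum is not used, and the order in which the variance estimates, the g-weights, and the b-weights are updated within a layer is an assumption.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward_sub(B, x):
    """Solve B z = x for unit lower triangular B (forward substitution)."""
    z = []
    for i in range(len(x)):
        z.append(x[i] - sum(B[i][j] * z[j] for j in range(i)))
    return z


def backward_sub(B, rhs):
    """Solve B^T f = rhs, where B^T is unit upper triangular."""
    m = len(rhs)
    f = [0.0] * m
    for i in reversed(range(m)):
        f[i] = rhs[i] - sum(B[j][i] * f[j] for j in range(i + 1, m))
    return f


def train_step(Bs, Gs, Es, x0, d, mu=0.25, beta=1.0, lam=0.85):
    """Present one pattern (x0, d); update Bs, Gs, Es in place.

    Returns the squared-error contribution of this pattern, measured before
    the update. Bs[n], Gs[n], Es[n] belong to layer n.
    """
    xs, zs = [list(x0)], []
    for B, G in zip(Bs, Gs):                        # forward pass
        z = forward_sub(B, xs[-1])                  # decorrelate layer inputs
        u = [sum(g * zj for g, zj in zip(row, z)) for row in G]
        zs.append(z)
        xs.append([sigmoid(v) for v in u])
    out = xs[-1]
    # Output-layer gradient, using f' = f (1 - f).
    e = [o * (1.0 - o) * (t - o) for o, t in zip(out, d)]
    for n in reversed(range(len(Gs))):              # backward pass
        B, G, E, z = Bs[n], Gs[n], Es[n], zs[n]
        # Right-hand side G^T e, computed before G and B are modified.
        rhs = [sum(G[i][j] * e[i] for i in range(len(e))) for j in range(len(z))]
        f = backward_sub(B, rhs)
        for j in range(len(z)):                     # running variance estimates
            E[j] = lam * E[j] + z[j] ** 2
        for i in range(len(e)):                     # fast g-weight update
            for j in range(len(z)):
                G[i][j] += mu * beta / E[j] * e[i] * z[j]
        for i in range(len(z)):                     # fast b-weight update
            for j in range(i):
                B[i][j] += beta / E[j] * z[i] * z[j]
        if n > 0:                                   # gradient for the layer below
            e = [xi * (1.0 - xi) * fi for xi, fi in zip(xs[n], f)]
    return 0.5 * sum((t - o) ** 2 for o, t in zip(out, d))


# A 3:3:2 net with identity preprocessors, zero combiner weights, and
# variance estimates initialized to a small nonzero value.
identity = lambda m: [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
Bs = [identity(3), identity(3)]
Gs = [[[0.0] * 3 for _ in range(3)], [[0.0] * 3 for _ in range(2)]]
Es = [[0.01] * 3, [0.01] * 3]
print(train_step(Bs, Gs, Es, [0.0, 1.0, 1.0], [0.0, 1.0]))
```

Because only the strictly lower triangular entries of each preprocessor matrix are touched, the matrices remain unit lower triangular throughout, so the forward and backward substitutions stay valid after every update.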
4 Simulation Results
In this section, we present some simulations illustrating the performance of the proposed network structures. Consider two network examples: the first is a 3:3:2 network consisting of three input units, two output units, and one hidden layer with three units, and the second is a 3:6:2 network that has six hidden units. We choose a set of eight input/output training patterns given by (each column is one pattern)
Inputs:   0 0 0 0 1 1 1 1
          0 0 1 1 0 0 1 1
          0 1 0 1 0 1 0 1
Outputs:  0 1 1 0 1 0 0 1
          1 0 0 1 0 1 1 0
They correspond to the 3-input parity problem in which the first of the two outputs is simply the parity of the three inputs and the second output is the complement of the first output. Figure 3 shows the performance index (3.3) versus iteration number for the conventional and Gram-Schmidt nets, where each iteration represents one epoch, that is, the presentation of all eight patterns in sequence. In the first two graphs, the linear combiner weights were adapted on an epoch basis, that is, accumulating equation 3.6 in the conventional case or equation 3.17 in the Gram-Schmidt case, over all eight patterns and then updating. The last two graphs correspond to momentum or pattern updating, that is, using equations 3.7 and 3.19 on a pattern basis. The b-weights were adapted only on a pattern basis using the fast method, equations 3.14 and 3.15. The following values of the parameters were used: $\mu = 0.25$, $\beta = 1$, $\lambda = \alpha = 0.85$. The same values of $\mu$ and $\alpha$ were used in both the conventional and Gram-Schmidt cases. To avoid possible divisions by zero in equations 3.14 and 3.17, the quantities $E_j^n$ were initialized to some small nonzero value, $E_j^n = \delta$, typically $\delta = 0.01$. The algorithm is very insensitive to the value of $\delta$. Also, bias terms in equations 3.9 were incorporated by extending the vector $z^n$ by an additional unit which was always on. It has been commonly observed in the neural network literature that there is strong dependence of the convergence times on the initial conditions. Therefore, we computed the average convergence times for the above examples based on 200 repetitions of the simulations with random initializations. The random initial weights were chosen using a uniform distribution in the range [-1, 1]. The convergence time was defined as the number of iterations for the performance index (3.3) to drop below a certain threshold value - here, $\mathcal{E}_{max} = 0.01$. The average convergence times are shown in Table 1. The speed advantage of the Gram-Schmidt method is evident.
Figure 3: Learning curves of conventional and Gram-Schmidt nets. [Four panels: 3:3:2, epoch updating; 3:6:2, epoch updating; 3:3:2, pattern updating; 3:6:2, pattern updating. Horizontal axis: iterations.]
Table 1: Average Convergence Times
                3:3:2             3:6:2
                Epoch   Pattern   Epoch   Pattern
Conventional    2561    1038      1660    708
Gram-Schmidt     923     247       349    100
5 Discussion
Convergence proofs of the proposed algorithms are straightforward only in the partially adaptive cases, that is, adapting $B^n$ with fixed $G^n$ or adapting $G^n$ with fixed $B^n$. In the latter case, it is easily shown that 3.16, in conjunction with the backpropagation equations 3.11, 3.12, implements gradient descent with respect to the g-weights. When $B^n$ and $G^n$ are simultaneously adaptive, convergence proofs are not available, not even for the single-layer adaptive combiners that are widely used in signal processing applications. Although we have presented here only a small simulation example, we expect the benefits of the Gram-Schmidt method to carry over to larger neural net problems. The convergence rate of the LMS algorithm for an ordinary adaptive linear combiner is controlled by the eigenvalue spread, $\lambda_{max}/\lambda_{min}$, of the input covariance matrix $R = E[xx^T]$. The Gram-Schmidt preprocessors achieve faster speed of convergence by adaptively decorrelating the inputs to the combiner and equalizing the eigenvalue spread - the relative speed advantage being roughly proportional to $\lambda_{max}/\lambda_{min}$. In many applications, such as adaptive array processing, as the problem size increases so does the eigenvalue spread, thus, making the use of the Gram-Schmidt method more effective. We expect the same behavior to hold for larger neural network problems. A guideline whether the use of the Gram-Schmidt method is appropriate for any given neural net problem can be obtained by computing the eigenvalue spread of the covariance matrix of the input patterns $x^0$:
$R = \sum_{\text{patterns}} x^0 x^{0T}$
If the eigenvalue spread is large, the Gram-Schmidt method is expected to be effective. For our simulation example, it is easily determined that the corresponding eigenvalue spread is $\lambda_{max}/\lambda_{min} = 4$. The results of Table 1 are consistent with this speed-up factor.
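The eigenvalue spread quoted for the simulation example can be checked directly. The Python sketch below is illustrative (the function names are hypothetical, and the Jacobi rotation method is simply one convenient way to obtain the eigenvalues of a small symmetric matrix). It forms R as the sum of outer products of the eight 3-bit input patterns, without a bias unit, and reports the ratio of its largest to smallest eigenvalue.

```python
import math
from itertools import product


def pattern_covariance(patterns):
    """R = sum over patterns of x x^T (unnormalized, as in the text)."""
    m = len(patterns[0])
    return [[float(sum(x[i] * x[j] for x in patterns)) for j in range(m)]
            for i in range(m)]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-24):
    """Eigenvalues of a small symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Rotation angle (|theta| <= pi/4) that zeroes a[p][q].
                num, den = 2.0 * a[p][q], a[q][q] - a[p][p]
                if den < 0:
                    num, den = -num, -den
                theta = 0.5 * math.atan2(num, den)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # columns p and q of A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # rows p and q of J^T (A J)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def eigenvalue_spread(patterns):
    eig = symmetric_eigenvalues(pattern_covariance(patterns))
    return eig[-1] / eig[0]


inputs = list(product([0, 1], repeat=3))    # the eight 3-bit input patterns
print(pattern_covariance(inputs))
print(eigenvalue_spread(inputs))
```

Here R has 4 on the diagonal and 2 elsewhere, so its eigenvalues are 8, 2, and 2, and the spread is 4, in agreement with the value stated above.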
References
Compton, R. T. 1988. Adaptive Antennas. Prentice-Hall, Englewood Cliffs, NJ.
Dahl, E. D. 1987. Accelerated learning using the generalized delta rule. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-523.
Hush, D. R. and Salas, J. M. 1988. Improving the learning rate of back-propagation with the gradient reuse algorithm. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-441.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295.
Kollias, S. and Anastassiou, D. 1988. Adaptive training of multilayer neural networks using a least squares estimation technique. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-383.
Monzingo, R. A. and Miller, T. W. 1980. Introduction to Adaptive Arrays. Wiley, New York.
Orfanidis, S. J. 1988. Optimum Signal Processing, 2nd ed. McGraw-Hill, New York.
Parker, D. B. 1987. Optimal algorithms for adaptive networks: Second order back propagation, second order direct propagation, second order Hebbian learning. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-593, and earlier references therein.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Sutton, R. S. 1986. Two problems with back propagation and other steepest-descent learning procedures for networks. Proc. 8th Ann. Conf. Cognitive Sci. Soc., p. 823.
Watrous, R. L. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-619.
Werbos, P. J. 1988. Backpropagation: Past and future. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-343, and earlier references therein.
Widrow, B. and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Received 10 July 1989; accepted 13 November 1989.
Errata
In Halbert White’s “Learning in Artificial Neural Networks: A Statistical Perspective” (1:447), step one was omitted from a discussion of the multilevel single linkage algorithm of Rinnooy Kan et al. (1985). The step is:
1. Draw a weight vector w from the uniform distribution over w.
The artwork for figures 2 and 3 in “Backpropagation Applied to Handwritten Zip Code Recognition,” by Y. LeCun et al. (1:545 and 548) was transposed. The legends were placed correctly.
Communicated by Dana Ballard
Visual Perception of Three-Dimensional Motion David J. Heeger* The Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Allan Jepson Computer Science Department, University of Toronto, Toronto, Ontario M5S 1A4, Canada
As an observer moves and explores the environment, the visual stimulation in his eye is constantly changing. Somehow he is able to perceive the spatial layout of the scene, and to discern his movement through space. Computational vision researchers have been trying to solve this problem for a number of years with only limited success. It is a difficult problem to solve because the relationship between the optical-flow field, the 3D motion parameters, and depth is nonlinear. We have come to understand that this nonlinear equation describing the optical-flow field can be split by an exact algebraic manipulation to yield an equation that relates the image velocities to the translational component of the 3D motion alone. Thus, the depth and the rotational velocity need not be known or estimated prior to solving for the translational velocity. The algorithm applies to the general case of arbitrary motion with respect to an arbitrary scene. It is simple to compute and it is plausible biologically.
1 Introduction
Almost 40 years ago, Gibson (1950) pointed out that visual motion perception is critical for an observer's ability to explore and interact with his environment. Since that time, perception of motion has been studied extensively by researchers in the fields of visual psychophysics, visual neurophysiology, and computational vision. It is now well-known that the visual system has mechanisms that are specifically suited for analyzing motion. In particular, human observers are capable of recovering accurate information about the translational component of three-dimensional motion from the motion in images (Warren and Hannon 1988).
*Current address: NASA-Ames Research Center, mail stop 262-2, Moffett Field, CA 94035 USA.
Neural Computation 2, 129-137 (1990) © 1990 Massachusetts Institute of Technology
The first stage of motion perception is generally believed to be the measurement of image motion, or optical flow, a field of two-dimensional velocity vectors that encodes the direction and speed of displacement for each small region of the visual field. A variety of algorithms for computing optical flow fields have been proposed by a number of computational vision researchers (e.g., Horn and Schunk 1981; Anandan 1989; Heeger 1987). The second stage of motion perception is the interpretation of optical flow in terms of objects and surfaces in the three-dimensional world. As an observer (or camera) moves with respect to a rigid scene (object or surface), the image velocity at a particular image point depends nonlinearly on three quantities: the translational velocity of the observer relative to a point in the scene, the relative rotational velocity between the observer and the point in the scene, and the distance from the observer to the point in the scene. This paper presents a simple algorithm for recovering the translational component of 3D motion. The algorithm requires remarkably little computation; it is straightforward to design parallel hardware capable of performing these computations in real time. The mathematical results in this paper have direct implications for research on biological motion perception.

2 3D Motion and Optical Flow
We first review the physics and geometry of instantaneous rigid-body motion under perspective projection, and derive an equation relating 3D motion to optical flow. Although this equation has been derived previously by a number of authors (e.g., Longuet-Higgins and Prazdny 1980; Bruss and Horn 1983; Waxman and Ullman 1985), we write it in a new form that reveals its underlying simplicity. Each point in a scene has an associated position vector, $P = (X, Y, Z)^t$, relative to a viewer-centered coordinate frame as depicted in Figure 1. Under perspective projection this surface point projects to a point in the image plane, $(x, y)^t$,

$$x = fX/Z, \qquad y = fY/Z \tag{2.1}$$
where $f$ is the "focal length" of the projection. Every point of a rigid body shares the same six motion parameters relative to the viewer-centered coordinate frame. Due to the motion of the observer, the relative motion of a surface point is

$$V = \left( \frac{dX}{dt}, \frac{dY}{dt}, \frac{dZ}{dt} \right)^t = -(\Omega \times P + T) \tag{2.2}$$
where $T = (T_x, T_y, T_z)^t$ and $\Omega = (\Omega_x, \Omega_y, \Omega_z)^t$ denote, respectively, the translational and rotational velocities. Image velocity, $\theta(x, y)$, is defined as the derivatives, with respect to time, of the x- and y-components of the projection of a scene point. Taking derivatives of equation 2.1 with respect to time, and substituting from equation 2.2 gives

$$\theta(x, y) = p(x, y)\, A(x, y)\, T + B(x, y)\, \Omega \tag{2.3}$$

where $p(x, y) = 1/Z$ is inverse depth, and where

$$A(x, y) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix}, \qquad B(x, y) = \begin{pmatrix} xy/f & -(f + x^2/f) & y \\ f + y^2/f & -xy/f & -x \end{pmatrix}$$

The $A(x, y)$ and $B(x, y)$ matrices depend only on the image position and the focal length, not on any of the unknowns. Equation 2.3 describes the image velocity for each point on a rigid body, as a function of the 3D motion parameters and the depth. An important observation about equation 2.3 is that it is bilinear; $\theta$ is a linear function of $T$ and $\Omega$ for fixed $p$, and it is a linear function of $p$ and $\Omega$ for fixed $T$.

Figure 1: Viewer-centered coordinate frame and perspective projection.
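As a minimal sketch of equation 2.3, the following Python evaluates the image velocity at a single image point. The entries of the two matrices follow from differentiating equation 2.1 and substituting equation 2.2; the function names (`A`, `B`, `image_velocity`), the default focal length f = 1, and the sample numbers are illustrative assumptions and do not come from the paper.

```python
def A(x, y, f=1.0):
    """2x3 matrix that multiplies the translation T in equation 2.3."""
    return [[-f, 0.0, x],
            [0.0, -f, y]]


def B(x, y, f=1.0):
    """2x3 matrix that multiplies the rotation Omega in equation 2.3."""
    return [[x * y / f, -(f + x * x / f), y],
            [f + y * y / f, -x * y / f, -x]]


def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def image_velocity(x, y, p, T, Omega, f=1.0):
    """theta(x, y) = p * A(x, y) T + B(x, y) Omega, with p the inverse depth."""
    at = matvec(A(x, y, f), T)
    bo = matvec(B(x, y, f), Omega)
    return [p * a + b for a, b in zip(at, bo)]


# Pure forward translation with no rotation: the flow points away from the
# image center and grows with inverse depth.
print(image_velocity(0.2, -0.1, 0.5, [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]))

# Scale ambiguity: doubling T while halving p leaves the flow unchanged.
print(image_velocity(0.2, -0.1, 0.50, [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]))
print(image_velocity(0.2, -0.1, 0.25, [2.0, 4.0, 6.0], [0.1, 0.2, 0.3]))
```

The last two calls illustrate why only the direction of translation and the relative depth can be recovered: the product of p and T is all that enters the equation.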
3 Optical Flow at Five Image Points

Since both $p(x, y)$ (the depth) and $T$ (the translational component of motion) are unknowns and since they are multiplied together in equation 2.3, they can each be determined only up to a scale factor; that is, we can solve for only the direction of translation and the relative depth, not for the absolute 3D translational velocity or for the absolute depth. For the rest of this paper, $T$ denotes a unit vector in the direction of the 3D translation (note that $T$ now has only two degrees of freedom), and $p(x, y)$ denotes the relative depth for each image point. It is impossible to recover the 3D motion parameters, given the image velocity, $\theta(x, y)$, at only a single image point; there are six unknowns on the right-hand side of equation 2.3 and only two measurements [the two components of $\theta(x, y)$]. Several flow vectors, however, may be utilized in concert to recover the 3D motion parameters and depth. Image velocity measurements at five or more image points are necessary to solve the problem (Prazdny 1983), although any number of four or more vectors may be used in the algorithm described below. For each of five image points, a separate equation can be written in the form of equation 2.3. These five equations can also be collected together into one matrix equation (reusing the symbols in equation 2.3 rather than introducing new notation):

$$\theta = A(T)\, p + B\, \Omega = C(T)\, q \tag{3.1}$$

where $\theta$ (a 10-vector) is the image velocity at each of the five image points, and $p$ (a 5-vector) is the depth at each point. $A(T)$ (a 10 × 5 matrix) is obtained by collecting together into a single matrix $A(x, y)T$ for each $x$ and $y$:

$$A(T) = \begin{pmatrix} A(x_1, y_1)T & & 0 \\ & \ddots & \\ 0 & & A(x_5, y_5)T \end{pmatrix}$$
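Similarly, $B$ (a 10 × 3 matrix) is obtained by collecting together into a single matrix the five $B(x, y)$ matrices:

$$B = \begin{pmatrix} B(x_1, y_1) \\ \vdots \\ B(x_5, y_5) \end{pmatrix}$$

Finally, $q$ (an 8-vector) is obtained by collecting together into one vector the unknown depths and rotational velocities, and $C(T)$ (a 10 × 8 matrix) is obtained by placing the columns of $B$ alongside the columns of $A(T)$:

$$q = \begin{pmatrix} p \\ \Omega \end{pmatrix}, \qquad C(T) = \begin{pmatrix} A(T) & B \end{pmatrix}$$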
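The assembly of equation 3.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`build_C`, `matvec`), the five sample points arranged like the pips on a die, the focal length f = 1, and the depth and rotation values are all hypothetical, and the per-point matrices are the ones obtained by differentiating equation 2.1.

```python
def A(x, y, f=1.0):
    return [[-f, 0.0, x],
            [0.0, -f, y]]


def B(x, y, f=1.0):
    return [[x * y / f, -(f + x * x / f), y],
            [f + y * y / f, -x * y / f, -x]]


def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def build_C(points, T, f=1.0):
    """Return the 2n x (n + 3) matrix C(T) = [A(T) | B] for n image points."""
    n = len(points)
    C = []
    for i, (x, y) in enumerate(points):
        at = matvec(A(x, y, f), T)       # the 2-vector A(x_i, y_i) T
        b = B(x, y, f)
        for r in range(2):
            row = [0.0] * n
            row[i] = at[r]               # block-diagonal part A(T)
            row.extend(b[r])             # stacked part B
            C.append(row)
    return C


# Five points laid out like the pips on a die (illustrative values).
points = [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5)]
T = [1.0, 0.0, 0.0]                      # unit translation direction
p = [1.0, 0.5, 0.25, 0.4, 0.2]           # relative (inverse) depths
Omega = [0.01, -0.02, 0.03]              # rotational velocity

q = p + Omega                            # the 8-vector of unknowns
theta = matvec(build_C(points, T), q)    # the 10-vector of image velocities
print(len(theta), theta[:2])
```

Each pair of entries of `theta` is the flow vector at one point, so the single product C(T) q reproduces equation 2.3 at all five points at once.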
4 Recovering the Direction of Translation

We now present a method for recovering the observer's 3D translational velocity, $T$. The depths and rotational velocity need not be known or estimated prior to solving for $T$. The result is a residual surface, $R(T)$, over the discretely sampled space of all possible translation directions. An illustration of such a residual surface is depicted in Figure 2. The residual function, $R(T)$, is defined such that $R(T_0) = 0$ for $T_0$ equal to the actual 3D translational velocity of the observer, and such that $R(T) > 0$ for $T$ different from the actual value. Equation 3.1 relates the image velocities, $\theta$, at five image points to the product of a matrix, $C(T)$ (that depends on the unknown 3D translational velocity), times a vector, $q$ (the unknown depths and rotational velocity). The matrix, $C(T)$, divides 10-space into two subspaces: the 8-dimensional subspace that is spanned by its columns, and the leftover orthogonal 2-dimensional subspace. The columns of $C(T)$ are guaranteed to span the full 8 dimensions for almost all choices of five points and almost any $T$. In particular, an arrangement of sample points like that shown on dice is sufficient. The 8-dimensional subspace is called the range of $C(T)$, and the 2-dimensional subspace is called the orthogonal complement of $C(T)$. Let $C^\perp(T)$ (a 10 × 2 matrix) be an orthonormal basis for the 2-dimensional orthogonal complement of $C(T)$. It is straightforward, using techniques of numerical linear algebra (Strang 1980), to choose a $C^\perp(T)$ matrix given $C(T)$. The residual function is defined in terms of this basis for the orthogonal complement:

$$R(T) = \left\| \theta^t C^\perp(T) \right\|^2$$
Figure 2: The space of all possible translation directions is made up of the points on the unit hemisphere (plotted as T-space, a flattened hemisphere). The residual function, $R(T)$, is defined to be zero for the true direction of translation. The two-dimensional solution space is parameterized by $\alpha$ and $\beta$, the angles (in spherical coordinates) that specify each point on the unit hemisphere.

Given a measurement of image velocity, $\theta$, and the correct translational velocity, $T_0$, the following three statements are equivalent:

$$\theta = C(T_0)\, q \ \text{ for some } q, \qquad \theta \in \mathrm{range}[C(T_0)], \qquad \theta^t C^\perp(T_0) = 0$$
Figure 3: The direction of translation is recovered by subdividing the flow field into patches. A residual surface is computed for each patch using image velocity measurements at five image points from within that patch. The residual surfaces from each patch are then summed to give a single solution.
Since $\theta$ is in the column space (the range) of $C(T_0)$, and since $C^\perp(T_0)$ is orthogonal to $C(T_0)$, it is clear that $R(T_0) = 0$. The residual function can be computed in parallel for each possible choice of $T$, and residual surfaces can be computed in parallel for different sets of five image points. The resulting residual surfaces are then summed, as illustrated in Figure 3, giving a global least-squares estimate for $T$. It is important to know if there are incorrect translational velocities that might also have a residual of zero. For five point patterns that have small angular extent (e.g., 10° of visual angle or smaller) there may be multiple zeroes in the residual surface. When the inverse depth values of the five points are sufficiently nonplanar, it can be shown that the zeroes of $R(T)$ are concentrated near the great circle that passes through the correct translational direction $T = T_0$ and through the translational direction that corresponds to moving directly toward the center of the five image points. For four point patterns there is a curve of solutions near this great circle.
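The summation of residual surfaces can be illustrated with a toy example. The sketch below assumes each surface is stored as a dictionary from a candidate direction (here a pair of grid indices standing in for the two hemisphere angles) to a residual value; the two surfaces are made up so that each vanishes along a different curve, loosely mimicking the curves of near-zeroes described above. None of the numbers come from the paper.

```python
def sum_surfaces(surfaces):
    """Add residual surfaces that are sampled on the same grid of directions."""
    total = {}
    for surface in surfaces:
        for direction, r in surface.items():
            total[direction] = total.get(direction, 0.0) + r
    return total


def best_direction(surface):
    """Candidate direction with the smallest summed residual."""
    return min(surface, key=surface.get)


# Toy 5 x 5 grid of candidate directions, indexed by (alpha_index, beta_index).
grid = [(a, b) for a in range(5) for b in range(5)]

# Made-up surfaces: each is zero along a whole curve of candidates, so neither
# patch on its own pins down a single direction.
patch_1 = {(a, b): float((a - 2) ** 2) for a, b in grid}       # zero when a == 2
patch_2 = {(a, b): float((a - b + 1) ** 2) for a, b in grid}   # zero when b == a + 1

total = sum_surfaces([patch_1, patch_2])
print(best_direction(total))   # the only candidate that is zero on both curves
```

Because the two curves of zeroes cross at a single grid cell, the summed surface has a unique minimum there, which is the role that patches in different visual directions play in the algorithm.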
The solution is disambiguated by summing residual surfaces from different five point patterns. Two or more sets of five point patterns, in significantly different visual directions, have zeroes concentrated near different great circles. They have simultaneous zeroes only near the intersection of the great circles, namely near $T = T_0$. In cases where the inverse depths are nearly planar, or velocity values are available only within a narrow visual angle, it may be impossible to obtain a unique solution. These cases will be the exception rather than the rule in natural situations. The matrices $C(T)$ and $C^\perp(T)$ depend only on the locations of the five image points and on the choice of $T$; they do not depend on the image velocity inputs. Therefore, the $C^\perp(T)$ matrices may be precomputed (and stored) for each set of five image points, and for each of the discretely sampled values of $T$. As new flow-field measurements become available from incoming images, the residual function, $R(T)$, is computed in two steps: (1) a pair of weighted summations, given by $\theta^t C^\perp(T)$, and (2) the sum of the square of the two resulting numbers. The algorithm is certainly simple enough (a weighted summation followed by squaring followed by a summation) to be implemented in visual cortex. A number of cells could each compute $R(T)$ (or perhaps some function like $\exp[-R(T)]$), each for a different choice of $T$, and each for some region of the visual field. Each cell would thus be tuned for a different direction of 3D translation. The perceived direction of motion would then be represented as the minimum (or the peak) in the distribution of cell firing rates. While it is well-known that cells in several cortical areas (e.g., areas MT and MST) of the primate brain are selective for 2D image velocity, physiologists have not yet tested whether these cells are selective for 3D motion.
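The precompute-then-evaluate scheme can be sketched as follows. This is a hedged illustration, not the authors' code: the orthonormal basis for the orthogonal complement is found here by Gram-Schmidt orthogonalization, which is only one of several standard ways to obtain it (the paper just cites numerical linear algebra in general), and the names `build_C`, `orthogonal_complement`, `precompute`, and `residual`, the die-like point layout, f = 1, and the synthetic depths and rotation are all assumptions.

```python
import math


def A(x, y, f=1.0):
    return [[-f, 0.0, x], [0.0, -f, y]]


def B(x, y, f=1.0):
    return [[x * y / f, -(f + x * x / f), y],
            [f + y * y / f, -x * y / f, -x]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_C(points, T, f=1.0):
    """C(T) = [A(T) | B] as a list of rows (10 x 8 for five points)."""
    n, C = len(points), []
    for i, (x, y) in enumerate(points):
        for row_a, row_b in zip(A(x, y, f), B(x, y, f)):
            row = [0.0] * n
            row[i] = dot(row_a, T)
            C.append(row + row_b)
    return C


def orthogonal_complement(C, tol=1e-9):
    """Orthonormal vectors spanning the complement of the column space of C."""
    m = len(C)
    columns = [[row[j] for row in C] for j in range(len(C[0]))]
    unit = [[1.0 if i == k else 0.0 for i in range(m)] for k in range(m)]
    basis, n_range = [], 0
    for k, v in enumerate(columns + unit):   # columns of C first, then e_1..e_m
        w = list(v)
        for _ in range(2):                   # orthogonalize twice for stability
            for b in basis:
                c = dot(w, b)
                w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:                       # keep only genuinely new directions
            basis.append([wi / norm for wi in w])
            if k < len(columns):
                n_range += 1
    return basis[n_range:]                   # vectors added after the range


def precompute(points, directions, f=1.0):
    """Table of C_perp(T); it depends on point layout and T, not on the flow."""
    return {T: orthogonal_complement(build_C(points, T, f)) for T in directions}


def residual(theta, C_perp):
    """Step 1: weighted sums theta^t C_perp. Step 2: sum of their squares."""
    sums = [dot(theta, c) for c in C_perp]
    return sum(s * s for s in sums)


points = [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5)]
T_true, T_other = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
table = precompute(points, [T_true, T_other])

# Synthetic flow generated with T_true, made-up depths, and a made-up rotation.
q = [1.0, 0.5, 0.25, 0.4, 0.2] + [0.01, -0.02, 0.03]
theta = [dot(row, q) for row in build_C(points, T_true)]

for T in (T_true, T_other):
    print(T, residual(theta, table[T]))
```

Once the table is built, each new flow measurement costs two dot products and two squarings per candidate direction, which is the "weighted summation followed by squaring followed by a summation" described in the text. With noise-free synthetic flow the residual at the generating direction is zero up to rounding, while the other candidate gives a clearly positive value.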
5 Discussion
The algorithm presented in this paper demonstrates that it is simple to recover the translational component of 3D motion from motion in images. We have tested the algorithm and compared its performance to other proposed algorithms. Simulations using synthetic flow fields with noise added demonstrate that our new approach is much more robust. These results are reported in a companion paper (in preparation). We also show in a companion paper that it is straightforward to recover the rotational velocity and the depth once the translation is known. This helps to substantiate Gibson's conjecture that observers can gain an accurate perception of the world around them by active exploration, and that an unambiguous interpretation of the visual world is readily available from visual stimuli.
References

Anandan, P. 1989. A computational framework and an algorithm for the measurement of visual motion. Int. J. Comp. Vision 2, 283-310.
Bruss, A. R., and Horn, B. K. P. 1983. Passive navigation. Comput. Vision, Graphics, Image Proc. 21, 3-20.
Gibson, J. J. 1950. The Perception of the Visual World. Houghton Mifflin, Boston.
Heeger, D. J. 1987. Model for the extraction of image flow. J. Opt. Soc. Am. A 4, 1455-1471.
Horn, B. K. P., and Schunk, B. G. 1981. Determining optical flow. Artif. Intell. 17, 185-203.
Longuet-Higgins, H. C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London B 208, 385-397.
Prazdny, K. 1983. On the information in optical flows. Comput. Vision, Graphics, Image Proc. 22, 239-259.
Strang, G. 1980. Linear Algebra and Its Applications. Academic Press, New York.
Warren, W. H., and Hannon, D. J. 1988. Direction of self-motion is perceived from optical flow. Nature (London) 336, 162-163.
Waxman, A. M., and Ullman, S. 1985. Surface structure and three-dimensional motion from image flow kinematics. Int. J. Robot. Res. 4, 72-94.
Received 28 December 1989; accepted 20 February 1990.
Communicated by Geoffrey Hinton and Steven Zucker
Distributed Symbolic Representation of Visual Shape Eric Saund Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304 USA
The notion of distributed representation has gained significance in explanations of connectionist or "neural" networks. This communication shows that the concept also offers motivation in devising representations for visual shape within a symbolic computing paradigm. In a representation for binary (silhouette) shapes, and in analogy with conventional distributed connectionist networks, descriptive power is gained when microfeatures are available naming important spatial relationships in images. Our symbolic approach is introduced through a vocabulary of 31 "hand built" shape descriptors operating in the two-dimensional shape world of the dorsal fins of fishes.

1 Introduction

A distributed representation expresses information through the ensemble behavior of a collection of microfeatures (Rumelhart et al. 1986). In a conventional connectionist network, microfeatures are created as the result of a training procedure. They arise presumably because they capture some statistical regularity over training data, and to this extent microfeatures contribute to a system's ability to perform correct generalizations and inferences on novel data (Sejnowski and Rosenberg 1987; Rosenberg 1987). The term "microfeature" refers to the fact that units of information are of relatively small grain size - each microfeature typically stands for only a fragment of what a human would consider a unified conceptual object (Hinton 1989). This paper shows that the notion of distributed representation can profitably be exported from its origin in connectionist networks, and incorporated in a symbolic computing architecture in the problem domain of visual shape representation. Our overall goal is to devise representations for visual shape that will support a wide range of visual tasks, including recognizing shapes, classifying shapes into predetermined categories, and delivering meaningful assessments of the ways in which objects may be considered similar or different in shape.
Our specific shape world is a particular class of naturally occurring two-dimensional binary (silhouette) shapes, namely, the dorsal fins of fishes. Microfeatures play an important role in making explicit spatial relationships among fragments of shape such as edges and regions. A particular shape is represented not in any single symbolic node, but through the ensemble behavior of a collection of such microfeatures.

Neural Computation 2, 138-151 (1990) © 1990 Massachusetts Institute of Technology

2 Shape Fragments Represented by Symbolic Tokens
The system architecture is based on symbolic tokens placed on a blackboard, in the style of a production system. Tokens of various types make explicit fragments of shape such as edges, corners, and partially enclosed regions, as shown in Figure 1. Shape tokens are asserted via grouping processes; characteristic configurations or constellations of primitive tokens provide support for the assertion of additional, more abstract, tokens (Marr 1976). We may view the microfeatures of present concern as occurring at the topmost level in an abstraction hierarchy of token types. An explanation of the token grouping operations themselves lies beyond the scope of this paper; for details the reader is referred to Saund (1988b, 1989). In our implementation a microfeature may be regarded as a deformable template, as shown in Figure 2. Unlike purely iconic, pixel-based templates, these templates are not correlated with the image directly, but are asserted according to alignment with more primitive shape fragments represented by shape tokens. The deformation of each template type resembles that of a mechanical linkage, and is characterized by a scalar-valued parameter. For example, the internal parameter of the microfeature shown in Figure 2a makes explicit the relative orientation between an edge and a corner occurring within a certain proximity of one another.
3 Domain-Specific Shape Microfeatures
Under this framework, the descriptive power of a shape representation lies in its vocabulary of deformable template-like descriptors. To what spatial configurations and deformations among shape fragments should explicit microfeatures be devoted? Our view is that a vocabulary of abstract level shape descriptors should be designed to capture the geometric regularities and structure inherent to the target shape domain. Shape events should be labeled that are most useful in characterizing and distinguishing among the shapes that will be encountered by the system. Because the geometric structures and attributes of greatest descriptive significance may differ from domain to domain, it is to be expected that appropriate descriptive vocabularies might be, to at least some degree, domain-specific. For example, a language for distinguishing among the shapes of fish dorsal fins might contain terms for degree of sweepback and angle of protrusion from the body - terms that are irrelevant to the shapes of tree leaves. Just as a connectionist
network embodies knowledge of a problem space within link weights, a vocabulary of shape microfeatures becomes the locus of knowledge about the geometric configurations expected to be encountered in the visual world. Through careful conscious examination of a test set of 43 fish dorsal fin shapes, we have designed a vocabulary of 31 shape microfeatures well-suited to describing and distinguishing among fish dorsal fins. (The class of dorsal fins considered is limited to those that protrude outward from the body; we exclude fishes whose dorsal fins extend along the entire length of the body.) A methodology for designing these shape descriptors is not formalized. Roughly, however, it consisted in identifying collections of dorsal fin shapes that appeared clearly similar or different in some regard, and analyzing the geometric relationships among edge, corner, and region shape fragments that contributed to the similarities or differences in appearance. For example, noticing the "notch" feature occurring at the rear of many dorsal fins led to the development of several templates whose deformations correspond to variations in the depth and vertex angle of the notch. The complete set of shape descriptors is presented in Figure 3. Although it is based on symbolic tokens instead of link weights in a wired network, this shape vocabulary shares three important properties with the microfeatures of traditional connectionist distributed representations: (1) Each descriptive element makes explicit a geometric property pertaining to only a portion of the entire shape; note that no descriptor is called, say, ANCHOVY-DORSAL-FIN or SHARK-DORSAL-FIN. (2) The descriptors share support at the image level. This is to say, two or more abstract level deformable templates may latch onto the same edge or corner fragment. (3) The descriptors overlap one another in a redundant fashion. Many microfeatures participate in the characterization of a fin shape, and it is only through the ensemble description that an object's geometry is specified in its entirety.

Figure 1: (a) Profile fish shape. (b) Shape tokens denote edge, corner, and region shape fragments occurring in the image. A shape token is depicted as a line segment with a circle at one end indicating orientation. Labels in the figure: FULL-CORNER, PARTIAL-CIRCULAR-REGION.

Figure 2: A microfeature resembles a template that can deform like a mechanical linkage to fit a range of fin shapes. By maintaining a characteristic parameter of deformation, each shape microfeature provides explicit access to an aspect of spatial configuration by which dorsal fins may vary in shape. In (a), a microfeature captures the relative orientation between tokens representing a particular corner (circle) and edge (ellipse). (b) Many microfeatures can overlap and share support as they latch onto shape fragments present in the image.

4 What Is Gained by Using Shape Microfeatures?
Figure 3: Complete vocabulary of shape microfeatures designed for the dorsal fin domain. Arrows depict the deformation parameter reflected in each microfeature's name. Descriptor labels legible in the figure include NOTCH-DEPTH-PICLE-WIDTH-RATIO, PARALLEL-SIDES-RELATIVE-ORIENTATION, LEADING-EDGE-ANGLE, CONFIG-III-TOPARC-ORIENTATION, CONFIG-III-TOPARC-CURVATURE, NOTCH-DEPTH-BASE-WIDTH-RATIO, NOTCH-SIZE, NOTCH-VERTEX-ANGLE, LECPE-BACK-EDGE-CURVATURE, LECPE-BACK-EDGE-ORIENTATION, CONFIG-II-TOP-CORNER-BASE-DORIENTATION, PARALLEL-SIDES-NDISTANCE, PARALLEL-SIDES-RELATIVE-SCALE, and CONFIG-II-TOP-CORNER-ROUNDEDNESS; the remaining labels are too garbled in this copy to recover reliably.

The value in this distributed style of representation derives from the direct access it provides to a large number of geometric properties that distinguish shapes in the target domain. By offering explicit names for many significant ways in which one dorsal fin can be similar or different in shape from another, the representation supports a variety of visual tasks, including shape classification, recognition, comparison, and (foreseeably) graphic manipulation. We shall offer three illustrations of the representation at work. First, however, it is useful to consider the deformable template microfeature representation in light of a feature space interpretation. The assertion of an abstract level shape descriptor carries two pieces of information: (1) The statement that a qualified configuration of edge, corner, and/or region shape fragments occurs at this particular pose (location, orientation and scale) in the image, and (2) a scalar parameter corresponding to the template deformation required to latch onto these fragments. As such, the complete distributed description of a shape may be regarded as a point in a high-dimensional feature space (or hyperspace), each feature dimension contributed by the scalar parameter of an individual microfeature. Note, however, that every shape object creates its own feature space, depending on the microfeatures fitting its particular geometry, and different objects' feature spaces may or may not share particular dimensions in common. For example, certain microfeatures apply only to dorsal fins that are rounded on top (e.g., CONFIG-III-TOPARC-CURVATURE), and these dimensions are absent in the descriptions of sharply pointed fins. In this regard the present shape representation differs from a connectionist network, which employs the same set of nodes, or hyperspace feature dimensions, for all problems. Figure 4a shows that the distributed shape vocabulary supports classification of dorsal fin shapes into categories. One simple representation for a shape category is a window in a microfeature hyperspace. A shape's membership in a given category may be cast as a function of (1) the microfeatures it shares in common with the category's hyperspace, and (2) the shape's proximity to the category window along their common microfeature dimensions. Details of such a scheme are presented in Saund (1988b); however, the efficacy of this approach lies not in the computational details, but in the particular feature dimensions available for establishing category boundaries. By tailoring microfeatures to the significant geometric properties of the domain, the representation provides fitting criteria for the establishment of salient equivalence classes of shapes.
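To make the idea of a category window concrete, here is a minimal sketch of one possible membership score. The actual scheme is given in Saund (1988b) and is not reproduced here; the scoring rule below (fraction of shared dimensions multiplied by a mean proximity term), the `falloff` parameter, and all numeric window bounds and parameter values are hypothetical. Only the microfeature names are taken from the figures.

```python
def membership(shape, category, falloff=1.0):
    """Hypothetical score in [0, 1] for a shape against a category window.

    shape:    dict mapping microfeature name -> scalar deformation parameter
    category: dict mapping microfeature name -> (low, high) window bounds
    """
    shared = [name for name in category if name in shape]
    if not shared:
        return 0.0                       # no common feature dimensions at all
    coverage = len(shared) / len(category)
    proximity = 0.0
    for name in shared:
        low, high = category[name]
        value = shape[name]
        # Distance outside the window, measured in window widths (0 if inside).
        gap = max(low - value, 0.0, value - high) / (high - low)
        proximity += 1.0 / (1.0 + falloff * gap)
    return coverage * proximity / len(shared)


# A made-up window for a "Flaglike" category over two microfeature dimensions.
flaglike = {
    "LECPE-BACK-EDGE-ORIENTATION": (0.2, 0.6),
    "NOTCH-DEPTH-PICLE-WIDTH-RATIO": (0.0, 0.3),
}

# Each shape carries its own set of dimensions, as described in the text.
fin_inside = {"LECPE-BACK-EDGE-ORIENTATION": 0.4,
              "NOTCH-DEPTH-PICLE-WIDTH-RATIO": 0.1,
              "LEADING-EDGE-ANGLE": 1.1}
fin_partial = {"LECPE-BACK-EDGE-ORIENTATION": 0.4}
fin_unrelated = {"CONFIG-III-TOPARC-CURVATURE": 0.7}

for fin in (fin_inside, fin_partial, fin_unrelated):
    print(membership(fin, flaglike))
```

Representing each description as a dictionary keeps the point made in the text visible in the data structure: different shapes may lack some dimensions entirely, so a score has to account for which dimensions are shared before it can measure proximity along them.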
Figure 4b shows that the representation supports evaluation of degree of similarity among shapes. Here, dorsal fins were rank ordered by a computer program that estimated similarity to a target fin (circled). This computation would be useful to a shape recognition task, in which it must be determined whether a novel shape is sufficiently similar to a known candidate shape. Of course, different shapes may be considered similar or different from one another in many different ways. Under the microfeature representation it is possible to design similarity measures differentially weighting various shape properties according to contextual or task-specific considerations. Again, as in the shape classification task, we emphasize that a rich vocabulary of shape microfeatures provides an appropriate language for expressing the criteria by which shape similarity and difference may be evaluated. Finally, Figure 4c shows that a microfeature representation supports analysis of the ways in which one shape must be deformed to make it more similar to another. A computer program drew arrows indicating that, for example, a Trout-Perch dorsal fin must be squashed and skewed to the right to make it more similar to an Anchovy dorsal fin. Because microfeatures make explicit deformation parameters of deformable shape templates, the components of deformation relating two shapes sharing some of the same microfeatures may be read off the microfeatures directly. Going in the converse direction, one may imagine that a computer graphics application could directly control the deformation of individual shape microfeatures, which would then push and shove on other microfeatures and more primitive fragments of shape to modify an object's geometry under user control. An energy-minimization approach to this facility is presented in Saund (1988a, 1988b).

Figure 4a: (1) Dotted line segregates a region of microfeature hyperspace in which "Flaglike" dorsal fins are clustered. Axes: LECPE-BACK-EDGE-ORIENTATION, CONFIG-II-TOP-CORNER-BASE-DORIENTATION, NOTCH-DEPTH-PICLE-WIDTH-RATIO. (2) Dorsal fin shapes classified into several perceptually salient categories.

Figure 4: (b) Fin shapes rank ordered by rated similarity to a target shape (circled). Panel labels include the targets Mudminnows, Killifishes, and Gars, and the categories rounded and broomstick.

Figure 4: (c) Arrows depict the deformation required to transform one dorsal fin shape into another. Magnitude and direction components were read directly off shape microfeatures. Labeled fins: Anchovies, Puffers, Cavefishes.

5 Conclusion
We have presented a "hand-built" representation for a particular class of shapes illustrating that distributed representations can be employed to advantage in symbolic computation as well as in wired connectionist networks. By designing shape microfeatures according to the structure and constraints of a target shape domain, the descriptive vocabulary provides explicit access to a large number of important geometric properties useful in classifying, comparing, and distinguishing shapes. This approach amounts to taking quite seriously Marr's principle of explicit naming (Marr 1976): If a collection of data is treated as a whole, give it a (symbolic) name so that the collection may be manipulated as a unit. This is the function fulfilled by shape microfeatures that explicitly label certain configurations of edge, corner, and region shape fragments, and this is also the function fulfilled by "hidden units" in connectionist
networks, which label certain combinations of activity in other units. The lesson our shape vocabulary adopts from the connectionist tradition is that the chunks of information for which it is useful to create explicit names can be of small grain size, and they may be quite comprehensible even if they do not correspond with the conceptual units of the casual human observer.

Appendix A Overview of the Computational Procedures

This appendix sketches the computational procedures by which a description of a shape in terms of abstract level shape microfeatures may be obtained from a binary image. In our implementation, the description of a shape exists fundamentally as a set of tokens or markers in a blackboard data structure called the Scale-Space Blackboard. This data structure provides for efficient indexing of tokens on the basis of spatial location and size; this is achieved through a stack of two-dimensional arrays, each of whose elements contains a list of tokens present within a localized region of the image, and within a range of scales or sizes. A scale-normalized measure of distance is provided to facilitate access of tokens occurring within spatial neighborhoods, where the neighborhood size is proportionate to scale. A shape token denotes the presence in the image of an image event conveyed by the token's type. Thus, a token is a packet of information carrying the following information: type, x and y location, orientation, scale (size), plus additional information depending on the token's type. Initially, a shape is described in terms of tokens of type PRIMITIVE-EDGE at some finest scale of resolution. These may be computed from a binary silhouette by performing edge detection, edge linking, contour tracing, and then by placing tokens at intervals along the contour. Next, PRIMITIVE-EDGE tokens are asserted at successively larger scales through a fine-to-coarse aggregation or grouping procedure.
Roughly, whenever a collection of PRIMITIVE-EDGES is found to align with one another at one scale, a new PRIMITIVE-EDGE may be asserted at the next larger scale. The combinatorics of testing candidate collections of tokens is managed by virtue of the spatial indexing property of the Scale-Space Blackboard: each PRIMITIVE-EDGE at one scale serves as a "seed" for a larger scale PRIMITIVE-EDGE, and only PRIMITIVE-EDGES located within a local (scale-normalized distance) neighborhood of the seed are tested for the requisite alignment. The multiscale PRIMITIVE-EDGE description of a shape serves as a foundation for additional token grouping operations leading to the assertion of EXTENDED-EDGES, FULL-CORNERS, and PARTIAL-CIRCULAR-REGIONS as pictured in Figure 1. The grouping rules are in each case based
on identifying clusters or collections of shape tokens occurring with a prescribed spatial configuration. For example, an EXTENDED-EDGE may be asserted whenever a collection of PRIMITIVE-EDGES is found to lie, within certain limits, along a circular arc; in addition to location, orientation, and scale, each EXTENDED-EDGE maintains an internal parameter denoting the curvature of the arc. The grouping rules include means for ensuring that a given image event (e.g., edge or corner) will be labeled by only one token of the appropriate type in the spatial, orientation, and scale neighborhood of the event. Instances of abstract level shape microfeatures are asserted through essentially the same kind of token grouping procedure. Each microfeature specifies an acceptable window on the spatial configuration (scale-normalized distance, relative orientation, direction, and relative scale) of a pair or triple of EXTENDED-EDGE, FULL-CORNER, and/or PARTIAL-CIRCULAR-REGION type tokens. For example, the microfeature pictured in Figure 2 demands the presence on the Scale-Space Blackboard of a FULL-CORNER token and an EXTENDED-EDGE token in roughly the proximity shown. The deformation parameter of each microfeature is typically a simple expression of one aspect of the spatial configuration and/or internal parameters of its constituents, such as relative orientation, scale-normalized distance, scale-normalized curvature (of an EXTENDED-EDGE constituent), vertex angle (of a FULL-CORNER constituent), etc., or ratios of these measures. Again, the combinatorics of testing pairs or triples of shape tokens to see whether they support a given microfeature is limited by the spatial indexing property of the Scale-Space Blackboard.
Instances of the microfeature pictured in Figure 2 are thus found by testing in turn each FULL-CORNER token on the Scale-Space Blackboard; for each FULL-CORNER tested, all of the EXTENDED-EDGE tokens within a local scale-normalized distance neighborhood are gathered up and tested for appropriate proximity. This computation scales linearly with the number of FULL-CORNER tokens present. When the image contains one or more protuberant objects such as dorsal fin shapes, the microfeatures pertaining to each may be isolated by identifying collections of microfeatures clustering appropriately in spatial location and overlapping in their support. For example, for the LECPE-BACK-EDGE-CURVATURE microfeature and the LEADING-EDGE-CURVATURE microfeature to be interpreted as pertaining to the same dorsal fin shape (see Figure 3), the FULL-CORNER constituent of the former must align with the EXTENDED-EDGE constituent of the latter, and vice versa. Typically, microfeature instances may be found in isolation at scattered locations in an image, but, because they are tailored to the morphological properties of dorsal fins, only at these shapes will microfeatures cluster and overlap one another extensively. Once the microfeatures pertaining to a given dorsal fin have been isolated, the shape of this object is interpreted in terms of the microfeatures' deformation parameters as described in the text.
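The spatially indexed pair test described above can be illustrated with a small sketch. It is not the Scale-Space Blackboard implementation: the token dictionaries, the grid cell size `CELL`, the function names, and the choice of "distance divided by the corner's scale" as the scale-normalized distance are all assumptions made for illustration. The point it shows is that each FULL-CORNER only examines EXTENDED-EDGE tokens stored in nearby grid cells, so the work grows with the number of corners rather than with the number of all possible pairs.

```python
import math
from collections import defaultdict

CELL = 8.0  # grid cell size in image units (hypothetical value)

def build_index(tokens):
    """Bucket tokens by grid cell so that nearby tokens can be looked up directly."""
    index = defaultdict(list)
    for tok in tokens:
        key = (math.floor(tok["x"] / CELL), math.floor(tok["y"] / CELL))
        index[key].append(tok)
    return index

def nearby(index, x, y, radius):
    """Yield indexed tokens whose distance from (x, y) is at most radius."""
    lo_i, hi_i = math.floor((x - radius) / CELL), math.floor((x + radius) / CELL)
    lo_j, hi_j = math.floor((y - radius) / CELL), math.floor((y + radius) / CELL)
    for i in range(lo_i, hi_i + 1):
        for j in range(lo_j, hi_j + 1):
            for tok in index.get((i, j), []):
                if math.hypot(tok["x"] - x, tok["y"] - y) <= radius:
                    yield tok

def corner_edge_pairs(corners, edges, max_norm_dist):
    """Pair each corner with the edges inside its scale-normalized neighborhood.

    Assumption: scale-normalized distance = image distance / corner scale.
    """
    index = build_index(edges)
    pairs = []
    for corner in corners:
        radius = max_norm_dist * corner["scale"]
        for edge in nearby(index, corner["x"], corner["y"], radius):
            pairs.append((corner["id"], edge["id"]))
    return pairs

corners = [{"id": "c1", "x": 0.0, "y": 0.0, "scale": 2.0}]
edges = [{"id": "e1", "x": 3.0, "y": 0.0, "scale": 2.0},
         {"id": "e2", "x": 50.0, "y": 50.0, "scale": 2.0}]
pairs = corner_edge_pairs(corners, edges, max_norm_dist=2.0)
print(pairs)
```

A real microfeature test would go on to check relative orientation, direction, and relative scale for each candidate pair; only the neighborhood gathering step is sketched here.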
Acknowledgments This paper describes work done at the MIT Artificial Intelligence Laboratory. Support for the Laboratory's research is provided in part by DARPA under ONR contract N00014-85-K-0124. The author was supported by a fellowship from the NASA Graduate Student Researchers Program.
References
Audubon Society Field Guide to North American Fishes, Whales, and Dolphins. 1983. Knopf, New York.
Hinton, G. 1989. Connectionist learning procedures. Artif. Intell. 40(1-3), 185-234.
Marr, D. 1976. Early processing of visual information. Phil. Trans. R. Soc. London B 275, 483-519.
Rosenberg, C. 1987. Revealing the structure of NETtalk's internal representations. Proc. 9th Ann. Conf. Cognitive Sci. Soc., Seattle, WA, 537-554.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Parallel Distributed Processing: Explorations in the Structure of Cognition. Bradford Books, Cambridge, MA.
Saund, E. 1988a. Configurations of shape primitives specified by dimensionality-reduction through energy minimization. Proc. 1987 AAAI Spring Symp. Ser., Palo Alto, CA, 100-104.
Saund, E. 1988b. The role of knowledge in visual shape representation. MIT AI Lab TR 1092.
Saund, E. 1989. Adding scale to the primal sketch. Proc. IEEE CVPR, San Diego, CA, 70-78.
Sejnowski, T., and Rosenberg, C. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Received 26 April 1989; accepted 15 March 1990.
Communicated by Richard Andersen
Modeling Orientation Discrimination at Multiple Reference Orientations with a Neural Network
M. Devos
G. A. Orban
Laboratorium voor Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, Campus Gasthuisberg, Herestraat, B-3000 Leuven, Belgium
We trained a multilayer perceptron with backpropagation to perform stimulus orientation discrimination at multiple references using biologically plausible values as input and output. Hidden units are necessary for good performance only when the network must operate at multiple reference orientations. The orientation tuning curves of the hidden units change with reference. Our results suggest that at least for simple parameter discriminations such as orientation discrimination, one of the main functions of further processing in the visual system beyond striate cortex is to combine signals representing stimulus and reference.

1 Introduction

Ever since the first microelectrode studies of the retina, it has been clear that our understanding of peripheral parts of the visual system (retina, geniculate, and primary cortex) progresses more rapidly than that of visual cortical areas further removed from the photoreceptors. Contributing to this lack of understanding is the fact that it is more difficult to find relevant stimuli and experimental paradigms for studying the latter areas than the former ones. Recent developments in neural networks (Lehky and Sejnowski 1988; Zipser and Andersen 1988) have suggested that such systems could assist the visual physiologist by suggesting which computations might be performed by higher order areas. There has been a recent surge of interest in the relationship between single cell response properties and behavioral thresholds (Bradley et al. 1987; Vogels and Orban 1989). The question at hand is how much additional processing is required to account for the behavioral thresholds beyond the primary cortical areas in which cells were recorded. Most investigators have concluded that processing beyond striate cortex (V1) should achieve invariance of the signals arising from V1.
Neural Computation 2, 152-161 (1990) © 1990 Massachusetts Institute of Technology

At that level single cell responses do indeed depend on many parameters including the object of discrimination, although discrimination itself is invariant for random changes in irrelevant parameters in humans (Burbeck and Regan 1983; Paradiso et al. 1989) as well as in animal models (De Weerd, Vandenbussche, and Orban unpublished; Vogels and Orban unpublished). We have investigated this question by using a feedforward, three-layer perceptron to link single cell orientation tuning and just noticeable differences in orientation, both measured in the behaving monkey. We have used as inputs units with orientation tuning similar to that of striate neurons recorded in the awake monkey performing an orientation discrimination task (Vogels and Orban 1989). In this task, a temporal same-different procedure was used. Two gratings were presented in succession at the same parafoveal position while the monkey fixated a fixation target. If the two gratings had the same orientation the monkey had to maintain fixation. If they differed in orientation, the monkey had to saccade to the grating. The network was trained by backpropagation to achieve the same discrimination performance as the monkey. The network decided whether the stimulus presented to the input units was tilted clockwise or anticlockwise from the reference orientation. This discrimination procedure, an identification procedure, was easier to implement in a three-layer perceptron than the temporal same-different procedure used in the animal experiments. Human psychophysical studies have shown that different psychophysical procedures used to determine discrimination thresholds yield very similar thresholds (Vogels and Orban 1986). Initially (Orban et al. 1989), we trained and tested the network only at one reference orientation. This study revealed little about further cortical processing, since no hidden units were necessary to achieve optimal performance. In the present study we show that when the network is trained and tested at multiple reference orientations, hidden units are required, and we have studied the properties of these hidden units to make predictions for neurophysiological experiments.

2 The Model
The number of input and hidden units of the multilayer perceptron was variable, but only two output units were used, corresponding to the two perceptual decisions, tilted clockwise or counterclockwise from reference. Input units either represented the stimulus (stimulus units) or the reference orientation (reference units). Stimulus units (Fig. 1A) had Gaussian orientation tunings modeling the tunings recorded in the discriminating monkey (Vogels and Orban 1989) and their variability. Typically 10-40 units were equally spaced over the stimulus orientation range (180°). There were as many reference units as reference orientations and their tuning curve was an impulse function, that is, their value was maximum when a given reference was trained or tested but otherwise minimum. Hidden and output units were biased, that is, a constant was added to each unit's inputs before computing the unit's output. Determining this constant value was part of the learning. Both during training and testing
Figure 1: Network without hidden units: orientation tuning curves plotting each unit's activity as a function of stimulus orientation of input units (A) and output units (B-D) and resulting psychometric curves plotting percentage correct decisions as a function of stimulus orientation (E and F). The network was trained and tested for one reference orientation (B and E), three references (C and F), and five references (D). In A-D the full line indicates the average tuning and the stippled lines one standard deviation around the mean. The 10 input units had Gaussian tuning with a SD of 19°, and a response strength (activity at optimal orientation) of 110 spikes/sec.
the orientations presented to the network were more closely spaced near the reference(s) than those further away. The network was trained with a variant of backpropagation (Devos and Orban 1988). Training was terminated when the orientation performance curve, plotting percentage correct discrimination as a function of orientation, remained stable for five successive training tests. From these curves just noticeable differences (jnds) were estimated by taking the average orientation difference corresponding to 75% correct for both sides of the reference orientation. The jnds given for each set of network parameters (for example, the number of stimulus or hidden units) are averages of 50 tests, since for each set of parameters the training was repeated five times and each configuration obtained after training was tested 10 times.

3 Hidden Units Are Necessary to Discriminate at Multiple References
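The jnds compared throughout this section are read off psychometric curves at the 75% correct level on both sides of the reference and then averaged. A minimal sketch of that estimate is given below. The linear interpolation between sampled orientations, the function names, and the sample data are assumptions for illustration; the source does not state how the 75% point was interpolated.

```python
def threshold_crossing(offsets, percent_correct, level=75.0):
    """Offset (deg) at which performance first reaches `level` on one side.

    offsets: increasing distances from the reference orientation.
    percent_correct: percentage correct measured at each offset.
    Linear interpolation between samples is an assumption of this sketch.
    """
    points = list(zip(offsets, percent_correct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < level <= y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("curve never crosses the criterion level")

def jnd(offsets, pc_clockwise, pc_anticlockwise, level=75.0):
    """Average the criterion offsets found on the two sides of the reference."""
    cw = threshold_crossing(offsets, pc_clockwise, level)
    acw = threshold_crossing(offsets, pc_anticlockwise, level)
    return 0.5 * (cw + acw)

# Made-up psychometric data, for illustration only.
offsets = [0.5, 1.0, 2.0, 4.0, 8.0]
pc_cw = [55.0, 60.0, 75.0, 95.0, 100.0]
pc_acw = [52.0, 58.0, 70.0, 90.0, 100.0]
estimate = jnd(offsets, pc_cw, pc_acw)
print(estimate)
```

In the study each reported jnd is an average over 50 such tests (five training runs, each tested 10 times), so a function like `jnd` would be called once per test and the results averaged.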
As previously mentioned, no hidden units are necessary for discrimination at a single reference (Orban et al. 1989). In this case, the output units exhibit a sharp change in activity at the reference orientation (Fig. 1B) and the “psychometric curve” has a single narrow symmetrical dip at the reference orientation (Fig. 1E). Increasing the number of references to three induces multiple slopes in the orientation tuning of the output unit (Fig. 1C). Since these slopes are less steep than for a single reference, the dips in the psychometric curve are somewhat wider, yielding larger jnds (Fig. 2), and the psychometric curves have additional dips that do not correspond to references (Fig. 1F). Hence the jnds represent the network performance only for orientations close to one of the three references. When the number of references is increased to five, the tuning curve of the output units is nearly flat and the network can no longer learn the problem (Fig. 1D). The network of Figure 1 had no hidden units and only 10 stimulus units. Increasing the number of stimulus units to 40 improves the performance for two, three, and especially four references (Fig. 2), but the learning still fails at five references. Introducing four or eight hidden units immediately solves the problem (Fig. 2). As shown in Figure 3, the output units again display a steep change in activity at each reference orientation, but this can be achieved only by a change in orientation tuning of the output units with change in reference (Fig. 3).

4 Orientation Tuning of Hidden Units Changes with Reference
As shown in Figure 3, not only the tuning curve of output units, but also that of hidden units changes with reference orientation. Analysis of the different networks obtained for a wide range of conditions (10 to 40 input units, 2 to 8 hidden units, 2 to 5 references) reveals that the strategy of the network is always the same. Output units need to have pulse-like tuning
Figure 2: Network performance with and without hidden units. The jnd in orientation (see text) of the network is plotted as a function of number of references: crosses, no hidden units and 10 input units; triangles, 40 input units and no hidden units; squares, 40 input units and four hidden units; dots, 40 input units and 8 hidden units.
Figure 3: Orientation tuning curves of output (A and B) and hidden units (C-H) of a network performing at two references: 45° (A,C,E,G) and 90° (B,D,F,H). Same conventions as in Figure 1.

curves with 90° periodicity to yield a psychometric curve with a single narrow dip at the reference. These 90° wide pulse curves are synthesized from the curves of the hidden units, which also have steep parts in their tuning curve, but which are not necessarily separated by 90°. With a change in reference the curves of the hidden units also change so that
Figure 4: Orientation tuning curves of an output unit (A) and a hidden unit (B) in a network with four hidden units in which the reference units are directly connected to the output units. Same conventions as in Figure 1.
different "building blocks" become available to synthesize the 90° wide pulse curves of the output units. These changes in the hidden units' tuning curve with changes in reference are obtained by giving very high weights to the connections between reference units and hidden units. Hence, turning a reference unit on or off will shift the "working slopes" of the hidden units. This scheme suggests that the number of hidden units required will increase as the number of references increases. In our study, which used a maximum of five references, a network with eight hidden units did not perform better than one with four hidden units (Fig. 2). One could argue that the change in orientation tuning of the hidden units is not an essential feature of the network, but a trivial consequence of the connection between the reference units and the hidden units. Therefore we devised a network in which reference units were directly connected to the output units and trained this network in orientation discrimination at five references. Although the result was slightly better than with no hidden units, the network performed poorly. The output units' tuning curves displayed multiple weak slopes (Fig. 4), which yielded psychometric curves with multiple dips as in the case of a network without hidden units and fewer references (e.g., 3, Fig. 1). Of course, this problem could be treated by adding a second layer of hidden units connected with the reference units. Hence, a connection between the last hidden unit layer and the reference units is essential for good network performance.
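The mechanism described above, in which a strongly weighted reference unit shifts the "working slope" of a hidden unit, can be illustrated with a single hand-built sigmoid unit. This is only a sketch under stated assumptions: the weights are picked by hand rather than learned by backpropagation, the preferred orientations and the non-circular Gaussian tuning are simplifications, and all names (`W_STIM`, `W_REF`, `working_point`) are hypothetical. Only the tuning SD of 19° and the peak rate of 110 spikes/sec come from the Figure 1 caption.

```python
import math

SIGMA = 19.0   # tuning SD in degrees (Figure 1 caption)
PEAK = 110.0   # response strength in spikes/sec (Figure 1 caption)
PREFS = [9.0 + 18.0 * k for k in range(10)]  # 10 preferred orientations (assumed spacing)

def stimulus_units(theta):
    """Gaussian-tuned stimulus units; wrap-around at 0/180 deg is ignored here."""
    return [PEAK * math.exp(-(theta - p) ** 2 / (2.0 * SIGMA ** 2)) for p in PREFS]

# Hand-picked stimulus weights: net input grows with orientation around 90 deg.
W_STIM = [4.0 * (p - 90.0) / 90.0 / PEAK for p in PREFS]
# Large weights from two reference units to the hidden unit (assumed values).
W_REF = [3.0, -3.0]

def hidden_activity(theta, reference):
    """Sigmoid hidden unit driven by the stimulus units and one active reference unit."""
    ref_units = [1.0 if k == reference else 0.0 for k in range(len(W_REF))]
    net = sum(w * r for w, r in zip(W_STIM, stimulus_units(theta)))
    net += sum(w * r for w, r in zip(W_REF, ref_units))
    return 1.0 / (1.0 + math.exp(-net))

def working_point(reference):
    """First orientation (deg, 1-deg grid) at which the unit reaches half activity."""
    for theta in range(0, 181):
        if hidden_activity(theta, reference) >= 0.5:
            return theta
    return None

shift_a = working_point(0)
shift_b = working_point(1)
print(shift_a, shift_b)
```

Switching from the first to the second reference unit moves the steep part of the tuning curve from below 90° to above it, without any change in the stimulus weights. A trained network would of course find its own weights; the sketch only shows why large reference-to-hidden weights are sufficient to relocate the slope.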
5 Conclusions

Others, such as Crick (1989), have pointed out that backpropagation networks are not adequate models of brain function. They are used here merely as proof of existence in a way analogous to Lehky and Sejnowski (1988) and Zipser and Andersen (1988). It could however be argued that contrary to those studies, the properties of the hidden units we obtained are physiologically unrealistic. Orientation tuning curves are generally found to be bell shaped (Orban 1989 for review) and their selectivity is generally summarized by their bandwidth. The few studies devoted to orientation tuning of higher order cortical areas have compared bandwidths (see Orban 1989 for review). Tuning curves with multiple peaks have nonetheless been reported for a few V1 units (De Valois et al. 1982; Vogels and Orban 1989) and for a number (10/147) of V3 units (Felleman and Van Essen 1987). The hidden unit of Figure 4 had a tuning curve with two peaks separated by somewhat less than 90° and which can be described as multipeak. It may be that the multipeak tuning curves reported in the physiological literature in fact are extremes of tuning curves with steep slopes that allow sharp discriminations. Our results also suggest that the brain combines signals from the stimulus and from the reference orientations. Recent physiological studies (Haenny et al. 1988) suggest that some V4 units may encode the reference orientation in a matching to sample task. On the other hand, lesions of IT impair orientation discrimination (Dean 1978). Hence, it seems that for both simple discriminations and for more difficult object recognition, the pathway from V1 to IT is required. This pathway is relayed by V4 where orientation-selective units have been described (Desimone and Schein 1987). Hence a convergence of stimulus and reference signals is possible either in V4 or in IT.
The latter structure could then correspond to the last step in visual processing before the decision (i.e., last layer of hidden units); indeed IT has extensive connections with limbic and frontal cortex areas (for review see Rolls 1985). The units selective for reference orientation in the sample to match tasks used by Haenny et al. (1988) were relatively broadly tuned, suggesting a distributed coding as is the case for stimulus orientation. We have used a local value coding for the reference and it remains to be seen whether or not a different encoding scheme for the reference will yield a different interaction between reference and stimulus at the level of the hidden units. However, the present simulations suggest that the contribution of higher order cortical areas in orientation discrimination should be investigated at multiple reference orientations, and that the neurons should have steep slopes in their tunings, possibly changing with reference. The strongest prediction derived from these simulations is that cortical neurons should change their orientation tuning curves for the test stimulus based on a previously displayed reference stimulus. This change could either be a shift in the slopes of the tuning curve or a
switch from tuning to nontuning depending on the reference orientation. We are presently testing these predictions in IT of monkeys performing an orientation discrimination task.
Acknowledgments

The technical help of P. Kayenbergh, G. Vanparrijs, and G. Meulemans as well as the typewriting of Y. Celis is kindly acknowledged. This work was supported by Grant RFO/AI/01 from the Belgian Ministry of Science to GAO.
References
Bradley, A., Skottun, B. C., Ohzawa, I., Sclar, G., and Freeman, R. 1987. Visual orientation and spatial frequency discrimination: A comparison of single neurons and behavior. J. Neurophysiol. 57, 755-772.
Burbeck, C. A., and Regan, D. 1983. Independence of orientation and size in spatial discriminations. J. Opt. Soc. Am. 73, 1691-1694.
Crick, F. 1989. The recent excitement about neural networks. Nature (London) 337, 129-132.
Dean, P. 1978. Visual cortex ablation and thresholds for successively presented stimuli in rhesus monkeys: I. Orientation. Exp. Brain Res. 32, 445-458.
Desimone, R., and Schein, S. J. 1987. Visual properties of neurons in area V4 of the macaque: Sensitivity to stimulus form. J. Neurophysiol. 57, 835-868.
De Valois, R. L., Yund, E. W., and Hepler, N. 1982. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res. 22, 531-544.
Devos, M. R., and Orban, G. A. 1988. Self-adapting back-propagation. Proc. Neuro-Nîmes 104-112.
Felleman, D. J., and Van Essen, D. C. 1987. Receptive field properties of neurons in area V3 of macaque monkey extrastriate cortex. J. Neurophysiol. 57, 889-920.
Haenny, P. E., Maunsell, J. H. R., and Schiller, P. H. 1988. State dependent activity in monkey visual cortex: II. Retinal and extraretinal factors in V4. Exp. Brain Res. 69, 245-259.
Lehky, S. R., and Sejnowski, T. J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Orban, G. A. In press. Quantitative electrophysiology of visual cortical neurons. In Vision and Visual Dysfunction, The Electrophysiology of Vision, Vol. 5, A. Leventhal, ed., Macmillan, New York.
Orban, G. A., Devos, M. R., and Vogels, R. In press. Cheapmonkey: Comparing ANN and the primate brain on a simple perceptual task: Orientation discrimination. Proc. NATO ARW.
Paradiso, M. A., Carney, T., and Freeman, R. D. 1989. Cortical processing of hyperacuity tasks. Vision Res. 29, 247-254.
Rolls, E. T. 1985. Connections, functions and dysfunctions of limbic structures, the pre-frontal cortex and hypothalamus. In The Scientific Basis of Clinical Neurology, M. Swash and C. Kennard, eds., pp. 201-213. Churchill Livingstone, London.
Vogels, R., and Orban, G. A. 1986. Decision factors affecting line orientation judgments in the method of single stimuli. Percept. Psychophys. 40, 74-84.
Vogels, R., and Orban, G. A. 1989. Orientation discrimination thresholds of single striate cells in the discriminating monkey. Soc. Neurosci. Abstr. 15, 324.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679-684.
Received 25 September 1989; accepted 23 January 1990.
Communicated by Ralph Linsker
Temporal Differentiation and Violation of Time-Reversal Invariance in Neurocomputation of Visual Information D. S. Tang Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
V. Menon Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712 USA
Information-theoretic techniques have been employed to study the time-dependent connection strength of a three-layer feedforward neural network. The analysis shows that (1) a time-dependent receptive field that performs temporal differentiation emerges naturally and (2) this result is a consequence of a mechanism based on violation of the time-reversal invariance in the visual information processing system. Both analytic and numerical studies are presented.

1 Introduction
A synthetic three-layer feedforward neural network is proposed and shown to have the ability to detect time-dependent changes in the input signal. Using a few physical assumptions about the properties of the transmission channels, we deduce purely within the information-theoretic framework (Shannon and Weaver 1949) that temporal differentiation is based on the violation of time-reversal invariance of the information rate. These results may be relevant to the study of early visual information processing in the retina and the construction of artificial neural networks for motion detection. An earlier information-theoretic study (Linsker 1989) with a different model has indicated that the cell's output is approximately a linear combination of a smoothed input and a smoothed first time derivative of the input. Here, we show analytically how the temporal differentiation emerges. Section 2 discusses the information-theoretic formalism of time-dependent neural nets in general terms. A simple three-layer feedforward network is introduced. Analytic solutions to the time-dependent transfer function are derived in Section 3, which are followed by a study of the properties with further numerical calculations in Section 4. Section 5 describes the input-output relations between the layers. Section 6 summarizes the main results.

Neural Computation 2, 162-172 (1990) © 1990 Massachusetts Institute of Technology
2 Time-Dependent Information-Theoretic Formalism
Consider a three-layer feedforward network (Linsker 1986). Each layer of neurons is assumed to be two-dimensional in the X-Y plane spatially. Light signals with variable intensity are assumed to be incident on the first layer, layer A, in the Z-direction. Activities of the layer A neurons are relayed to the second layer, layer B. The output activities in layer B are then sent to layer C neurons. Below we specify the assumptions of our model.

1. Each output neuron at location $r$ is locally connected to its input neurons in the previous layer. The input neurons are spatially distributed according to the gaussian distribution density $\rho(R) = C_\rho \exp(-R^2/2a^2)$ with $C_\rho = N/2\pi a^2$. $N$ is the total number of input neurons, $a^2$ is the variance, and $R$ is measured relative to $r$. All spatial vectors are two-dimensional in the X-Y plane.

2. The output signal, $Y_i(t)$, of the $i$th layer B neuron is assumed to be linear,

$$Y_i(t) = \sum_{j \in (i)} \sum_{\tau = t-\delta}^{t} C_0\, e^{-b(t-\tau)}\, X_j(\tau) \qquad (2.1)$$

Each connection has a constant weight $C_0$ modulated by a microscopic time delay factor $\exp[-b(t-\tau)]$ as in an RC circuit. This time delay models a physical transmission property of the channel, which possesses a simple form of memory, i.e., past signals persist in the channel for a time interval of the order $1/b$. Here, $b$ is the reciprocal of the decay time constant. The temporal summation is from a finite past time $t - \delta$ to the present time $t$. From the causality principle, no future input signals $X_j(\tau)$, $\tau > t$, contribute to the present output signal $Y_i(t)$. It is assumed that the time scale $\delta \gg 1/b$ is satisfied. The index $(i)$ in the spatial summation means that the $N$ input neurons are randomly generated according to the density $\rho$ relative to the location $r_i$ of the $i$th output neuron in accordance with assumption 1. The stochastic input signals $X_j(\tau)$ are assumed to satisfy the a priori probability distribution function

$$P(X) = \left[(2\pi)^{\bar N} \det B\right]^{-1/2} \exp\Big\{-\tfrac{1}{2} \sum_{j,\tau}\sum_{j',\tau'} [X_j(\tau) - \bar X]\,(B^{-1})^{\tau\tau'}_{jj'}\,[X_{j'}(\tau') - \bar X]\Big\} \qquad (2.2)$$

which is a gaussian and statistically independent, $B^{\tau\tau'}_{jj'} = \sigma^2 \delta_{jj'}\delta_{\tau\tau'}$. Here, $\bar N$ is the total number of distinct space-time labels of $X_j(\tau)$. $\bar X$ is the mean of $X_j(\tau)$.
3. The output signal, $Z_m(t)$, of the $m$th neuron in layer C is assumed to be

$$Z_m(t) = \sum_{i=1}^{N} \sum_{\tau = t-\delta}^{t} \mathcal{H}_i(\tau)\, Y_i(\tau) \qquad (2.3)$$

where the transfer function $\mathcal{H}_i(\tau)$ satisfies the constraints (1) $\mathcal{H}^T \mathcal{H} = A_0$ in matrix notation, and (2) $\left[\sum_{i=1,\ldots,N;\ \tau=t-\delta,\ldots,t} \mathcal{H}_i(\tau)\right]^2 = A_1$. $A_0$ and $A_1$ are real constants. The first constraint is on the normalization of the transfer function. The second constraint restricts the value of the statistical mean of the transfer functions up to a sign. The resultant restriction on the transfer functions when these two constraints are considered together is a net constraint on the variance of the transfer functions since the variance is directly proportional to $A_0 - A_1$. In the far-past, $\tau < t - \delta$, and the future, $\tau > t$, $\mathcal{H}_i(\tau)$ is set to zero. These transfer functions will be determined by maximizing the information rate in the next section. Here, an additive noise $n_i(\tau)$ added to $Y_i(\tau)$ [i.e., $Y_i(\tau) \to Y_i(\tau) + n_i(\tau)$] in equation (2.3) is assumed for the information-theoretic study. The noises $n_i(\tau)$ indexed by the location label $i$ and the time label $\tau$ are assumed to be statistically independent and satisfy a gaussian distribution of variance $\beta^2$ with zero mean. Below we derive the information-theoretic equation that characterizes the behavior of the transfer function $\mathcal{H}_i(\tau)$. The information rate for the signal transferred from layer B to layer C can be shown to take the form of equation (2.4),
with $W$ being the spatiotemporal correlation matrix $\langle Y_i(\tau) Y_j(\tau') \rangle$. From equations (2.1) and (2.2), expressions can be derived for the correlation matrix (2.5) and its spatial factor $Q_{ij}$ (2.6), together with the temporal factor

$$G_{\tau\tau'} = e^{-b|\tau-\tau'|}\,\theta(\delta - |\tau-\tau'|) \qquad (2.7)$$

Here, $\theta$ denotes the Heaviside step function. To arrive at equation (2.7), terms of order $\exp(-b\delta)$ have been neglected as the time scale assumption $\delta b \gg 1$ is invoked. $h$ is a constant independent of space and time. Its explicit form is irrelevant in subsequent discussions, as will be shown below.
Now, the method of Lagrange multipliers is employed to optimize the information rate. This is equivalent to performing the following variational calculation with respect to the transfer function:

$$\delta\Big\{ \mathcal{H}^T W \mathcal{H} + k_2\Big(\Big[\sum_{i,\tau} \mathcal{H}_i(\tau)\Big]^2 - A_1\Big)\Big\} = 0 \qquad (2.8)$$

Here, $k_2$ is a Lagrange multiplier. It is a measure of the rate of change of the signal with respect to the variation of the value of the overall transfer function. The variational equation above produces the following eigenvalue problem governing the behavior of the transfer function:

$$\sum_{j,\tau'} \left[h\,Q_{ij}\,G_{\tau\tau'} + k_2\right] \mathcal{H}_j(\tau') = \lambda\,\mathcal{H}_i(\tau) \qquad (2.9)$$

This equation defines the morphology of the receptive field of the neurons in layer C in space and time. For simplicity, one can absorb the multiplicative factor $h$ in $\lambda$ and $k_2$. We assume that this is done by setting $h = 1$.

3 Analytic Expressions of the Transfer Function
In continuous space and time variables, the eigenvalue problem is reduced with the use of the time-translational invariance of the temporal correlation function (equation 2.7) to the homogeneous Fredholm integral equation

$$\lambda\,\mathcal{H}(r,\tau) = \int_{-\infty}^{\infty} dR \int_{-\delta/2}^{\delta/2} d\tau'\, \rho(R)\,\left[Q(r-R)\,G(\tau-\tau') + k_2\right] \mathcal{H}(R,\tau') \qquad (3.1)$$
Here, $Q(r-R)$ and $G(\tau-\tau')$ are the continuous version of $Q_{ij}$ and $G_{\tau\tau'}$ of equations (2.6) and (2.7), respectively. Note that the kernel respects both space- and time-reversal invariances. It means that the kernel itself does not have a preferential direction in time and space. Note also that the information rate also respects the same symmetry operation. However, if it turns out that the solutions $\mathcal{H}(r,\tau)$ do not respect some or all of the symmetries of the information rate, a symmetry-breaking phenomenon is then said to appear. This can have unexpected effects on the input-output relationship in equation (2.3). In particular, the violation of time-reversal invariance underlies temporal differentiation, the ability to detect the temporal changes of the input signal. This will be discussed in the next section. The solutions to this integral equation can be found by both the Green's function method and the eigenfunction expansion method (Morse
and Feshbach 1953). They can be classified into different classes according to their time-reversal and space-reversal characteristics, $\mathcal{H}(r,\tau) = \mathcal{H}(-r,-\tau)$, $\mathcal{H}(r,\tau) = -\mathcal{H}(-r,\tau)$, $\mathcal{H}(r,\tau) = \mathcal{H}(r,-\tau)$, ..., etc. In this study, we are interested only in the solutions that correspond to the two largest eigenvalues (i.e., the larger one of these two has the maximum information rate). They are as follows.
3.1 Symmetric Solution. This solution obeys both space- and time-reversal invariance. The transfer function of equation (3.2) is nonzero only for $-\delta/2 \le \tau \le \delta/2$, with $\mathcal{H}(r,\tau) = 0$ otherwise. Its parameters are $\Delta = \delta/2$, $D = 1.244$, and $\eta = \lambda/1.072\,C_\rho \pi a^2$, together with the relation (3.3), in which $\gamma$ is determined by (3.4). In equation (3.2), $h_0$ is the normalization constant. The eigenvalue is determined from $\eta$ by equation (3.5).

Note that all the space- and time-dependent terms in equation (3.2) are even functions of both space and time. In obtaining these equations, we have used in the eigenfunction expansion the fact that the iteration equation $2a^2/\sigma_{n+1}^2 = 1 - 1/(3 + 2a^2/\sigma_n^2)$ with $\sigma_0^2 = 3a^2$ converges rapidly to $2a^2/\sigma_\infty^2$ so that the approximation $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \cdots = \sigma_\infty^2 = 2.7320a^2$ is accurate with error less than one percent. Details may be found in Tang (1989).
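The convergence claim for the width iteration is easy to check numerically. The sketch below iterates $x_{n+1} = 1 - 1/(3 + x_n)$ with $x_n = 2a^2/\sigma_n^2$. The starting value $\sigma_0^2 = 3a^2$ (that is, $x_0 = 2/3$) is our reading of the source and should be treated as an assumption, as should the function name. The fixed point solves $x^2 + 2x - 2 = 0$, so $x = \sqrt{3} - 1$ and $\sigma_\infty^2/a^2 = 2/(\sqrt{3} - 1) = 1 + \sqrt{3}$.

```python
import math

def sigma_sequence(n_steps, x0=2.0 / 3.0):
    """Iterate x_{n+1} = 1 - 1/(3 + x_n), with x_n = 2 a^2 / sigma_n^2.

    Returns the list of sigma_n^2 / a^2 for n = 0 .. n_steps.
    x0 = 2/3 corresponds to the assumed start sigma_0^2 = 3 a^2.
    """
    xs = [x0]
    for _ in range(n_steps):
        xs.append(1.0 - 1.0 / (3.0 + xs[-1]))
    return [2.0 / x for x in xs]

ratios = sigma_sequence(20)
limit = 2.0 / (math.sqrt(3.0) - 1.0)   # closed-form fixed point, sigma_inf^2 / a^2
first_step_error = abs(ratios[1] - limit) / limit
print(ratios[:4], limit, first_step_error)
```

Under this starting value, already the first iterate lies within one percent of the limit, which is consistent with the accuracy statement in the text.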
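3.2 Time-Antisymmetric-Space-Symmetric Solution. The transfer function is

$$\mathcal{H}(r,\tau) = h_1\, e^{-r^2/2\sigma_\infty^2} \sin\gamma\tau, \quad -\delta/2 \le \tau \le \delta/2; \qquad \mathcal{H}(r,\tau) = 0, \quad \text{otherwise} \qquad (3.6)$$

The eigenvalue $\lambda$ is given by equation (3.7), which involves the factor $0.268(3 + 2a^2/\sigma_\infty^2)$ and a temporal eigenvalue $\lambda_T$ determined by $\lambda_T = 2b/(b^2 + \gamma^2)$, in which $\gamma$ satisfies the following self-consistent equation

$$\gamma = -b \tan \Delta\gamma \qquad (3.8)$$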
$h_1$ is the normalization constant.

4 Properties of the Solution to the Eigenvalue Problem

The behavior of the eigenvalue of the symmetric solution (equation 3.5) is shown as the curved line in Figure 1. In contrast, the eigenvalues of the antisymmetric solution (equation 3.7) are independent of the parameter $k_2$ since the integral $\int_{-\delta/2}^{\delta/2} \sin\gamma\tau\, d\tau$ is identically zero. Therefore, the eigenvalue for the antisymmetric case remains constant as $k_2$ varies, as depicted by the horizontal line in Figure 1. The statement that the eigenvalue does not depend on $k_2$ is true for any eigenfunctions that are antisymmetric with respect to either time- or space-reversal operation. The interesting result is that for positive and not too negative $k_2$ values, the eigenvalue of the symmetric solution is the largest. That is, the transfer function producing the maximum information rate respects the symmetry (time- and space-reversal invariances) of the information rate. However, as $k_2$ becomes sufficiently negative, the time-antisymmetric solution has the largest eigenvalue since the maximum symmetric eigenvalue decreases below that of the time-antisymmetric solution. In this regime, the symmetry breaking of the time-reversal invariance occurs. Within the range of arbitrarily large positive value of $k_2$ to the point of the transition of the symmetry breaking of the time-reversal invariance, parity conservation is respected. The transfer functions at the center of the receptive field plotted against time for different values of $k_2$ are shown in Figure 2. Transfer functions located farther from the center show similar behavior except for a decreasing magnitude, as a result of the spatial gaussian decay. It is found that no spatial center-surround morphology corresponding to maximum information rate appears for any given time in the $k_2$ regime we are considering.
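The temporal part of the antisymmetric solution can be checked numerically. The sketch below solves the self-consistent equation (3.8) by bisection, evaluates $\lambda_T = 2b/(b^2 + \gamma^2)$, and confirms by direct quadrature that $\sin\gamma\tau$ is reproduced by the exponential kernel $e^{-b|\tau - \tau'|}$ on the window $[-\delta/2, \delta/2]$. The parameter values are taken from the figure captions ($1/b = 5$ msec and a window running from $-50$ to $50$ msec); the function names and the quadrature settings are assumptions of this sketch.

```python
import math

def solve_gamma(b, half_width):
    """Smallest positive root of gamma = -b * tan(half_width * gamma).

    With u = half_width * gamma, the root lies in (pi/2, pi), where
    f(u) = u / half_width + b * tan(u) increases from -inf to a positive value.
    """
    def f(u):
        return u / half_width + b * math.tan(u)
    lo, hi = math.pi / 2.0 + 1e-9, math.pi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / half_width

def kernel_applied(tau, b, gamma, half_width, n=4000):
    """Midpoint-rule integral of exp(-b|tau - s|) * sin(gamma * s) over the window."""
    ds = 2.0 * half_width / n
    total = 0.0
    for k in range(n):
        s = -half_width + (k + 0.5) * ds
        total += math.exp(-b * abs(tau - s)) * math.sin(gamma * s) * ds
    return total

b = 1.0 / 5.0        # reciprocal of the 5 msec time constant
half_width = 50.0    # Delta = delta / 2, with delta = 100 msec
gamma = solve_gamma(b, half_width)
lam_t = 2.0 * b / (b * b + gamma * gamma)

tau = 20.0
lhs = kernel_applied(tau, b, gamma, half_width)   # kernel acting on sin(gamma * s)
rhs = lam_t * math.sin(gamma * tau)               # eigenvalue times eigenfunction
print(gamma, lam_t, lhs, rhs)
```

With these parameters $\lambda_T$ comes out close to 9.24, the eigenvalue quoted for curve d in the caption of Figure 2. That agreement suggests, though this sketch does not establish it, that the spatial factor in equation (3.7) is normalized to one for the plotted case. Because $\sin\gamma\tau$ integrates to zero over the symmetric window, adding the constant $k_2$ to the kernel leaves this result unchanged.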
Figure 1: The maximum eigenvalues for the symmetric and time-antisymmetric solutions plotted against the Lagrange multiplier k2. Here, the time constant 1/b is arbitrarily chosen as 5 msec. (Curve labels in the plot: time-symmetric solution; time-antisymmetric solution.)
Figure 2: Time-dependent properties of the temporal transfer function. The present time is arbitrarily chosen to be at 50 msec. The time in the far-past is at -50 msec. Curves a, b, and c depict the symmetric transfer functions with eigenvalues 10.2, 9.5, and 9, respectively. Curve d is the time-antisymmetric transfer function with eigenvalue equal to 9.24. The time constant 1/b is 5 msec. (Horizontal axis: time in ms.)
5 Input-Output Relations
From the results of the last section, the output events can naturally be classified into time-symmetric or time-antisymmetric events. The time-symmetric/antisymmetric events are defined as the set of output activities obtained by the linear input-output relationship having a time-reversal invariant/noninvariant transfer function. They are statistically independent of each other, from equations (2.3) and (2.9):

$$\langle Z^{\mathrm{symmetric}}(t)\, Z^{\mathrm{antisymmetric}}(t) \rangle = 0 \qquad (5.1)$$
The symmetric transfer function (equation 3.2) does not discriminate the inputs in the near-past [0, δ/2] from those in the far-past [-δ/2, 0]. The output event duplicates as much of the input signal as is allowed in this three-layer feedforward network operating in an optimal manner within the information-theoretic framework. This is illustrated by curves a, b, and c in Figure 3. In obtaining these figures, the light signal impinging on layer A is assumed to be a stationary light spot modeled by a step function θ(t), that is, it is off for time t ≤ 0 and on for time t ≥ 0. The curve labeled layer B output is a typical output signal of the response of an RC circuit with a step function input (equation 2.1). These results suggest that the input-output relation with the symmetric transfer function (equation 3.2) in this k2 regime operates in the information-relaying mode. It acts as a passive relay. The speed of the response is determined by the width δ of the time window: the shorter the width, the faster the response.

The behavior of the input-output relation with the time-antisymmetric transfer function (equation 3.6) is totally different. It is a simple form of temporal differentiation, as illustrated by curve d in Figure 3. A temporally constant input signal produces zero output. Note that the peak of the output is delayed by 0.5δ relative to the time of the fastest change in the input function. This temporal differentiation is not identical to the time derivative in calculus, even though both can detect temporal changes in the input signal and both are time-antisymmetric operations.

These transfer functions process the input signals to form the output differently. In the regime of time-reversal invariance, the transfer functions (equation 3.2) simply relay the input to the output without actively processing it (curves a, b, and c in Figure 3). However, in the regime of time-reversal noninvariance, the transfer functions (equation 3.6) have the capability to extract the temporal changes of the input signals (curve d in Figure 3).

Figure 3: The input-output relations. The input to layer B is a stationary and local light spot defined by θ(t). Curves a, b, and c are layer C outputs that correspond to the symmetric transfer functions a, b, and c in Figure 2, respectively. Curve d is the layer C output with the time-antisymmetric transfer function corresponding to curve d in Figure 2. Note that only curve d signals the temporal change in the incoming signal. (Plot axes: layer C output versus time in ms.)

6 Conclusions

We have analyzed, from information-theoretic considerations, how a simple synthetic three-layer, feedforward neural network acquires the ability to detect temporal changes. Symmetry breaking in time-reversal invariance has been identified as the source of such an ability in our model. Furthermore, the symmetry classes to which the transfer function belongs define the distinct categories of the temporal events in the output sample space. We summarize below the main results of this paper.

1. The persistence of signals in the network is an important aspect of temporal signal processing. In the present study we have modeled this as a channel with memory (equation 2.1) and it underlies the results obtained.

2. Eigenvectors of the constrained spatial-temporal correlation function are the transfer functions of the three-layer, feedforward neural network model studied here.

3. The symmetries that the eigenvectors respect define the classes of the output events.

4. There are two modes of temporal information processing: one is the information-relaying mode defined by the time-symmetric transfer function (equation 3.2) and the other is the information-analyzing mode defined by the time-antisymmetric transfer function (equation 3.6).

5. Realization of the information-analyzing mode is done by a symmetry-breaking mechanism.

6. The breaking of the time-reversal invariance leads to temporal differentiation.
References

Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21(3), 105.
Linsker, R. 1989. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., p. 186. Morgan Kaufmann, San Mateo, CA.
Morse, P. M., and Feshbach, H. 1953. Methods of Theoretical Physics, Vol. 1. McGraw-Hill, New York.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. Univ. of Illinois Press, Urbana.
Tang, D. S. 1989. Information-theoretic solutions to early visual information processing: Analytic results. Phys. Rev. A 40, 6626.

Received 16 October 1989; accepted 23 February 1990
Communicated by Ralph Linsker
Analysis of Linsker's Simulations of Hebbian Rules

David J. C. MacKay
Computation and Neural Systems, Caltech 164-30 CNS, Pasadena, CA 91125 USA

Kenneth D. Miller
Department of Physiology, University of California, San Francisco, CA 94143-0444 USA
Linsker has reported the development of center-surround receptive fields and oriented receptive fields in simulations of a Hebb-type equation in a linear network. The dynamics of the learning rule are analyzed in terms of the eigenvectors of the covariance matrix of cell activities. Analytic and computational results for Linsker's covariance matrices, and some general theorems, lead to an explanation of the emergence of center-surround and certain oriented structures. We estimate criteria for the parameter regime in which center-surround structures emerge.
Linsker (1986, 1988) has studied by simulation the evolution of weight vectors under a Hebb-type teacherless learning rule in a feedforward linear network. The equation for the evolution of the weight vector w of a single neuron, derived by ensemble averaging the Hebbian rule over the statistics of the input patterns, is¹

$$\dot{w}_i = k_1 + \sum_j (Q_{ij} + k_2)\, w_j \quad \text{subject to } -w_{\max} \le w_i \le w_{\max} \qquad (1.1)$$

¹Our definition of equation 1.1 differs from Linsker's by the omission of a factor of 1/N before the sum term, where N is the number of synapses. Also, Linsker allowed more general hard limits, $n_E - 1 \le w_i \le n_E$, $0 < n_E < 1$, which he implemented either directly or by allowing a fraction $n_E$ of synapses to be excitatory ($0 \le w_i \le 1$) and the remaining fraction $1 - n_E$ to be inhibitory ($-1 \le w_i \le 0$). These two formulations are essentially mathematically equivalent; this equivalence depends on the fact that the spatial distributions of inputs and correlations in activity among inputs were taken to be independent of whether the inputs were excitatory or inhibitory. Linsker summarized results for $0.35 \le n_E \le 0.65$ for his layer B → C, but did not report any dependence of results on $n_E$ within this range and focused discussion on $n_E = 0.5$. At higher layers only $n_E = 0.5$ was discussed. Equation 1.1 is equivalent to $n_E = 0.5$. Our analysis does not depend critically on this choice; what is critical is that the origin be well within the interior of the hypercube of allowed synaptic weights, so that initial development is linear.
Neural Computation 2, 173-187 (1990) © 1990 Massachusetts Institute of Technology
where Q is the covariance matrix of activities of the inputs to the neuron. The covariance matrix depends on the covariance function, which describes the dependence of the covariance of two input cells' activities on their separation in the input field, and on the location of the synapses, which is determined by a synaptic density function. Linsker used a gaussian synaptic density function. Similar equations have been developed and studied by others (Miller et al. 1986, 1989).

Depending on the covariance function and the two parameters k1 and k2, different weight structures emerge. Using a gaussian covariance function (his layer B → C), Linsker reported the emergence of nontrivial weight structures, ranging from saturated structures through center-surround structures to bilobed-oriented structures.

The analysis in this paper examines the properties of equation 1.1. We concentrate on the gaussian covariances in Linsker's layer B → C. We give an explanation of the structures reported by Linsker and discuss criteria for the emergence of center-surround weight structures. Several of the results are more general, applying to any covariance matrix Q. Space constrains us to postpone general discussion, technical details, and discussion of other model networks to a future publication (MacKay and Miller 1990).
2 Analysis in Terms of Eigenvectors
We write equation 1.1 as a first-order differential equation for the weight vector w:
$$\dot{\mathbf{w}} = (Q + k_2 J)\,\mathbf{w} + k_1 \mathbf{n} \quad \text{subject to } -w_{\max} \le w_i \le w_{\max} \qquad (2.1)$$
where J is the matrix $J_{ij} = 1 \;\forall\, i, j$, and n is the DC vector $n_i = 1 \;\forall\, i$. This equation is linear, up to the hard limits on $w_i$. These hard limits define a hypercube in weight space within which the dynamics are confined. We make the following assumption:

Assumption 1. The principal features of the dynamics are established before the hard limits are reached. When the hypercube is reached, it captures and preserves the existing weight structure with little subsequent change.

The matrix Q + k2J is symmetric, so it has a complete orthonormal set of eigenvectors² $e^{(a)}$ with real eigenvalues $\lambda_a$. The linear dynamics within the hypercube can be characterized in terms of these eigenvectors, each of which represents an independently evolving weight configuration.

²The indices a and b will be used to denote the eigenvector basis for w, while the indices i and j will be used for the synaptic basis.

First, equation 2.1 has a fixed point at

$$\mathbf{w}^{FP} = -k_1 (Q + k_2 J)^{-1} \mathbf{n} \qquad (2.2)$$
Second, relative to the fixed point, the component of w in the direction of an eigenvector grows or decays exponentially at a rate proportional to the corresponding eigenvalue. Writing $\mathbf{w}(t) = \sum_a w_a(t)\, e^{(a)}$, equation 2.1 yields

$$w_a(t) - w_a^{FP} = [w_a(0) - w_a^{FP}]\, e^{\lambda_a t} \qquad (2.3)$$
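The mode-by-mode solution in equation 2.3 can be checked numerically. The following is a minimal sketch, not Linsker's simulation: the two-synapse covariance matrix, the values of k1 and k2, and all function names are assumptions chosen for illustration, and the hard limits are ignored. It compares the closed form of equation 2.3 with a direct Euler integration of equation 2.1.

```python
import math

# Hypothetical two-synapse example (illustrative values, not from the paper).
Q = [[1.0, 0.5], [0.5, 1.0]]          # toy covariance matrix
k1, k2 = 0.2, -1.0
M = [[Q[i][j] + k2 for j in range(2)] for i in range(2)]   # Q + k2*J

# A symmetric 2x2 matrix [[a, b], [b, a]] has known eigenvectors:
# DC mode (1, 1)/sqrt(2) with eigenvalue a + b, AC mode (1, -1)/sqrt(2) with a - b.
s = 1.0 / math.sqrt(2.0)
modes = {"dc": ([s, s], M[0][0] + M[0][1]),
         "ac": ([s, -s], M[0][0] - M[0][1])}

def project(w, e):
    """Component of w along the unit vector e."""
    return sum(wi * ei for wi, ei in zip(w, e))

# Fixed point per mode: 0 = lambda_a * w_a + k1 * (n . e_a); needs lambda_a != 0.
n = [1.0, 1.0]
w_fp = {name: -k1 * project(n, e) / lam for name, (e, lam) in modes.items()}

def closed_form(w0, t):
    """Equation 2.3 applied to each mode, mapped back to synaptic coordinates."""
    w = [0.0, 0.0]
    for name, (e, lam) in modes.items():
        wa = w_fp[name] + (project(w0, e) - w_fp[name]) * math.exp(lam * t)
        w = [wi + wa * ei for wi, ei in zip(w, e)]
    return w

def euler(w0, t, steps=20000):
    """Forward-Euler integration of equation 2.1 without the hard limits."""
    dt = t / steps
    w = list(w0)
    for _ in range(steps):
        dw = [sum(M[i][j] * w[j] for j in range(2)) + k1 for i in range(2)]
        w = [wi + dt * dwi for wi, dwi in zip(w, dw)]
    return w

w0 = [0.05, -0.02]
w_exact = closed_form(w0, 2.0)
w_num = euler(w0, 2.0)
print(w_exact, w_num)
```

With these values the DC mode has eigenvalue -0.5 and relaxes toward the fixed point, while the AC mode has eigenvalue +0.5 and grows; the two results agree up to the Euler step error.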
Thus, the principal emergent features of the dynamics are determined by the following three factors:

1. The principal eigenvectors of Q + k2J, that is, the eigenvectors with largest positive eigenvalues. These are the fastest growing weight configurations.

2. Eigenvectors of Q + k2J with negative eigenvalue. Each is associated with an attracting constraint surface, the hyperplane defined by $w_a = w_a^{FP}$.

3. The location of the fixed point of equation 1.1. This is important for two reasons: (a) it determines the location of the constraint surfaces and (b) the fixed point gives a "head start" to the growth rate of eigenvectors $e^{(a)}$ for which $|w_a^{FP}|$ is large compared to $|w_a(0)|$ (see Fig. 3).

3 Eigenvectors of Q
We first examine the eigenvectors and eigenvalues of Q. The principal eigenvector of Q dominates the dynamics of equation 2.1 for k1 = 0, k2 = 0. The subsequent eigenvectors of Q become important as k1 and k2 are varied. Some numerical results on the spectrum of Q have appeared in Linsker (1987, 1990) and Miller (1990). Analyses of the spectrum when output cells are laterally interconnected appear in Miller et al. (1986, 1989).

3.1 Properties of Circularly Symmetric Systems. If an operator commutes with the rotation operator, its eigenfunctions can be written as eigenfunctions of the rotation operator. For Linsker's system, in the continuum limit, the operator Q + k2J is unchanged under rotation of the system. So the eigenfunctions of Q + k2J can be written as the product of a radial function and one of the angular functions cos lθ, sin lθ, l = 0, 1, 2, .... To describe these eigenfunctions we borrow from quantum mechanics the notation n = 1, 2, 3, ... and l = s, p, d, ... to denote the function's total number of nodes = 0, 1, 2, ... and number of angular nodes = 0, 1, 2, ..., respectively. For example, "2s" and "2p" both denote eigenfunctions with one node, which is radial in 2s and angular in 2p (see Fig. 1).

Name   Eigenfunction                        λ/N
1s     $e^{-r^2/2R}$                        $lC/A$
2p     $r \cos\theta \, e^{-r^2/2R}$        $l^2 C/A$
2s     $(1 - r^2/r_0^2)\, e^{-r^2/2R}$      $l^3 C/A$

Table 1: The First Three Eigenfunctions of the Operator Q(r, r'). $Q(\mathbf{r}, \mathbf{r}') = e^{-|\mathbf{r}-\mathbf{r}'|^2/2C}\, e^{-r'^2/2A}$, where C and A denote the characteristic sizes of the covariance function and synaptic density function. **r** denotes two-dimensional spatial position relative to the center of the synaptic arbor, and r = |**r**|. The eigenvalues λ are normalized by the effective number of synapses N = 2πA.

For monotonic and nonnegative covariance functions, we conjecture that the leading eigenfunctions of Q are ordered in eigenvalue by their numbers of nodes such that the eigenfunction [nl] has larger eigenvalue than both [(n+1)l] and [n(l+1)]. This conjecture is obeyed in analytical and numerical results we have obtained for Linsker's and similar systems. The general validity of this conjecture is under investigation.
3.2 Analytic Calculations for k2 = 0. We have solved analytically for the first three eigenfunctions and eigenvalues of the covariance matrix for layer B → C of Linsker's network, in the continuum limit (Table 1). 1s, the function with no changes of sign, is the principal eigenfunction of Q; 2p, the bilobed-oriented function, is the second eigenfunction; and 2s, the center-surround eigenfunction, is third.³ Figure 1a shows the first six eigenfunctions for layer B → C of Linsker (1986).

³2s is degenerate with 3d at k2 = 0.
Figure 1: Eigenfunctions of the operator Q + k2J. In each row the eigenfunctions have the same eigenvalue, with largest eigenvalue at the top. Eigenvalues (in arbitrary units): (a) k2 = 0: 1s, 2.26; 2p, 1.0; 2s and 3d, 0.41. (b) k2 = -3: 2p, 1.0; 2s, 0.66; 1s, -17.8. The gray scale indicates the range from maximum negative to maximum positive synaptic weight within each eigenfunction. Eigenfunctions of the operator $(e^{-|\mathbf{r}-\mathbf{r}'|^2/2C} + k_2)\, e^{-r'^2/2A}$ were computed for C/A = 2/3 (as used by Linsker for most layer B → C simulations) on a circle of radius 12.5 grid intervals, with […] = 6.15 grid intervals.
4 The Effects of the Parameters k1 and k2
Varying k2 changes the eigenvectors and eigenvalues of the matrix Q + k2J. Varying k1 moves the fixed point of the dynamics with respect to the origin. We now analyze these two changes, and their effects on the dynamics.

Definition. Let $\hat{\mathbf{n}}$ be the unit vector in the direction of the DC vector n. We refer to $(\mathbf{w} \cdot \hat{\mathbf{n}})$ as the DC component of w.

The DC component is proportional to the sum of the synaptic strengths in a weight vector. For example, 2p and all the other eigenfunctions with angular nodes have zero DC component. Only the s-modes have a nonzero DC component.

4.1 General Theorem: The Effect of k2. We now characterize the effect of adding k2J to any covariance matrix Q.
Theorem 1. For any covariance matrix Q, the spectrum of eigenvectors and eigenvalues of Q + k2J obeys the following:

1. Eigenvectors of Q with no DC component, and their eigenvalues, are unaffected by k2.

2. The other eigenvectors, with nonzero DC component, vary with k2. Their eigenvalues increase continuously and monotonically with k2 between asymptotic limits such that the upper limit of one eigenvalue is the lower limit of the eigenvalue above.

3. There is at most one negative eigenvalue.

4. All but one of the eigenvalues remain finite. In the limits k2 → ±∞ there is a DC eigenvector $\hat{\mathbf{n}}$ with eigenvalue → k2N, where N is the dimensionality of Q, that is, the number of synapses.
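Item 1 of the theorem follows because J maps any vector to (sum of its entries) times n, so J annihilates vectors with zero DC component. The sketch below illustrates this on a toy case; the four-synapse ring covariance matrix and the helper names are assumptions made for illustration only. Because every row of this particular Q has the same sum, the DC vector is itself an eigenvector, and its eigenvalue shifts by exactly k2N.

```python
# Toy covariance matrix for 4 synapses on a ring (hypothetical values);
# it is symmetric and circulant, with eigenvalues 4, 2, 2, 0.
Q = [[2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [1.0, 0.0, 1.0, 2.0]]
N = len(Q)

def add_k2J(Q, k2):
    """Return Q + k2*J, where J is the all-ones matrix."""
    return [[q + k2 for q in row] for row in Q]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rayleigh(M, v):
    """Rayleigh quotient; equals the eigenvalue when v is an eigenvector of M."""
    Mv = matvec(M, v)
    return sum(a * b for a, b in zip(Mv, v)) / sum(b * b for b in v)

def is_eigenvector(M, v, tol=1e-12):
    lam = rayleigh(M, v)
    return all(abs(a - lam * b) < tol for a, b in zip(matvec(M, v), v))

ac = [1.0, 0.0, -1.0, 0.0]   # zero DC component: entries sum to zero
dc = [1.0, 1.0, 1.0, 1.0]    # the DC vector n

results = {}
for k2 in (0.0, -3.0):
    M = add_k2J(Q, k2)
    results[k2] = (is_eigenvector(M, ac), rayleigh(M, ac),
                   is_eigenvector(M, dc), rayleigh(M, dc))
    print(k2, results[k2])
```

In this example the zero-DC eigenvector keeps eigenvalue 2 for both values of k2, while the DC eigenvalue moves from 4 to 4 + k2N = -8 at k2 = -3, becoming the single negative eigenvalue.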
The properties stated in this theorem, whose proof is in MacKay and Miller (1990), are summarized pictorially by the spectral structure shown in Figure 2.

4.2 Implications for Linsker's System. For Linsker's circularly symmetric systems, all the eigenfunctions with angular nodes have zero DC component and are thus independent of k2. The eigenvalues that vary with k2 are those of the s-modes. The leading s-modes at k2 = 0 are 1s, 2s; as k2 is decreased to -∞, these modes transform continuously into 2s, 3s respectively (Fig. 2).⁴ 1s becomes an eigenvector with negative eigenvalue, and it approaches the DC vector $\hat{\mathbf{n}}$. This eigenvector enforces a constraint $\mathbf{w} \cdot \hat{\mathbf{n}} = \mathbf{w}^{FP} \cdot \hat{\mathbf{n}}$, and thus determines that the final average synaptic strength is equal to $\mathbf{w}^{FP} \cdot \mathbf{n}/N$.

⁴The 2s eigenfunctions at k2 = 0 and k2 = -∞ both have one radial node, but are not identical functions.
Figure 2: General spectrum of eigenvalues of Q + k2J as a function of k2. A: Eigenvectors with DC component. B: Eigenvectors with zero DC component. C: Adjacent DC eigenvalues share a common asymptote. D: There is only one negative eigenvalue. The annotations in parentheses refer to the eigenvectors of Linsker's system.

Linsker (1986) used k2 = -3. This value of k2 is sufficiently negative that the properties of the k2 → -∞ limit hold (MacKay and Miller 1990), and in the following we concentrate interchangeably on k2 = -3 and k2 → -∞. The computed eigenfunctions for Linsker's system at layer B → C are shown in Figure 1b for k2 = -3. The principal eigenfunction is 2p. The center-surround eigenfunction 2s is the principal symmetric eigenfunction, but it still has smaller eigenvalue than 2p.

4.3 Effect of k1. Varying k1 changes the location of the fixed point of equation 2.1. From equation 2.2, the fixed point is displaced from the origin only in the direction of eigenvectors that have nonzero DC component, that is, only in the direction of the s-modes. This has two important effects, as discussed in Section 2: (1) The s-modes are given a head start in growth rate that increases as k1 is increased. In particular, the principal s-mode, the center-surround eigenvector 2s, may outgrow the principal eigenvector 2p. (2) The constraint surface is moved when k1 is changed. For large negative k2, the constraint surface fixes the average synaptic strength in the final weight vector. To leading order in 1/k2, Linsker showed that the constraint is $\sum_j w_j = k_1/|k_2|$.⁵

4.4 Summary of the Effects of k1 and k2. We can now anticipate the explanation for the emergence of center-surround cells: For k1 = 0, k2 = 0, the dynamics are dominated by 1s. The center-surround eigenfunction 2s is third in line behind 2p, the bilobed function. Making k2 large and negative removes 1s from the lead. 2p becomes the principal eigenfunction and dominates the dynamics for k1 ≈ 0, so that the circular symmetry is broken. Finally, increasing k1/|k2| gives a head start to the center-surround function 2s. Increasing k1/|k2| also increases the final average synaptic strength, so large k1/|k2| also produces a large DC bias. The center-surround regime therefore lies sandwiched between a 2p-dominated regime and an all-excitatory regime. k1/|k2| has to be large enough that 2s dominates over 2p, and small enough that the DC bias does not obscure the center-surround structure. We now estimate this parameter regime.
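The DC constraint discussed in Section 4.3 can be illustrated on a toy case. The sketch below is not taken from Linsker's simulations: the ring covariance matrix, the value of k1, and the function names are hypothetical. It evaluates the fixed point of equation 2.1 and compares the summed weight with the leading-order estimate k1/|k2| and with the next-order form k1/|k2 + q|, where q is the average covariance. The toy Q has equal row sums, so n is an exact eigenvector and the fixed point is a multiple of n; this is a special case, and for a general Q the fixed point would require solving a linear system.

```python
# Toy covariance matrix with equal row sums (hypothetical values).
Q = [[2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [1.0, 0.0, 1.0, 2.0]]
N = len(Q)
k1 = 0.6

def fixed_point(Q, k1, k2):
    """Fixed point of w' = (Q + k2*J) w + k1*n, assuming equal row sums in Q.

    Under that assumption n is an eigenvector of Q + k2*J with eigenvalue
    row_sum + k2*N, so the fixed point is -k1/(row_sum + k2*N) times n.
    """
    row_sum = sum(Q[0])
    if any(abs(sum(row) - row_sum) > 1e-12 for row in Q):
        raise ValueError("this shortcut needs equal row sums")
    lam_dc = row_sum + k2 * N
    return [-k1 / lam_dc] * N

def residual(Q, k1, k2, w):
    """Right-hand side of equation 2.1 at w; it vanishes at a fixed point."""
    return [sum((Q[i][j] + k2) * w[j] for j in range(N)) + k1
            for i in range(N)]

q_avg = sum(sum(row) for row in Q) / N ** 2   # average covariance

rows = []
for k2 in (-3.0, -30.0):
    w_fp = fixed_point(Q, k1, k2)
    rows.append({
        "k2": k2,
        "sum_w": sum(w_fp),
        "leading": k1 / abs(k2),
        "next_order": k1 / abs(k2 + q_avg),
        "max_residual": max(abs(r) for r in residual(Q, k1, k2, w_fp)),
    })
    print(rows[-1])
```

For this special Q the next-order expression reproduces the summed weight exactly, and the leading-order estimate approaches it as k2 becomes more negative (0.2 versus 0.3 at k2 = -3, and 0.02 versus about 0.0207 at k2 = -30).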
5 Criteria for the Center-Surround Regime
We use two approaches to determine the DC bias at which 2s and 2p are equally favored. This DC bias gives an estimate for the boundary between the regimes dominated by 2s and 2p.

1. Energy Criterion: We first estimate the level of DC bias at which the weight vector composed of (2s plus DC bias) and the weight vector composed of (2p plus DC bias) are energetically equally favored. This gives an estimate of the level of DC bias above which 2s will dominate under simulated annealing, which explores the entire space of possible weight configurations.

2. Time Development Criterion: Second, we estimate the level of DC bias above which 2s will dominate under simulations of time development of equation 1.1. We estimate the relationship between the parameters such that, starting from a typical random distribution of initial weights, the 2s mode reaches the saturating hypercube at the same time as the 2p mode.

⁵To next order, this expression becomes $\sum_j w_j = k_1/|k_2 + q|$, where $q = \langle Q_{ij} \rangle$, the average covariance (averaged over i and j). The additional term largely resolves the discrepancy between Linsker's g and k1/|k2| in Linsker (1986).
Both criteria will depend on an estimate of the complex effect of the weight limits $-w_{\max} \le w_i \le w_{\max}$. (Without this hypercube of saturation constraints, 2p will always dominate the dynamics of equation 1.1 after a sufficiently long time.) We introduce $g = k_1/(|k_2| N w_{\max})$ as a measure of the average synaptic strength induced by the DC constraint, such that g = 1 means all synapses equal $w_{\max}$.⁶ Noting that a vector of amplitude $\sqrt{N} w_{\max}$ has rms synaptic strength $w_{\max}$, we make the following estimate of the constraint imposed by the hypercube (discussed further in MacKay and Miller 1990):

Assumption 2. When the DC level is constrained to be g, the component h(g) in the direction of a typical unit AC vector at which the hypercube constraint is "reached" is $h(g) = \sqrt{N} w_{\max} (1 - g)$.

Assumptions 1 and 2 may not adequately characterize the effects of the hypercube on the dynamics, so the numerical estimates of the precise locations of the boundaries between the regions may be in error. However, the qualitative picture presented by these boundaries is informative.

5.1 Energy Criterion. Linsker suggested analysis of equation 1.1 in terms of the energy function on which the dynamics perform constrained gradient descent. The energy of a configuration $\mathbf{w} = \sum_a w_a e^{(a)}$ is

(5.1)

where $n_a$ is the DC component of eigenvector $e^{(a)}$. We consider two configurations, one with $w_{2p}$ equal to its maximum value h(g) and $w_{2s} = 0$, and one with $w_{2p} = 0$ and $w_{2s} = \mathrm{sign}(n_{2s})\, h(g)$. The component $w_{1s}$ is the same in both cases. All the other components are assumed to be small and to contribute no bias in energy between the two configurations. The energies of these configurations will be our estimates of the energies of saturated configurations obtained by saturating 2p and 2s, respectively, subject to the constraints. We compare these two energies and find the DC level $g = g_E$ at which they are equal.⁷

For Linsker's layer B → C connections, our estimate of $g_E$ is 0.16.

⁶This is equal to twice Linsker's g.

⁷λ/N is written as a single entity because λ ∝ N. Also $n_{2s} \sim 1/k_2$, so $n_{2s} k_2$ tends to a constant as k2 → ∞.

5.2 Time Development Criterion. The energy criterion does not take into account the initial conditions from which equation 1.1 starts. We now derive a second criterion that attempts to do this.
Figure 3: Schematic diagram illustrating the criteria for 2s to dominate. The polygon of size h(g) represents the hypercube. Energy criterion: The points marked $E_{2p}$ and $E_{2s}$ show the locations at which the energy estimates were made. Time development criterion: The gray cloud surrounding the origin represents the distribution of initial weight vectors. If $w_{2p}(0)$ is sufficiently small compared to $w_{2s}^{FP}$, and if the hypercube is sufficiently close, then the weight vector reaches the hypercube in the direction of 2s before $w_{2p}$ has grown appreciably.
If the initial random component in the direction of 2p, $w_{2p}(0)$, is sufficiently small compared to $w_{2s}^{FP}$, which provides 2s with a head start, then $w_{2p}$ may never start growing appreciably before the growth of $w_{2s}$ saturates (Fig. 3). The initial component $w_{2p}(0)$ is a random quantity whose typical magnitude can be estimated statistically from the weight initialization parameters. $w_{2p}(0)_{rms}$ scales as $1/\sqrt{N}$ relative to the nonrandom quantity $w_{2s}^{FP}$. Hence the initial relative magnitude of $w_{2p}$ can be made arbitrarily small by increasing N, and the emergence of center-surround structures may be achieved at any g by using an N sufficiently large to suppress the initial symmetry-breaking fluctuations.

We estimate the boundary between the regimes dominated by 2s and 2p by finding the choice of parameters such that $w_{2p}(t)$ and $w_{2s}(t)$ reach the hypercube at the same time. We evaluate the time $t_{2s}$ at which $w_{2s}$ reaches the hypercube.⁸ Our estimate of the typical starting component for 2p is $w_{2p}(0)_{rms} = \sigma(g)\, w_{\max}$, where σ(g) is a dimensionless standard deviation derived in MacKay and Miller (1990). We set $w_{2p}(t_{2s}) = h(g)$, and solve for N*, the number of synapses above which $w_{2s}$ reaches the hypercube before $w_{2p}$, in terms of g.
5.3 Discussion of the Two Criteria. Figure 4 shows $g_E$ and N*(g). The two criteria give different boundaries. In regime A, 2p is estimated both to emerge under equation 1.1 and to be energetically favored. Similarly, in regime C, 2s is estimated to dominate equation 1.1 and to be energetically favored. In regime D, the initial fluctuations are so big that although 2s is energetically favored, symmetry-breaking structures can dominate equation 1.1.⁹ Lastly, in regime B, although 2p is energetically favored, 2s will reach saturation first because N is sufficiently large that the symmetry-breaking fluctuations are suppressed. Whether this saturated 2s structure will be stable, or whether it might gradually destabilize into a 2p-like structure, is not predicted by our analysis.¹⁰

The possible difference between simulated annealing and equation 1.1 makes it clear that if initial conditions are important (regimes B and D), the use of simulated annealing on the energy function as a quick way of finding the outcome of equation 1.1 may give erroneous results.

Figure 4 also shows the areas in the parameter space in which Linsker made the simulations he reported. The agreement between experiment and our estimated boundaries is reasonable.

⁸We set $w_{2s}(0) = 0$, neglecting its fluctuations, which for large N are negligible compared with $w_{2s}^{FP}$.

⁹If the initial component of 2s is toward the fixed point, the 2s component must first shrink to zero before it can then grow in the opposite direction. Thus, large fluctuations may either hinder or help 2s, while they always help 2p.

¹⁰In a one-dimensional model system we have found that both cases may be obtained, depending sensitively on the parameters.
Figure 4: Boundaries estimated by the two criteria for C/A = 2/3. To the left of the line labeled $g_E$, the energy criterion predicts that 2p is favored; to the right, 2s is favored. Above and below the line N*(g), the time development criterion estimates that 2s and 2p, respectively, will dominate equation 1.1. The regions X, Y mark the regimes studied by Linsker: (X) N = 300-600, g = 0.3-0.6: the region in which Linsker reported robust center-surround; (Y) N = 300-600, g ≤ 0.2: asymmetric center-surround structures and (near g = 0) bilobed cells. (Plot axes: N, from 0 to 1000, versus g, from 0 to 0.5.)
6 Conclusions and Discussion
For Linsker's B → C connections, we predict four main parameter regimes for varying k1 and k2.¹¹ These regimes, shown in Figure 5, are dominated by the following weight structures:

k2 = 0, k1 = 0: The principal eigenvector of Q, 1s.

k2 = large positive and/or k1 = large: The flat DC weight vector, which leads to the same saturated structures as 1s.

k2 = large negative, k1 ≈ 0: The principal eigenvector of Q + k2J for k2 → -∞, 2p.

k2 = large negative, k1 = intermediate: The principal eigenvector of Q + k2J for k2 → -∞ with nonzero DC component, 2s. The size of this regime can depend on the size of the symmetry-breaking fluctuations, and hence on the number of synapses.

¹¹Not counting the symmetric regimes (k1, k2) ↔ (-k1, k2) in which all the weight structures are inverted in sign.

Figure 5: Parameter regimes for Linsker's system. The DC bias is approximately constant along the radial lines, so each of the regimes with large negative k2 is wedge shaped.
Higher layers of Linsker's network can be analyzed in terms of the same four regimes; the principal eigenvectors are altered, so that different structures can emerge (MacKay and Miller 1990). Linsker suggested that the emergence of center-surround structures may depend on the peaked synaptic density function that he used (Linsker 1986, p. 7512). However, with a flat ("pillbox") density function, the eigenfunctions are qualitatively unchanged, so we expect that center-surround structures may emerge by the same mechanism.
The development of the interesting cells in Linsker's layer B → C depends on the use of negative synapses and on the use of the terms k1 and k2 to enforce a constraint on the final percentages of positive and negative synapses. Both of these may be biologically problematic (Miller 1990; MacKay and Miller 1990). A linear Hebb rule like Linsker's can be derived without the use of negative synapses by examining the difference between the innervation strengths of two equivalent excitatory projections, for example, left-eye and right-eye inputs (Miller et al. 1989) or ON-center and OFF-center inputs (Miller 1989). However, in this case the constants k1 and k2 disappear from the equation for the development of the difference of synaptic strengths because these constants take on equal values for each of the two equivalent populations. Therefore, there will only be one regime, in which the principal eigenvector of Q dominates. Such a model can nonetheless develop interesting receptive field structures if oscillations exist in the covariance functions of the input layer, and particularly if lateral interactions are introduced in the output layer (Linsker 1987; Miller et al. 1989; Miller 1989, 1990).
Acknowledgments

D.J.C.M. is supported by a Caltech Fellowship and a Studentship from SERC, UK. K.D.M. thanks M. P. Stryker for encouragement and financial support while this work was undertaken. K.D.M. was supported by an N.E.I. Fellowship and the International Joint Research Project Bioscience Grant to M. P. Stryker (T. Tsumoto, Coordinator) from the N.E.D.O., Japan. This collaboration would have been impossible without the internet/NSFnet.
References

Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Linsker, R. 1987. Towards an organizing principle for perception: Hebbian synapses and the principle of optimal neural encoding. IBM Research Report RC 12830.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21(3), 105-117.
Linsker, R. 1990. Designing a sensory processing system: What can be learned from principal components analysis? Proc. Int. Joint Conf. on Neural Networks, Jan. 1990, M. Caudill, ed., pp. II:291-97. Lawrence Erlbaum, Hillsdale, NJ.
MacKay, D. J. C., and Miller, K. D. 1990. Analysis of Linsker's application of Hebbian rules to linear networks. Network, to appear.
Miller, K. D. 1989. Orientation-selective cells can emerge from a Hebbian mechanism through interactions between ON- and OFF-center inputs. Soc. Neurosci. Abstr. 15, 794.
Miller, K. D. 1990. Correlation-based mechanisms of neural development. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 267-353. Lawrence Erlbaum, Hillsdale, NJ.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1986. Models for the formation of ocular dominance columns solved by linear stability analysis. Soc. Neurosci. Abstr. 12, 1373.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Received 17 January 1990; accepted 20 February 1990.
Communicated by David Rumelhart
Generalizing Smoothness Constraints from Discrete Samples Chuanyi Ji Robert R. Snapp* Demetri Psaltis Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA
We study how certain smoothness constraints, for example, piecewise continuity, can be generalized from a discrete set of analog-valued data by modifying the error backpropagation learning algorithm. Numerical simulations demonstrate that by imposing two heuristic objectives during the learning process, (1) reducing the number of hidden units and (2) minimizing the magnitudes of the weights in the network, one obtains a network with a response function that smoothly interpolates between the training data.
*Current address: Department of Computer Science and Electrical Engineering, Votey Building, University of Vermont, Burlington, VT 05405 USA.
Neural Computation 2, 188-197 (1990) © 1990 Massachusetts Institute of Technology
1 Introduction
Extensive numerical simulations have demonstrated the utility of error backpropagation (BP) (Rumelhart et al. 1986; Werbos 1974; Sejnowski and Rosenberg 1987) for training a multilayer neural network to learn a given set of input-output associations. The issue of generalization, that is, training a network to respond reasonably to input data not present in the training set, is usually addressed by overconstraining the network. Recently a lower bound on the number of training samples required for generalization by a feedforward network with a fixed number of hidden units has been asymptotically estimated using the saturation property of Vapnik's and Chervonenkis's growth function (Baum and Haussler 1989; Vapnik and Chervonenkis 1968). How to match a network's size and architecture to a given training set so that the response generalizes well is still, however, an unsolved problem.
In addition to accurately replicating the training data, the network response function (the values of the output units as a function of the network inputs) should reflect any constraints that may govern the training problem. Obtaining these constraints directly from a finite set of data is not possible, because many different rules can produce the given set. Thus, supplementary information is required. This information is generally problem dependent and often eludes a concise formulation.
In this paper, we explore an approach to this problem suitable for analogue association problems subject to certain smoothness constraints. This includes the important class of pattern association problems that originate from a natural setting, and hence are subject to constraints resulting from the continuous nature of physical law. These implicit smoothness constraints have been successfully exploited in other contexts. For example, to infer the locations and orientations of physical surfaces from a set of visual inputs, it is important to assume that these surfaces are piecewise continuous (Marr 1981). Smoothness constraints are often used to "regularize" other ill-posed inverse problems that arise in early vision (Poggio and Koch 1985) and in other fields (Tikhonov and Arsenin 1977). For this broad class of problems, the problem of generalization reduces to that of finding a network with a response function that interpolates smoothly between the training samples.
As a simple example, we consider a set of points taken from the graph of a continuous real-valued function of one variable and a feedforward network with a single input unit, a layer of $N$ hidden sigmoidal units, followed by a single linear output unit. If $N$ and the initial magnitudes of the network weights are chosen too large, the network obtained under the BP algorithm usually has a very irregular response function (cf. Figs. 1 and 2). One way of obtaining a network with a smooth response function is to add a measure of smoothness to the standard BP error function as a perturbation. For example, one might average an absolute measure of the local curvature of the response function over the expected distribution of network inputs. As this requires integrating over a fine mesh embedded in the input space, more computation may be required than is practical.
Instead, we will follow a more intuitive path. In particular we will modify the standard BP learning rule in such a way as to (1) reduce the number of hidden units in the network, and (2) minimize the magnitudes of the network's weights. As it tends to overconstrain the network, the first objective parallels that of polynomial regression, where one seeks the lowest order polynomial that reliably fits a given set of data (Akaike 1977). Furthermore, this objective will often eliminate spurious local extrema in the response function. The second objective is designed to avoid unnecessarily abrupt transitions in the response function. This follows from observing that the gradient of the sigmoidal function $f(w^T x - \theta)$ with respect to the input vector, $x$, is proportional to the weight vector, $w$.
This intuitive approach was partially inspired by a recent algorithm designed to reduce the number of weights in a network during the training phase (Rumelhart 1988). Here, one auxiliary term, $\sum_i \sigma(w_i)$, was added to the standard BP error function, the summation being performed over all of the weights and thresholds in the network. The terms of the summation, $\sigma(w) = w^2/(1 + w^2)$, measure the magnitudes of the weights relative to unity. Thus, this summation is a rough measure of the number of "significant" weights in the network; adding it to the error function biases the algorithm toward architectures that use the smallest number of significant weights. The combined energy function is then minimized by steepest descent. After a certain training criterion is reached, weights with magnitudes falling below a critical threshold can be removed from the network by clamping their values to zero. Although this algorithm reduces the number of weights, it does not effectively reduce the number of hidden units, as architectures with fewer hidden units, but the same number of weights, are not favored.
In contrast to the above, we add two terms to the BP learning rule. The first is designed to remove as many hidden units as possible, while maintaining an acceptable level of error in the response function over the training data. For this to succeed, the units must be operating near their transition regime. The second term is designed to satisfy this requirement as well as minimize the magnitudes of the weights. These modifications to BP are detailed in Section 2. In Section 3, we present the results of several numerical simulations that demonstrate their effectiveness. In the first set of simulations we show that a network, beginning with a large number of hidden units, can be reduced in size to one having a response function that smoothly interpolates between the training data points. In the second set, we construct training data from a network with an "unknown" number of hidden units, and show that the algorithm can be used to infer the architecture of the unknown network with a high probability. Finally, we present our conclusions in Section 4.
2 The Network Reduction Algorithm
We consider the problem of training a feedforward network having a single input unit, one layer of $N$ hidden sigmoidal units, and a single linear output unit, to smoothly interpolate between the $M$ ordered pairs, $\{(x^\pi, y^\pi): \pi = 1, \ldots, M\}$, of a given training set. (Here, $y^\pi$ is the desired output value when the network input is set to $x^\pi$.) We assume that the number of hidden units has been initially estimated to be larger than necessary. Let $u_i$ and $w_i$ denote the input and output weights of the $i$th hidden unit, and $\theta_i$ its threshold value. The response function of the network then has the form $g(x; w, \theta) = \sum_{i=1}^{N} w_i f(u_i x - \theta_i)$, where, for notational convenience, we let $w = (u_1, \ldots, u_N, w_1, \ldots, w_N)^T$ and $\theta = (\theta_1, \ldots, \theta_N)^T$. The sigmoidal function is usually taken to be a modified hyperbolic tangent, that is, $f(s) = 1/(1 + e^{-s})$. Under BP, one attempts to find weight and threshold values that minimize the standard error function,

$$E_0(w, \theta) = \sum_{\pi=1}^{M} \left[ g(x^\pi; w, \theta) - y^\pi \right]^2$$

by gradient descent (Rumelhart et al. 1986; Werbos 1974).
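The response function and standard error defined above can be written out in a few lines. The following is a minimal sketch, assuming NumPy and the logistic form of $f$; the function names `sigmoid`, `response`, and `standard_error` are illustrative choices, not names used by the authors.

```python
import numpy as np

def sigmoid(s):
    """Logistic sigmoid f(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def response(x, u, w, theta):
    """g(x) = sum_i w_i * f(u_i * x - theta_i), for scalar or array input x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    hidden = sigmoid(np.outer(x, u) - theta)   # hidden activations, shape (len(x), N)
    return hidden @ w

def standard_error(xs, ys, u, w, theta):
    """E_0: sum over the training pairs of the squared output error."""
    return float(np.sum((response(xs, u, w, theta) - ys) ** 2))
```

With all input weights and thresholds at zero, every hidden unit outputs 1/2, so the response is simply half the sum of the output weights; this gives a quick sanity check on the implementation.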
For the architecture described above, we define a hidden unit to be significant if it is coupled to both the input and output units with weights of a significantly large magnitude, that is, greater than one. Thus, the quantity

$$S_i = \sigma(u_i)\,\sigma(w_i)$$

can be viewed as a measure of the significance of the $i$th hidden unit, where, as before, $\sigma(w) = w^2/(1 + w^2)$. Following the first objective of the previous section, we desire to favor those architectures that require the fewest significant hidden units. If the given training set does not fully constrain the given network architecture, then there is, in general, a degenerate set of solutions over which $E_0(w, \theta)$ is acceptably small. Following our first objective, we add a term proportional to

$$E_1 = \sum_{i=1}^{N} \sum_{j=1}^{i-1} S_i S_j$$

to the standard error function. This biases the algorithm toward those solutions that require the fewest significant hidden units. From its definition, $E_1$ achieves a minimum value of zero if no more than one hidden unit has a nonzero significance, and approaches its upper bound of $N(N - 1)/2$ as the magnitudes of all weights increase without bound. After applying the gradient descent algorithm, we obtain the learning rule,

$$w_i^{n+1} = w_i^n - \eta \frac{\partial E_0}{\partial w_i}(w^n, \theta^n) - \lambda \frac{\partial E_1}{\partial w_i}(w^n) \qquad (2.1)$$

where $\eta$ and $\lambda$ are learning rate parameters. Note that the last term in equation 2.1 couples the dynamics of the weights so that, for example, increasing the significance of the $k$th hidden unit correspondingly increases the decay rate of every weight associated with the other hidden units. Also note that this term is proportional to $\sigma'(w_i) = 2w_i/(1 + w_i^2)^2$, which becomes insignificant for large enough $|w_i|$. This will help stabilize the dominant weights, but will also, in part, necessitate the second objective stated in the previous section.
Because of possible conflicts between the two gradients in equation 2.1, spurious equilibria may exist. It is thus helpful to include the auxiliary term only after the network has learned the training set to a sufficient degree. Consequently, we let $\lambda = \lambda(E_0) = \lambda_0 \exp(-\beta E_0)$, where $\beta^{-1}$ defines a characteristic standard error: the value of $E_0$ below which the auxiliary term comes into play. Note that the resulting learning rule no longer follows the direction of steepest descent of the combined error function; however, the desired objective is ultimately obtained.
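To make the auxiliary term concrete, the sketch below computes the significances $S_i$, the pairwise term $E_1$, its gradient with respect to the input and output weights, and the gate $\lambda(E_0)$. It assumes NumPy; the function names are hypothetical, and the gradient is derived here from the definition of $E_1$ as a sum over pairs rather than taken from the paper.

```python
import numpy as np

def sigma(w):
    """sigma(w) = w^2 / (1 + w^2): near 0 for small |w|, near 1 for large |w|."""
    return w**2 / (1.0 + w**2)

def sigma_prime(w):
    """Derivative of sigma: 2w / (1 + w^2)^2."""
    return 2.0 * w / (1.0 + w**2) ** 2

def significance(u, w):
    """S_i = sigma(u_i) * sigma(w_i) for each hidden unit."""
    return sigma(u) * sigma(w)

def e1(u, w):
    """E_1: sum of S_i * S_j over all pairs of distinct hidden units."""
    s = significance(u, w)
    return 0.5 * (s.sum() ** 2 - np.sum(s**2))

def e1_grad(u, w):
    """Gradients of E_1 with respect to the input weights u and output weights w."""
    s = significance(u, w)
    others = s.sum() - s            # for unit i: sum of S_j over j != i
    return (sigma_prime(u) * sigma(w) * others,
            sigma_prime(w) * sigma(u) * others)

def gate(e0, lam0, beta):
    """lambda(E_0) = lambda_0 * exp(-beta * E_0)."""
    return lam0 * np.exp(-beta * e0)
```

The `others` factor is what couples the units: raising the significance of one unit increases the pull toward zero on every weight of the remaining units.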
We attain our second objective of reducing the weight magnitudes by subtracting an amount proportional to $\tanh(w_i)$ from the right-hand side of equation 2.1. Although other choices are possible, this one has been shown to be effective in practice. Unlike the weight reduction scheme (Rumelhart 1988) discussed in the introduction, our method preferentially reduces the larger weights in the network. We also apply this term to the threshold's update rule because, in our examples, we are interested in the region of input space around the origin. We thus obtain the network reduction algorithm,

$$w_i^{n+1} = w_i^n - \eta \frac{\partial E_0}{\partial w_i}(w^n, \theta^n) - \lambda \frac{\partial E_1}{\partial w_i}(w^n) - \mu \tanh(w_i^n) \qquad (2.3)$$

$$\theta_i^{n+1} = \theta_i^n - \eta \frac{\partial E_0}{\partial \theta_i}(w^n, \theta^n) - \mu \tanh(\theta_i^n) \qquad (2.4)$$

As before, we gate the influence of the new term on how well the network is learning the training set. In this case, it appears helpful to reduce this term gradually with time. In particular, we let $\mu = \mu_0 \left| E_0(w^n, \theta^n) - E_0(w^{n-1}, \theta^{n-1}) \right|$.
Once an acceptable level of performance is reached, any weight with magnitude below a certain level is removed from the network. When a hidden unit is connected to the rest of the network with only "removed" weights, then the unit is discarded. Thus, as is desired, the number of hidden units is reduced.
Finally, we mention that this algorithm can be extended to other architectures. For example, for a network having $K$ inputs and $L$ outputs, one may let $S_i = \sum_{k=1}^{K} \sum_{l=1}^{L} \sigma(u_{ki})\,\sigma(w_{il})$, where $u_{ki}$ is the value of the weight connecting the $k$th input to the $i$th hidden unit, and $w_{il}$ is that of the weight connecting the $i$th hidden unit with the $l$th output.
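One iteration of the update in equations 2.3 and 2.4, followed by the pruning step, might look as follows for the single-input, single-output network. This is a sketch under stated assumptions, not the authors' implementation: it assumes NumPy and the logistic sigmoid, the names `e0_and_grads`, `reduction_step`, and `prune` are hypothetical, and the pruning rule (drop a unit only when both of its weights have been zeroed) is one reading of the text.

```python
import numpy as np

def f(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigma(w):
    return w**2 / (1.0 + w**2)

def dsigma(w):
    return 2.0 * w / (1.0 + w**2) ** 2

def e0_and_grads(xs, ys, u, w, theta):
    """E_0 and its gradients for g(x) = sum_i w_i f(u_i x - theta_i)."""
    h = f(np.outer(xs, u) - theta)      # hidden activations, shape (M, N)
    err = h @ w - ys                    # output error for each training pair
    dh = h * (1.0 - h) * w              # w_i * f'(u_i x - theta_i), shape (M, N)
    g_w = 2.0 * err @ h
    g_u = 2.0 * (err * xs) @ dh
    g_t = -2.0 * err @ dh
    return float(err @ err), g_u, g_w, g_t

def reduction_step(xs, ys, u, w, theta, e0_prev, eta, lam0, beta, mu0):
    """One update of weights and thresholds; returns the new values and E_0."""
    e0, g_u, g_w, g_t = e0_and_grads(xs, ys, u, w, theta)
    s = sigma(u) * sigma(w)
    others = s.sum() - s                # sum of S_j over j != i
    lam = lam0 * np.exp(-beta * e0)     # gate on how well the data are learned
    mu = mu0 * abs(e0 - e0_prev)        # decays as E_0 stops changing
    u_new = u - eta * g_u - lam * dsigma(u) * sigma(w) * others - mu * np.tanh(u)
    w_new = w - eta * g_w - lam * dsigma(w) * sigma(u) * others - mu * np.tanh(w)
    t_new = theta - eta * g_t - mu * np.tanh(theta)
    return u_new, w_new, t_new, e0

def prune(u, w, theta, level=0.1):
    """Zero weights below `level`; discard units left with only removed weights."""
    u = np.where(np.abs(u) < level, 0.0, u)
    w = np.where(np.abs(w) < level, 0.0, w)
    keep = (u != 0.0) | (w != 0.0)
    return u[keep], w[keep], theta[keep]
```

In use, `reduction_step` would be called repeatedly, feeding the returned `e0` back in as `e0_prev`, and `prune` applied once $E_0$ falls below the chosen criterion.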
3 Simulation Results
The first simulation demonstrates that the modified learning rule reduces the number of hidden units and results in a smooth response function. The training set of the first run consists of 9 equally spaced data points taken from the graph of the function $\phi(x) = e^{-(x-1)^2} + e^{-(x+1)^2}$ over the domain $[-2, 2]$. We begin with the network described in Section 2, with $N = 20$; the 40 weights and 20 thresholds are randomly initialized from a uniform distribution over the interval $[-25, 25]$. With learning parameters set to $\eta \approx 5 \times 10^{-3}$, $\beta = 0.1$, $\lambda_0 = 6.5 \times 10^{-3}$, and $\mu_0 = 5 \times 10^{-4}$, the network is trained by applying the learning rules in equations 2.3 and 2.4. After the value of $E_0$ falls below 0.05, any weight with a magnitude less than 0.1 is set to zero. The resulting network uses only 5 of the 20 available hidden units. At this point, the nominal increase in $E_0$ resulting from eliminating the weaker weights is corrected by training the reduced network with the standard BP algorithm for a few additional iterations. In Figure 1, the response function of the network obtained by this procedure is compared against one obtained by BP (i.e., $\lambda_0 = \mu_0 = 0$) with the same initial conditions. Note that the response function obtained by the network reduction algorithm smoothly interpolates between the 9 training points and possesses the same number of local extrema as $\phi(x)$. This cannot be said for the response function generated by BP, which oscillates wildly, with minimum and maximum values of -17.5 and 10.9 over the input domain $[-3, 3]$. A more quantitative comparison can be made by averaging the root-mean-square (RMS) deviation between each response function and $\phi(x)$ over the interval $[-2, 2]$, viz.

$$E_{RMS} = \left[ \frac{1}{4} \int_{-2}^{2} \left[ g(x) - \phi(x) \right]^2 dx \right]^{1/2}$$

(Note that $E_{RMS}$ is a random variable, as the particular determination of the response function, $g$, depends on the initial values of the weights and thresholds, which are set at random.) For this instance of the network reduction algorithm, we obtain $E_{RMS} = 1.15 \times 10^{-2}$, while for BP, $E_{RMS} = 1.71$.
Figure 1: The solid curve indicates the output response of the network with 1 input, 20 hidden units, and 1 output, trained by the network reduction algorithm on the 9 training points indicated by circular markers. The broken curve indicates a response function obtained by training the same network with the same data using BP. The dotted curve indicates the graph of $\phi(x)$ from which the training points were selected. Horizontal axis: Network Input.
If the algorithm does yield a network that generalizes well, then the size of the network should not depend critically on the number of training samples used. This necessarily assumes that the data in each set faithfully represent the significant features of the training problem. Therefore, in a second run, the network is trained with the same initial state, but with a training set containing 17 equally spaced samples taken from the graph of the same function. Again the algorithm reduces the network to 5 significant hidden units, with $E_{RMS} = 8.81 \times 10^{[\ldots]}$. Training the network under BP with this data yields a network with $E_{RMS} = 4.73 \times 10^{-2}$. The response functions of these two networks are displayed in Figure 2.
Next, we explore an "inverse network" problem: To what extent can one use this algorithm to infer the architecture of a feedforward neural network from only a finite sample of its response function? A network containing one input, a layer of two hidden units, and a single linear output is chosen, and a training set of 17 sample points from its response function is generated. Then, an ensemble of 50 new networks, each
containing 10 hidden units, is trained on the data set with the network reduction algorithm. The results of these 50 simulations are summarized in Table 1. Note that 22 times out of 50 the algorithm finds a network of minimal size. It is encouraging that the response functions with the least average RMS error, measured over the interval $[-2, 2]$, come from networks having two or three hidden neurons. In Figure 3, we show how the response functions of the network regress toward that of the "concealed" network as the number of hidden units used decreases. For comparison, 10 networks trained by BP alone yield an average RMS error of 0.461 without any apparent reduction in the number of significant hidden units.
Figure 2: Same as in Figure 1, but with 17 training points. At the left end of the displayed interval, the network response obtained from BP (the broken curve) drops off scale to -5. At the right end, it quickly climbs to 20, and then saturates. Horizontal axis: Network Input.
Table 1: Simulation Results for the "Inverse Network" Problem. Of a total of 50 runs, the center column shows the number of times a network with $n$ significant hidden units was obtained. The right column equals the average $E_{RMS}$, computed over the interval $[-2, 2]$, over the $\nu_n$ runs ending with $n$ significant hidden units.
Hidden units (n)    Frequency (ν_n)    E(E_RMS | n)
2                   22                 0.0275
3                   9                  0.0263
4                   6                  0.0405
5                   6                  0.0401
6                   3                  0.0656
7                   4                  0.0700
4 Concluding Remarks
In the above we have shown that by adding suitably chosen terms to the BP learning rule, desirable global properties in the network's response function can be obtained. In particular, the BP algorithm was tailored to prefer networks having smoother response functions. From the simulations it is apparent that this behavior is attained at the cost of a slower convergence rate. In the first simulation, where $\phi(x)$ was approximated using a nine point training set, 50,000 iterations were required by the network reduction algorithm, but only 300 were required by BP. This discrepancy was, however, reduced when the networks were trained from the 17 point set: the network reduction algorithm needed 650,000 iterations, and BP, 150,000. Approximately one-half of the 50 "inverse network" simulations required more than 800,000 iterations. This is only slightly larger than the 400,000 iterations typically needed to train
a network with 10 hidden units, with the same 17 samples drawn from the "concealed" network, using BP. As BP typically requires a greater number of iterations to learn a given training set on a smaller network, and as network reduction attempts to select the smallest component of the given network that is just capable of learning the training set, it is not surprising that the network reduction algorithm converges at a slower rate. By a similar argument, we conjecture that network reduction is more effective than training a small network with BP while gradually adding new hidden units until the network has learned the training set. For example, BP learned the "inverse network" problem within 800,000 iterations in only one of 10 trials, on a network containing two hidden units, when the weights were initialized randomly from a uniform distribution over the interval $[-25, 25]$. Note, however, that in this case, the convergence rate can be greatly accelerated by choosing initial weight values with smaller magnitudes.
Figure 3: Response functions obtained for the "inverse network" problem are displayed as solid curves, that of the "concealed" network as a dotted curve, and the training points as circular markers. Graphs (a)-(c) reflect the median outcomes (the networks resulting in the median values of $E_{RMS}$) for the sets of trials resulting in $n$ = 2, 4, and 7 significant hidden units, respectively. Graph (d) reflects the median outcome of 10 trials using BP, all of which resulted in 10 significant hidden units. Values of $E_{RMS}$ were computed over the input interval $[-2, 2]$; for the response functions displayed in graphs (a)-(d), $E_{RMS}$ equals $2.87 \times 10^{-2}$, $3.35 \times 10^{-2}$, $4.49 \times 10^{-2}$, and 0.291, respectively. Panels: (a) n = 2, (b) n = 4, (c) n = 7, (d) BP (n = 10). Horizontal axis: Network Input.
Acknowledgments
We would like to thank our colleagues, Mr. Mark Neifeld, Dr. Jeff Yu, Dr. Fai Mok, and Mr. Alan Yamamura for helpful discussions. This work was supported by the JPL Director's Discretionary Fund and in part by DARPA and the Air Force Office of Scientific Research.
References
Akaike, H. 1977. On entropy maximization principle. In Applications of Statistics, P. R. Krishnaiah, ed., pp. 27-41. North-Holland, Amsterdam.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Marr, D. 1981. Vision, p. 114. W. H. Freeman, San Francisco.
Poggio, T., and Koch, C. 1985. Ill-posed problems in early vision: From computational theory to analogue networks. Proc. R. Soc. London B 226, 303-323.
Rumelhart, D. E. 1988. Presentation given at Hewlett-Packard, Seminar on Neural Networks.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-364. MIT Press, Cambridge, MA.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. Winston, Washington, D.C.
Vapnik, V. N., and Chervonenkis, A. Ya. 1968. On the uniform convergence of relative frequencies of events to their probabilities. Dokl. Acad. Nauk SSSR 181(4), 781; English translation in Theory Prob. Appl. XVI, 264-280.
Werbos, P. 1974. Ph.D. thesis, Harvard University.
Received 15 August 1989; accepted 20 February 1990.
Communicated by David Willshaw
The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks Marcus Frean Department of Physics and Centre for Cognitive Science, Edinburgh University, The Kings Buildings, Mayfield Road, Edinburgh, Scotland
A general method for building and training multilayer perceptrons composed of linear threshold units is proposed. A simple recursive rule is used to build the structure of the network by adding units as they are needed, while a modified perceptron algorithm is used to learn the connection strengths. Convergence to zero errors is guaranteed for any boolean classification on patterns of binary variables. Simulations suggest that this method is efficient in terms of the numbers of units constructed, and the networks it builds can generalize over patterns not in the training set.
1 Introduction
The perceptron learning algorithm (Rosenblatt 1962) offers a powerful but restricted method for learning binary classifications. All classifications that can in theory be learned by the perceptron architecture will be learned; however, the number of classifications it can learn is only a tiny subset (linearly separable patterns) of those that are possible (Minsky and Papert 1969). To perform any arbitrary classification successfully, "hidden" units and/or feedback between units are required. The problem is to train such networks, and recently quite powerful methods have become available, most notably "backpropagation" in its various forms (e.g., Rumelhart et al. 1986). However, because many of these methods are based on hill climbing, which has the perennial problem of becoming stuck in local optima, they cannot guarantee that the classification will be learned. Another problem is that a priori no realistic estimate can be made of the number of hidden units that are required. Recently, methods such as the tiling algorithm (Mezard and Nadal 1989) and others (Gallant 1986a; Nadal 1989) have been proposed that get around both these problems. In these, the hidden units are constructed in layers one by one as they are needed. By showing that at least one unit in each layer makes fewer errors than a corresponding unit in the previous layer, eventual convergence to zero errors is guaranteed.
Neural Computation 2, 198-209 (1990) © 1990 Massachusetts Institute of Technology
The method described here also constructs units as it goes, but in a simple and quite different way. Instead of building layers from the input outward until convergence, new units are interpolated between the input layer and the output. The role of these units is to correct mistakes made by the output unit.
2 Basics
Suppose we are given a binary classification to be learned. Each input pattern of $N$ binary values has an associated target output that the network must learn to produce. The units are all linear threshold units connected by variable weights to the inputs, with output $o$ given by

$$o = \begin{cases} 1 & \text{if } \phi \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where

$$\phi = \sum_{i=0}^{N} W_i \xi_i$$

The $W_i$ are the weights, and $\xi_i$ is the value of the $i$th input in the given pattern. The necessary threshold or "bias" is included by having an extra input (here $\xi_0$) that is set to 1 for all the input patterns. On presentation of pattern $p$, perceptron learning (Rosenblatt 1962) alters the weights if the target $t^p$ differs from the output:

$$W_i \to W_i + (t^p - o^p)\,\xi_i^p \qquad (2.1)$$

Using this method any linearly separable class will be learned, but when the patterns are not linearly separable the values of the weights never stabilize. However, a simple extension called the pocket algorithm (Gallant 1986b) suffices to make the system well behaved. This consists of running a perceptron exactly as above with a random presentation of patterns, but also keeping a copy of the set of weights that has lasted longest without being changed. This set of weights will give the minimum possible number of errors with a probability approaching unity as the training time increases. That is, if a solution giving, say, $p$ or fewer errors exists, then the pocket algorithm can be used to find it (although unfortunately there is no bound known for the training time actually required). I make use of this algorithm to demonstrate convergence to zero errors.
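The pocket procedure described above can be sketched in a few lines. This is an illustrative version only, assuming NumPy; the names `threshold_output` and `pocket`, the zero initial weights, and the fixed number of presentations are choices made here, not details taken from Gallant (1986b).

```python
import numpy as np

def threshold_output(weights, pattern):
    """Linear threshold unit: 1 if the weighted sum is >= 0, else 0."""
    return 1 if weights @ pattern >= 0 else 0

def pocket(patterns, targets, n_steps=10000, seed=0):
    """Perceptron learning that keeps the longest-lived weights 'in the pocket'.

    `patterns` has shape (P, N + 1); one column is fixed at 1 and
    supplies the bias.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(patterns.shape[1])
    pocket_w, run, best_run = w.copy(), 0, 0
    for _ in range(n_steps):
        p = rng.integers(len(patterns))          # random presentation of patterns
        o = threshold_output(w, patterns[p])
        if o == targets[p]:
            run += 1                             # current weights survived again
            if run > best_run:
                best_run, pocket_w = run, w.copy()
        else:
            w = w + (targets[p] - o) * patterns[p]   # perceptron rule
            run = 0
    return pocket_w
```

On a linearly separable problem such as logical AND, ordinary perceptron learning converges, the run length then grows without interruption, and the pocket ends up holding the converged weights.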
3 Rationale
The basic idea is that a unit builds other units to correct its mistakes. Any unit (say Z) can make two kinds of mistake:
"wrongly ON" ($o_Z^p = 1$, but $t_Z^p = 0$)
"wrongly OFF" ($o_Z^p = 0$, but $t_Z^p = 1$)
Consider patterns for which Z is wrongly ON: Z could be corrected by a large negative weight from a new unit (say X), which is active only for those patterns. Likewise when Z is wrongly OFF it could be corrected by a large positive weight from another unit (say Y), which is active at the right time. Hence X and Y (also connected to the input layer by variable weights) can be trained with targets which depend on Z's response. These new units might be called "daughters" since they are generated¹ by the established "parent" unit, Z.
Consider, for example, the targets we should assign to X, the unit whose role is to inhibit Z. We would like X to be active if Z was wrongly ON, and silent if Z was correctly ON. Similarly X should be silent if Z was wrongly OFF (to avoid further inhibition of Z). Finally, X could be silent if Z was correctly OFF, although if X is active in this case, the effect is merely to reinforce Z's response when it was already correct. This does not itself cause an error, so in practice we can eliminate these patterns from X's training set. This elimination makes the problem easier and faster to solve, but is not essential for the error-correcting property described below. Targets for Y are similarly derived. These target assignments are summarized in Figure 1. An important point is that the "raw" output of unit Z is used to set the daughter's targets, rather than the value of Z after the daughters have exerted any effect, since this would introduce feedback.
Two useful results follow immediately from this training method, because it essentially gives daughters (X or Y) an easier problem to solve than their parent (Z). First, daughters can always make fewer errors than their parent, and second, connecting daughter to parent with the appropriate weight will always reduce the errors made by the parent.
Proof. Z's errors are

$$e(Z) = e(Z)_{ON} + e(Z)_{OFF}$$

where $e(Z)_{ON}$ is the number of patterns for which Z is wrongly ON.
¹ Note, however, that the activity proceeds from daughter to parent.
Figure 1: Correcting a parent unit: the left-hand table gives the targets, $t_X$, for the daughter unit X for each combination of ($o_Z$, $t_Z$). For example, the lower left-hand entry assigns $t_X$ to be 1 when $o_Z = 1$ and $t_Z = 0$: the "wrongly ON" case. Similarly the right-hand table gives the values of $t_Y$ for the daughter unit Y. The dotted line represents the flow of this target information. The "starred" entries correspond to cases where the pattern could be eliminated from the daughter's training set.
t_X          t_Z = 0    t_Z = 1
o_Z = 0      0*         0
o_Z = 1      1          0
t_Y          t_Z = 0    t_Z = 1
o_Z = 0      0          1
o_Z = 1      0          0*
If X responded OFF to every pattern, it would make as many errors as there were patterns of target $t_X = 1$. However, X can always do better than this. In particular, it can always be ON for at least one input pattern whose target is 1 and OFF for all other patterns. For example, if the input weights are

$$W_i = 2\xi_i^p - 1$$

with a suitably chosen bias weight, only the $p$th pattern turns the unit ON. Given that the pocket algorithm can find the optimal weights visited by a perceptron with any given probability, at worst X could find the above weights. Therefore

$$e(X) < e(Z)_{ON} \le e(Z) \qquad (3.1)$$

A similar argument applies to Y. It also follows that Z's errors are reduced by X, since

$$\begin{aligned} e(Z \text{ with } X) &= e(X)_{ON} + e(X)_{OFF} + e(Z)_{OFF} \\ &= e(X) + e(Z)_{OFF} \\ &< e(Z) \end{aligned} \qquad (3.2)$$

and similarly for Y on its own. When the joint action of X and Y is considered, the same result holds, that is, $e(Z \text{ with } X, Y) < e(Z) - 1$. In the next section an algorithm that uses the first of these results is described. Other possibilities are discussed in Section 6.
4 The Upstart Algorithm
Assume we already have a unit Z that sees input patterns $\xi_i^p$: $i = 1, \ldots, N$ and has associated targets $t_Z^p$. The weights from the input layer to Z are trained to minimize the discrepancies between Z's output and target and, once trained, these weights remain frozen. This "first" unit is actually the eventual output unit, and its targets are the classification to be learned. The following two steps are then applied recursively, generating a binary branching tree of units. Thus daughter units behave just as Z did, constructing daughter units themselves if they are required.
Step 1. If Z makes any "wrongly ON" mistakes, it builds a new unit X, using the targets given in Figure 1. Similarly if Z is ever "wrongly OFF" it builds a unit Y. Apart from the different targets, these units are trained and then frozen just as Z was.
Step 2. The outputs of X and Y are connected as inputs to Z. The weight from X is large and negative while that from Y is large and positive. The size of the weight from X [Y] needs to exceed the sum of Z's positive [negative] input weights, which could be done explicitly or by perceptron learning.
New units are only generated if the parent makes errors, and the number of errors decreases at every branching. It follows that eventually none of the terminal daughters makes any mistakes, so neither do their parents, and neither do their parents, and so on. Therefore every unit in the whole tree produces its target output, including Z, the output unit. Hence the classification is learned.
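The recursion of Steps 1 and 2 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the original: it uses NumPy; each unit is trained by a simple "keep the best weights seen" perceptron (`train`) rather than the pocket algorithm or the modified rule used in the simulations; every daughter is trained on the whole pattern set, so the starred patterns are not eliminated; the patterns are distinct binary vectors whose last column is a constant 1 for the bias; and the fallback `detector` uses the weights $W_i = 2\xi_i^p - 1$ from the proof with a bias of $-\sum_i \xi_i^p$, a value chosen here so that only pattern $p$ gives $\phi \ge 0$. All function names are hypothetical.

```python
import numpy as np

def out(w, X):
    """Threshold outputs (0/1) of a unit with input weights w on patterns X."""
    return (X @ w >= 0).astype(int)

def train(X, t, n_steps=5000, seed=0):
    """Perceptron learning that returns the visited weights with fewest errors."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), int(np.sum(out(w, X) != t))
    for _ in range(n_steps):
        p = rng.integers(len(X))
        o = int(X[p] @ w >= 0)
        if o != t[p]:
            w = w + (t[p] - o) * X[p]
            err = int(np.sum(out(w, X) != t))
            if err < best_err:
                best_w, best_err = w.copy(), err
    return best_w

def detector(X, p):
    """Weights that are ON only for pattern p (binary inputs, bias in last column)."""
    w = 2.0 * X[p] - 1.0
    w[-1] = -X[p, :-1].sum()
    return w

def build(X, t, seed=0):
    """Recursively build a unit plus the daughters needed to reproduce t on X."""
    w = train(X, t, seed=seed)
    if np.sum(out(w, X) != t) >= t.sum() > 0:
        w = detector(X, int(np.argmax(t)))   # guarantees fewer errors than all-OFF
    o = out(w, X)
    node = {"w": w, "big": np.abs(w).sum() + 1.0, "X": None, "Y": None}
    t_x = ((o == 1) & (t == 0)).astype(int)  # daughter X fires where Z is wrongly ON
    t_y = ((o == 0) & (t == 1)).astype(int)  # daughter Y fires where Z is wrongly OFF
    if t_x.any():
        node["X"] = build(X, t_x, seed + 1)
    if t_y.any():
        node["Y"] = build(X, t_y, seed + 2)
    return node

def predict(node, X):
    """Output of a unit after its daughters (if any) have corrected it."""
    phi = X @ node["w"]
    if node["X"] is not None:
        phi = phi - node["big"] * predict(node["X"], X)   # large negative weight
    if node["Y"] is not None:
        phi = phi + node["big"] * predict(node["Y"], X)   # large positive weight
    return (phi >= 0).astype(int)

def count_units(node):
    """Number of units in the tree, including the output unit."""
    return 0 if node is None else 1 + count_units(node["X"]) + count_units(node["Y"])
```

Because the fallback ensures each daughter makes fewer errors than it has target-1 patterns, the error count shrinks at every branching and the recursion stops, mirroring the convergence argument above. A typical call would build the tree with `build(patterns, targets)` and check it with `predict`.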
5 Simulations

In all the simulations shown here the "starred" entries in Figure 1 were not included in a daughter's training set.² To decrease training times, a fast and well behaved version of perceptron learning (Frean 1990) was used to train the weights, in preference to the pocket algorithm. While this method is not guaranteed to find the optimal weights, in practice it produces substantially fewer errors in a given time than the pocket algorithm. The weight changes given by the usual perceptron learning rule (equation 2.1) were simply multiplied by a temperature-dependent factor.
This factor decreases with $|\phi|$, which measures how "serious" the error is. The rationale behind this approach is that an error where $|\phi|$ is large is difficult to correct without causing other errors and should therefore be weighted as less significant than those where $|\phi|$ is small. The "temperature" T controls how strongly this weighting is biased to small $|\phi|$. T was reduced linearly from $T_0$ to zero over the entire training period. At high T the perceptron rule is recovered, but as T decreases the weights are "frozen." Unless otherwise stated, $T_0$ was set to 1, and 1000 passes were made through the training set.

² If the whole training set is used in every case, the number of units produced is relatively unaffected for the problems investigated here, but the training time (a combination of the time per epoch and number of epochs required to generate a comparable network) is approximately doubled for the problems discussed.

5.1 Parity. In this problem the output should be ON if the number of active inputs is odd, and OFF if it is even. Parity is often cited as a difficult problem for neural networks to learn, being a predicate of order N (Minsky and Papert 1969); that is, at least one hidden unit must sample all of the N inputs. It is also of interest because there is a known solution consisting of a single layer of N hidden units projecting to an output unit. It is easy to see how the upstart algorithm tackles parity (see Fig. 2). Essentially the same structure as that shown for N = 3 would arise for any N, although for large problems the optimal weights become much harder to find. I have tried parity for up to N = 10, and in all cases N units are constructed, including the output unit. In all cases except N = 10, 1000 passes were sufficient to generate the minimal tree. For 10-bit parity, the figure was 10,000.

Figure 2: Solution for 3-bit parity. The output unit Z on its own can clearly make a minimum of two mistakes, when the plane defined by its weights cuts the cube as shown. X corrects the wrongly ON pattern by responding to it alone, and similarly Y corrects the wrongly OFF pattern.

5.2 Random Mappings. In this problem the binary classification is defined by assigning each of the $2^N$ patterns its target 0 or 1 with 50% probability. Again this is a difficult problem, due to the absence of correlations and structure in the input for the network to exploit. The networks obtained for N up to 10 are summarized in Figure 3, with comparisons to the tiling algorithm.
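The temperature-weighted learning rule described at the start of Section 5 can be sketched as follows. The exact multiplicative factor is not reproduced in the text, so this sketch assumes the form exp(-|phi|/T), which is only one choice consistent with the description: it decreases with |phi|, approaches 1 (the plain perceptron rule) at high T, and approaches 0 (frozen weights) as T falls. Here phi is taken to be the unit's net input, and the names `thermal_factor` and `thermal_perceptron` are hypothetical.

```python
import numpy as np

def thermal_factor(phi, T):
    """Assumed weighting: near 1 when |phi| << T, near 0 when |phi| >> T."""
    return float(np.exp(-abs(phi) / T))

def thermal_perceptron(X, t, T0=1.0, passes=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])    # constant input acts as the bias
    w = np.zeros(Xb.shape[1])
    for k in range(passes):
        T = T0 * (1.0 - k / passes)              # linear decrease; stays > 0 in the loop
        for i in rng.permutation(len(Xb)):
            phi = Xb[i] @ w                      # net input to the unit
            o = phi > 0
            if o != t[i]:                        # update only on errors
                w = w + (int(t[i]) - int(o)) * thermal_factor(phi, T) * Xb[i]
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([False, False, False, True])        # logical AND as a toy target
w = thermal_perceptron(X, t)
print(np.hstack([X, np.ones((4, 1))]) @ w > 0)
```

Scaling the update rather than the threshold means that late in training only errors close to the decision plane can still move the weights, which is the "freezing" behavior described above.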
Figure 3: Number of units built vs. the number of patterns ($2^N$) for the random mapping problem. The slope of the upstart line is approximately 1/9. Each point is an average of 25 runs, each on a different training set.

5.3 Generalization. Neural networks are often ascribed the property of generalization: the ability to perform well on all patterns taken from a given distribution after having seen only a subset of them. Several workers (Denker et al. 1987; Mezard and Nadal 1989) have looked at generalization using the "2-or-more clumps" predicate. The problem is this: given an input pattern, respond ON if the number of clumps is 2 or greater and OFF otherwise, where a "clump" is a group of adjacent³ 1's bounded on either side by 0's. As with parity, there is a solution consisting of a single hidden layer of N units that would solve the problem exactly. Following Mezard and Nadal, the patterns were generated by a Monte Carlo method (Binder 1979) such that the mean number of clumps is 1.5. I used N = 25 input units, with a training set of up to 600 patterns. The set used to test the resulting net's ability to generalize was a further 600 patterns. The results, with comparisons to the tiling algorithm, are summarized in Figure 4.

6 Discussion
The architecture generated by this procedure is unconventional in that it has a hierarchical tree structure. However, in the case where we choose not to eliminate any training patterns there is an equivalent structure with the same units arranged as a single hidden layer. Consider two daughters (say X, Y) and their parent (Z). With primes denoting "corrected" values, the corrected value $o'_Z$ is always equal to $o_Z - o'_X + o'_Y$. This formulation is possible because the combinations that would disobey this never occur. For example, Y would never be correctly ON if Z was ON. Since this is true of every unit in the tree, the final output is simply a sum of the "raw" responses. For example,

$$\text{output} = o'_Z = o_A - o_B + o_C + \cdots - o_X + o_Y + o_Z$$
Imagine the tree units disconnected from one another and placed in a single layer. A new output unit connected to this "hidden" layer can easily calculate the appropriate sum by, for example, having weights of +1 from each unit that adds to the sum and -1 from each unit that subtracts, with a bias of zero. In effect we can convert a binary tree into a single hidden layer architecture that implements the same mapping, at the expense of adding one unit and being unable to exploit pattern elimination. The algorithm for constructing a single hidden layer architecture is as follows: construct units as before, omitting step 2 (where they are connected into a feedforward tree). Then connect all the units so constructed to a new output unit. The weights can be learned by perceptron learning, or can be inferred from the tree structure: there is a sign reversal for every "X" daughter.

Figure 4: Performance of the method on the "2-or-more clumps" problem. The lower graph shows the percentage generalization as the size of the training set is increased. Plotted above this and with the same abscissa is the size of the corresponding network. $T_0 = 4.0$. There are 25 runs per point, each on a different set. Where not shown, error bars are smaller than the points.

³ Circular boundary conditions are used: input 1 is "adjacent" to input N.

6.1 Extensions. The upstart method can be extended in a number of ways. First, we are not restricted to binary branching trees. Having trained a daughter unit and connected it to the output, a new daughter can now be trained using targets derived from the partially corrected output, instead of the daughter, and so on until no more mistakes are made. This would build a single hidden layer. Hybrid methods, building trees of variable breadth, will also work.

Second, these algorithms can be extended to problems involving multiple output units. A good method should build considerably fewer units than would be obtained by treating each output separately (especially if the output targets are correlated); in other words, maximum mutual use should be made of hidden units. Consider the following algorithm, where steps 1 and 2 are repeated until every output unit makes no mistakes:

Start. There are no hidden units and no connections, so the output units are always OFF.

Step 1. Choose an output unit (say, the one which makes the most errors). Build the appropriate hidden unit to correct some of the mistakes being made by this output unit, as described above. Connect this new unit to all the output units.

Step 2. Train the weights from each unit in this enlarged hidden layer to each of the output units. Reevaluate the numbers of errors made by each output unit.

Hence a single hidden layer is constructed.

6.2 Conclusions. The design of general purpose supervised learning algorithms for neural networks involves two important considerations: the network should succeed in correctly classifying the patterns on which it is trained, and nontrivial solutions should involve as few computational elements as possible, avoiding redundant computation. The upstart algorithm can build a network to implement correctly any boolean mapping. Because each "daughter" cell makes as few errors as possible, it corrects as many "parent" errors as possible, which results in small networks. These networks are smaller than those produced by the tiling algorithm.
In general the minimum number of units required for any given problem cannot be calculated. However, for a few special cases such as parity and the clumps problem it is known that N or fewer units are needed, and the upstart algorithm achieves this. In the potentially worst case where the targets are randomly assigned, m patterns can be correctly classified by approximately m/9 units. The basic idea can be implemented in different architectures and is extendable to the case of multiple outputs.
Acknowledgments

I am grateful to David Willshaw who helped greatly with the manuscript, and also David Wallace for useful comments. I particularly thank Peter Dayan for suggesting that the tree could be "squashed" into a single hidden layer, as well as for helpful discussions.
References

Binder, K. 1979. Monte Carlo methods in statistical physics. Topics Curr. Phys. 7.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction and generalization. Complex Syst. 1, 877-922.
Frean, M. R. 1990. A "Thermal" perceptron for efficient linear discrimination. Unpublished.
Gallant, S. I. 1986a. Three constructive algorithms for network learning. Proc. 8th Ann. Conf. Cog. Sci. Soc., Amherst, MA, Aug. 15-17, pp. 652-660.
Gallant, S. I. 1986b. Optimal linear discriminants. IEEE Proc. 8th Conf. Pattern Recognition, Paris, Oct. 28-31, pp. 849-852.
Mezard, M., and Nadal, J. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A 22(12), 2191-2203.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge.
Nadal, J. 1989. Study of a growth algorithm for neural networks. Int. J. Neural Syst. 1, 55-59.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds. MIT Press, Cambridge.
Received 12 December 1989; accepted 19 March 1990.
Communicated by Halbert White
Layered Neural Networks with Gaussian Hidden Units as Universal Approximations Eric J. Hartman James D. Keeler Microelectronics and Computer Technology Corp. (MCC), 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
Jacek M. Kowalski Department of Physics, University of North Texas, P.O. Box 5368, Denton, TX 76203-5368 USA
A neural network with a single layer of hidden units of gaussian type is proved to be a universal approximator for real-valued maps defined on convex, compact sets of $R^n$.

1 Introduction
Neural networks functioning as approximators of general maps are currently under intense investigation, with concentration on network approximation capabilities and performance for different architectures and different types of hidden units. A very important class of applications is nonlinear signal processing, particularly the prediction problem for a chaotic, deterministic time series. In this case a network learns input sequences and produces a global approximation to an unknown, in general, map of the Takens type (see, e.g., Eckmann and Ruelle 1985) for a deterministic system on its attractor. Such systems may tell the difference between purely random and deterministic processes, and in the latter case allow longer time predictions. The overall performance of various proposed neural net schemes is good for quite different types of hidden units. Lapedes and Farber (1987) used an architecture with two layers of standard "sigmoid" hidden units. They tested the network prediction capability for two model systems: the logistic map and the Mackey-Glass delay equation with its "tunable" attractor characteristics. Nonstandard hidden units with localized receptive fields have been considered in a series of papers by Moody and Darken (1989a,b), Moody (1989), Lee and Kil (1988), and Niranjan and Fallside (1988). The success of approximation schemes with standard (sigmoidal) processing units opened the question of how "good" these devices are as approximators in a given functional space. The network universality property was rigorously proved by Hornik et al. (1989) and Stinchcombe and White (1989) for a broad class of single or multilayer networks where the hidden units are described by any continuous, nonconstant function $G: R \to R$. These units are assumed to be "semiaffine," that is, they process their inputs via composition $G \circ A$, where $A: R^n \to R$ is an affine map: $A(x) = (w, x) + b$, where x is an input vector, w is a vector of "weights," b is a scalar "bias" parameter, and $(w, x)$ denotes the inner product. According to the main result in Hornik et al. (1989), any real-valued continuous function with compact domain can be arbitrarily closely approximated by a network of this type provided a sufficient number of the $G \circ A$ hidden units. Additionally, networks with sigmoidal hidden units were proved to be "almost always" universal approximators, under very mild assumptions on the class of approximated functions (Borel measurability), where these are functions selected from a general "environment" characterized by a measure with possible probabilistic interpretation. A different version of the approximation theorem has been recently proved by Funahashi (1989), for networks with standard sigmoid hidden units, and sigmoidal G-function. The proof in Funahashi (1989), different in spirit from that in Hornik et al. (1989), is based on some general properties of mollifiers and the Irie-Miyake theorem. Other approximation schemes, noted above, use hidden units described by so-called "radial basis functions," $g: R^n \to R$, $g(x) = H(\|x - x_c\|)$, where H is some smooth real function of the distance $\|x - x_c\|$ from the "center" $x_c$ in the input space. In particular, Moody and Darken (1989a,b) worked with the same Mackey-Glass equation as in Lapedes and Farber (1987) using a single layer of hidden units, each unit described by a radial basis function (typically a gaussian) with adjustable parameters.

Neural Computation 2, 210-215 (1990) © 1990 Massachusetts Institute of Technology
These authors reported faster learning schemes with backpropagation restricted to the weights of linear output and preprocessing used within the hidden layer [a cluster algorithm in the input space to select unit centers and some additional procedures to adjust "widths" (or ranges) of each gaussian]. Radial basis functions have an argument nonlinearly dependent on the input vector x and hence the theorems proved in Hornik et al. (1989) (purportedly tailored to a "standard" architecture with semiaffine units) are not directly applicable. That mentioned in Stinchcombe and White (1989) for the case of semiaffine "gaussians" corresponds to the degenerate case for dimensions n > 2, with singular matrix of the related quadratic form. The correlation matrix does not exist in this case and gaussian functions considered in Moody and Darken (1989a,b), Moody (1989), Lee and Kil (1988), and Niranjan and Fallside (1988) are not, in general, semiaffine. In this note we point out that versions of the Stone-Weierstrass theorem [a basic tool in Hornik et al. (1989) and Stinchcombe and White (1989)] apply even more directly to gaussian radial basis hidden units. In particular, linear combinations of gaussians over a compact, convex set form an algebra of maps (see Section 2). The "universal approximation" theorem immediately follows, for simple, single hidden-layer neural nets, as used in Moody and Darken (1989a,b). This puts on a firmer basis the use of gaussian hidden units, previously motivated by computational convenience and supported by some biological evidence (see Section 3). In Section 3 we also address some related questions [i.e., recently proposed generalized units (Durbin and Rumelhart 1989)] and list some open problems.
2 Gaussian Hidden Units as Universal Approximators
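Let $x_\alpha$ and $x_\beta$ be two given points in the $R^n$ space. Consider a "weighted distance" map $h: R^n \to R$ given by

$$h(x) = a_\alpha \|x - x_\alpha\|^2 + a_\beta \|x - x_\beta\|^2 \tag{2.1}$$

where $a_\alpha$ and $a_\beta$ are real positive constants. Straightforward algebra gives

$$h(x) = (a_\alpha + a_\beta)\|x - x_\gamma\|^2 + a_\alpha a_\beta (a_\alpha + a_\beta)^{-1} \|x_\alpha - x_\beta\|^2 \tag{2.2}$$

where $x_\gamma$ is a convex linear combination of $x_\alpha$ and $x_\beta$:

$$x_\gamma = c_1 x_\alpha + c_2 x_\beta, \quad c_1 = a_\alpha (a_\alpha + a_\beta)^{-1}, \quad c_2 = a_\beta (a_\alpha + a_\beta)^{-1}, \quad c_1 + c_2 = 1 \tag{2.3}$$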
As a next step, consider a two-parameter family $\mathcal{F}$ of restricted gaussians $f_{a_\alpha, x_\alpha}: K \to R$,

$$f_{a_\alpha, x_\alpha}(x) = \exp(-a_\alpha \|x - x_\alpha\|^2), \quad a_\alpha > 0, \; x_\alpha \in K, \; x \in K \tag{2.4}$$

where K is any convex compact subset of $R^n$. Let $\mathcal{L}$ be the set of all finite linear combinations with real coefficients of elements from $\mathcal{F}$. Multiplying two elements from $\mathcal{L}$ one obtains linear combinations of products of gaussians. Such linear combinations are still elements of $\mathcal{L}$ as

$$f_{a_\alpha, x_\alpha}(x)\, f_{a_\beta, x_\beta}(x) = D_{\alpha\beta}\, f_{a_\alpha + a_\beta,\, x_\gamma}(x) \tag{2.5}$$

where $D_{\alpha\beta}$ is a positive constant given by

$$D_{\alpha\beta} = \exp[-a_\alpha a_\beta (a_\alpha + a_\beta)^{-1} \|x_\alpha - x_\beta\|^2] \tag{2.6}$$

Equation (2.5) merely represents a well-known fact that a product of two gaussians is a gaussian (see, e.g., Shavitt 1963). For our purposes it is important, however, that the center of a "new" gaussian is still in K if K is convex. Thus $\mathcal{L}$ is an algebra of gaussians on K, as $\mathcal{L}$ is closed with respect to multiplication:

$$f, g \in \mathcal{L} \;\Rightarrow\; fg \in \mathcal{L} \tag{2.7}$$
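The product rule for gaussians is easy to check numerically. The short sketch below evaluates both sides of the identity at a few random points; the function names `gaussian` and `product_params`, the widths 0.7 and 2.0, and the choice of the unit cube as the convex set K are illustrative assumptions.

```python
import numpy as np

def gaussian(a, center, x):
    """Restricted gaussian exp(-a * ||x - center||^2), evaluated row-wise on x."""
    return np.exp(-a * np.sum((x - center) ** 2, axis=-1))

def product_params(a1, c1, a2, c2):
    """Constant D, width and center of the gaussian equal to the product."""
    a = a1 + a2
    center = (a1 * c1 + a2 * c2) / a             # convex combination of the centers
    D = np.exp(-a1 * a2 / a * np.sum((c1 - c2) ** 2))
    return D, a, center

rng = np.random.default_rng(0)
c1, c2 = rng.random(3), rng.random(3)            # two centers in the unit cube
x = rng.random((5, 3))                           # five test points in the unit cube
D, a, c = product_params(0.7, c1, 2.0, c2)
lhs = gaussian(0.7, c1, x) * gaussian(2.0, c2, x)
rhs = D * gaussian(a, c, x)
print(np.allclose(lhs, rhs))
```

Because the new center is a convex combination of the two original centers, it stays inside the cube, which is the property the argument relies on.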
It is trivial to observe that the algebra $\mathcal{L}$ separates (distinguishes) points of K, that is, that for arbitrary $x_1, x_2 \in K$, $x_1 \ne x_2$, there is a function $f_{a_\alpha, x_\alpha}$ in $\mathcal{L}$ such that $f_{a_\alpha, x_\alpha}(x_1) \ne f_{a_\alpha, x_\alpha}(x_2)$. (Pick, e.g., $x_\alpha = x_1$ and use monotonicity of exponentials.) Moreover, $\mathcal{L}$ does not vanish at any point of K (i.e., for every point $x \in K$ there is obviously a function in $\mathcal{L}$ different from zero there). The above observations allow an immediate use of Stone's theorem as formulated, e.g., in Rudin (1958). For the reader's convenience we quote this theorem.
Theorem 1 (Stone). Let $\mathcal{L}$ be an algebra of real continuous functions on a compact set K, and suppose $\mathcal{L}$ separates points of K and does not vanish at any point of K. Then $\mathcal{L}_u$, the uniform closure of $\mathcal{L}$, contains all real-valued, continuous functions on K.

$\mathcal{L}_u$, the uniform closure of $\mathcal{L}$, is the set of all functions obtained as limits of uniformly convergent sequences of elements from $\mathcal{L}$. Obviously, $\mathcal{L}$ is a subalgebra of the algebra C(K) of all continuous functions on K with the supremum norm. It follows that $\mathcal{L}$ is dense in C(K). Thus we arrive at the following corollary:

Corollary. Let $\mathcal{L}$ be the set of all finite linear combinations with real coefficients of elements from $\mathcal{F}$, the set of gaussian radial basis functions defined on the convex compact subset K of $R^n$ (see equation 2.4). Then any function in C(K) can be uniformly approximated to an arbitrary accuracy by elements of $\mathcal{L}$.

Another version of the Stone theorem (see, e.g., Lorentz 1986) is directly related to more complicated hidden units. Let G = {g} be a family of continuous real-valued functions on a compact set $A \subset R^n$. Consider generalized polynomials in terms of the g-functions: (2.8) where the coefficients are real and the $n_i$ are nonnegative integers. Again, if the family G distinguishes points of K, and does not vanish on K, then each continuous real function can be approximated to arbitrary accuracy by polynomials of the type in 2.8.

3 Comments and Final Remarks
Special properties of the gaussian family allowed a simple proof of the approximation theorem for single hidden-layer networks with a linear output. The condition that gaussians have to be centered at points in K is not very restrictive. Indeed, one can include any reasonable closed "receptor field" S (domain of an approximated map) into a larger convex and compact set K, and continuously extend functions on S onto K (Tietze's extension theorem). For a given (or preselected) set of gaussian centers and standard deviations, one can, of course, find the coefficients of the "best" representation by the least-squares method. Under the additional assumption that the approximated functions are themselves members of the algebra $\mathcal{L}$, one may still have a situation where the number of gaussian components, location of their centers, and standard deviations is unknown (a standard problem in spectral analysis). The procedure proposed by Medgyessy (1961) allows the unique determination of parameters. The general version of the Stone-Weierstrass theorem with the generalized polynomials (see equation 2.8) is ideally suited for architectures using ΣΠ units. However, it cannot be applied to complex-valued functions (Lorentz 1986). Hence, the problem of the approximation abilities of a new type of Π-networks, recently proposed by Durbin and Rumelhart (1989), remains open. In general situations, the theoretical problem of an optimal approximation with varying number of radial basis functions, varying locations and ranges is much more subtle. Selecting some subsets A of the input space (restricting the environment) one may look for an extremal gaussian subspace ("the best approximation system" for all inputs of the type A), or try to characterize A by its metric entropy (Lorentz 1986).
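For preselected centers and a common width, the least-squares fit mentioned above reduces to a linear problem. The following is a small sketch under assumed settings that do not come from the text: a one-dimensional domain K = [0, 1], a sine target, twelve equally spaced centers, and the width parameter a = 50. The helper name `design_matrix` is hypothetical.

```python
import numpy as np

def design_matrix(x, centers, a):
    """Column j holds exp(-a * (x - c_j)^2) evaluated at every sample point."""
    return np.exp(-a * (x[:, None] - centers[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 200)                   # sample points in K = [0, 1]
target = np.sin(2 * np.pi * x)                   # continuous function to approximate

def fit_error(n_centers, a=50.0):
    centers = np.linspace(0.0, 1.0, n_centers)   # preselected centers inside K
    Phi = design_matrix(x, centers, a)
    coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)   # linear output weights
    return float(np.max(np.abs(Phi @ coef - target)))

print("max error, 3 centers: ", fit_error(3))
print("max error, 12 centers:", fit_error(12))
```

Only the output coefficients are fitted here; choosing the centers and widths themselves is the harder problem discussed in the remainder of this section.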
References

Durbin, R., and Rumelhart, D. E. 1989. Product units: A computationally powerful and biologically plausible extension to back propagation networks. Neural Comp. 1, 133.
Eckmann, J. P., and Ruelle, D. 1985. Ergodic theory of chaos and strange attractors. Rev. Mod. Phys. 57, 617.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, in press.
Lapedes, A., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Tech. Rep. LA-UR-88-418, Los Alamos National Laboratory, Los Alamos, NM.
Lee, S., and Kil, R. M. 1988. Multilayer feedforward potential function network, 1-161. IEEE Int. Conf. Neural Networks, San Diego, CA.
Lorentz, G. G. 1986. Approximations of Functions. Chelsea Publ. Co., New York.
Medgyessy, P. 1961. Decomposition of Superpositions of Distribution Functions. Publishing House of the Hungarian Academy of Sciences, Budapest.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. Yale Computer Science, preprint.
Moody, J., and Darken, C. 1989a. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989b. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Niranjan, M., and Fallside, F. 1988. Neural networks and radial basis functions in classifying static speech patterns. Engineering Dept., Cambridge University, CUED/F-INFENG/TR22. Neural Networks 2, 359.
Rudin, W. 1958. Principles of Mathematical Analysis, 2nd rev. ed. McGraw-Hill, New York, Toronto, London.
Shavitt, I. S. 1963. The Gaussian function in calculations of statistical mechanics and quantum mechanics. In Methods in Computational Physics: Quantum Mechanics, Vol. 2, pt. 1, B. Alder, S. Fernbach, and M. Rottenberg, eds. Academic Press, New York.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. Conf. Neural Networks, Washington, D.C., IEEE and INNS, 1, 613.
Received 18 August 1989; accepted 9 February 1990.
Communicated by Richard Lippmann
A Neural Network for Nonlinear Bayesian Estimation in Drug Therapy Reza Shadmehr Department of Computer Science, University of Southern California, Los Angeles, CA 90089 USA
David Z. D'Argenio Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089 USA
The feasibility of developing a neural network to perform nonlinear Bayesian estimation from sparse data is explored using an example from clinical pharmacology. The problem involves estimating parameters of a dynamic model describing the pharmacokinetics of the bronchodilator theophylline from limited plasma concentration measurements of the drug obtained in a patient. The estimation performance of a backpropagation trained network is compared to that of the maximum likelihood estimator as well as the maximum a posteriori probability estimator. In the example considered, the estimator prediction errors (model parameters and outputs) obtained from the trained neural network were similar to those obtained using the nonlinear Bayesian estimator.
1 Introduction

The performance of the backpropagation learning algorithm in pattern classification problems has been compared to that of the nearest-neighbor classifier by a number of investigators (Gorman and Sejnowski 1988; Burr 1988; Weideman et al. 1989). The general finding has been that the algorithm results in a neural network whose performance is comparable to (Burr 1988; Weideman et al. 1989) or better than (Gorman and Sejnowski 1988) the nearest-neighbor technique. Since the probability of correct classification for the nearest-neighbor technique can be used to obtain upper and lower bounds on the Bayes probability of correct classification, the performance of the network trained by Gorman and Sejnowski (1988) is said to have approached that of a Bayes decision rule. Benchmarking the backpropagation algorithm's performance is necessary in pattern classification problems where class distributions intersect. Yet few investigators (Kohonen et al. 1988) have compared the performance of a backpropagation trained network in a statistical pattern recognition or estimation task to the performance of a Bayesian or other statistical estimator. Since Bayesian estimators require a priori knowledge regarding the underlying statistical nature of the classification problem, and simplifying assumptions must be made to apply such estimators in a sparse data environment, a comparison of the neural network and Bayesian techniques would be valuable since neural networks have the advantage of requiring fewer assumptions in representing an unknown system. In this paper we compare the performance of a backpropagation trained neural network developed to solve a nonlinear estimation problem to the performance of two traditional statistical estimation approaches: maximum likelihood estimation and Bayesian estimation. The particular problem considered arises in the field of clinical pharmacology where it is often necessary to individualize a critically ill patient's drug regimen to produce the desired therapeutic response. One approach to this dosage control problem involves using measurements of the drug's response in the patient to estimate parameters of a dynamic model describing the pharmacokinetics of the drug (i.e., its absorption, distribution, and elimination from the body). From this patient-specific model, an individualized therapeutic drug regimen can be calculated. A variety of techniques have been proposed for such feedback control of drug therapy, some of which are applied on a routine basis in many hospitals [see Vozeh and Steimer (1985) for a general discussion of this problem]. In the clinical patient care setting, unfortunately, only a very limited number of noisy measurements are available from which to estimate model parameters. To solve this sparse data, nonlinear estimation problem, both maximum likelihood and Bayesian estimation methods have been employed (e.g., Sheiner et al. 1975; Sawchuk et al. 1977). The a priori information required to implement the latter is generally available from clinical trials involving the drug in target patient populations.

Neural Computation 2, 216-225 (1990) © 1990 Massachusetts Institute of Technology

2 The Pharmacotherapeutic Example
The example considered involves the drug theophylline, which is a potent bronchodilator that is often administered as a continuous intravenous infusion in acutely ill patients for treatment of airway obstruction. Since both the therapeutic and toxic effects of theophylline parallel its concentration in the blood, the administration of the drug is generally controlled so as to achieve a specified target plasma drug concentration. In a population study involving critically ill hospitalized patients receiving intravenous theophylline for relief of asthma or chronic bronchitis, Powell et al. (1978) found that the plasma concentration of theophylline, y(t), could be related to its infusion rate, r(t), by a simple one-compartment, two-parameter dynamic model [i.e., $dy(t)/dt = -(CL/V)\,y(t) + r(t)/V$]. In the patients studied (nonsmokers with no other organ dysfunction), significant variability was observed in the two kinetic model parameters: distribution volume V (liters/kg body weight) = 0.50 ± 0.16 (mean ± SD); elimination clearance CL (liters/kg/hr) = 0.0386 ± 0.0187. In what follows, it will be assumed that the population distribution of V and CL can be described by a bivariate log-normal density with the above moments and a correlation between parameters of 0.5. For notational convenience, $\alpha$ will be used to denote the vector of model parameters ($\alpha = [V \; CL]^T$) and $\mu$ and $\Omega$ used to represent the prior mean parameter vector and covariance matrix, respectively. Given this a priori population information, a typical initial infusion regimen would consist of a constant loading infusion, $r_1$, equal to 10.0 mg/kg/hr for 0.5 hr, followed by a maintenance infusion, $r_2$, of 0.39 mg/kg/hr. This dosage regimen is designed to produce plasma concentrations of approximately 10 µg/ml for the patient representing the population mean (such a blood level is generally effective yet nontoxic). Because of the significant intersubject variability in the pharmacokinetics of theophylline, however, it is often necessary to adjust the maintenance infusion based on plasma concentration measurements obtained from the patient to achieve the selected target concentration. Toward this end, plasma concentration measurements are obtained at several times during the initial dosage regimen to estimate the patient's drug clearance and volume. We assume that the plasma measurements, z(t), can be related to the dynamic model's prediction of plasma concentration, $y(t, \alpha)$, as follows: $z(t) = y(t, \alpha) + e(t)$. The measurement error, e(t), is assumed to be an independent, Gaussian random variable with mean zero and standard deviation of $\sigma(\alpha) = 0.15 \times y(t, \alpha)$. A typical clinical scenario might involve only two measurements, $z(t_1)$ and $z(t_2)$, where $t_1 = 1.5$ hr and $t_2 = 10.0$ hr.
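The one-compartment model and the dosing regimen above can be checked with a few lines of code. This is a sketch under assumptions that are not stated in the text: the patient has no drug on board at t = 0, and the model is solved in closed form for the piecewise-constant infusion rather than with a numerical integrator. The function names `concentration` and `noisy_measurement` are hypothetical.

```python
import math
import random

def concentration(t, V, CL, r1=10.0, t_load=0.5, r2=0.39):
    """Plasma concentration y(t) in mg/liter (equal to µg/ml), assuming y(0) = 0.

    Solves dy/dt = -(CL/V) y + r(t)/V for a loading infusion r1 (mg/kg/hr)
    lasting t_load hours, followed by a maintenance infusion r2.
    """
    k = CL / V                                   # elimination rate constant (1/hr)
    y_load = (r1 / CL) * (1.0 - math.exp(-k * min(t, t_load)))
    if t <= t_load:
        return y_load
    y_ss = r2 / CL                               # steady state under maintenance
    return y_ss + (y_load - y_ss) * math.exp(-k * (t - t_load))

def noisy_measurement(y, rng):
    """z = y + e, with e Gaussian, mean zero, standard deviation 0.15 * y."""
    return y + rng.gauss(0.0, 0.15 * y)

V, CL = 0.50, 0.0386                             # population mean parameters
rng = random.Random(0)
for t in (1.5, 10.0):                            # the two observation times
    y = concentration(t, V, CL)
    print(f"t = {t:4.1f} hr  y = {y:.2f}  z = {noisy_measurement(y, rng):.2f}")
```

For the population-mean patient the model output stays close to the 10 µg/ml target at both observation times, which matches the stated design goal of the regimen.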
The problem then involves estimating V and CL using the measurements made in the patient, the kinetic model, knowledge of the measurement error, as well as the prior distribution of model parameters.

3 Estimation Procedures
Two traditional statistical approaches have been used to solve this sparse data system estimation problem: maximum likelihood (ML) estimation and a Bayesian procedure that calculates the maximum a posteriori probability (MAP). Given the estimation problem defined above, the ML estimate, $\alpha^{ML}$, of the model parameters, $\alpha$, is defined as follows: (3.1)
where $z = [z(t_1) \; z(t_2)]^T$, $y(\alpha) = [y(t_1, \alpha) \; y(t_2, \alpha)]^T$, and $\Sigma(\alpha) = \mathrm{diag}\{\sigma_{t_1}(\alpha), \sigma_{t_2}(\alpha)\}$. The MAP estimator is defined as follows:
where $\nu = \{\nu_i\}$, $i = 1, 2$, $\Phi = \{\phi_{ij}\}$, $i, j = 1, 2$, with $\nu_i = \ln \mu_i - \phi_{ii}/2$, $i = 1, 2$, and $\phi_{ij} = \ln(\omega_{ij}/\mu_i \mu_j + 1)$, $i, j = 1, 2$. The mean and covariance of the prior parameter distribution, $\mu$ and $\Omega$ (see above), define the quantities $\mu_i$ and $\omega_{ij}$. Also, $\Lambda(\alpha) = \mathrm{diag}\{\ln \alpha_1, \ln \alpha_2\}$. The corresponding estimates of the drug's concentration in the plasma can also be obtained using the above parameter estimates together with the kinetic model. To obtain the ML and MAP estimates a general purpose pharmacokinetic modeling and data analysis software package was employed, which uses the Nelder-Mead simplex algorithm to perform the required minimizations and a robust stiff/nonstiff differential equation solver to obtain the output of the kinetic model (D'Argenio and Schumitzky 1988).
As an alternate approach, a feedforward, three-layer neural network was designed and trained to function as a nonlinear estimator. The architecture of this network consisted of two input units, seven hidden units, and four output units. The number of hidden units was arrived at empirically. The inputs to this network were the patient's noisy plasma samples $z(t_1)$ and $z(t_2)$, and the outputs were the network's estimates for the patient's distribution volume and elimination clearance ($\alpha^{NN}$) as well as for the theophylline plasma concentration at the two observation times [$y(t_1)$, $y(t_2)$].
To determine the weights of the network, a training set was simulated using the kinetic model defined above. Model parameters (1000 pairs) were randomly selected according to the log-normal prior distribution defining the population ($\alpha_i$, $i = 1, \ldots, 1000$), and the resulting model outputs determined at the two observation times [$y(t_1, \alpha_i)$, $y(t_2, \alpha_i)$, $i = 1, \ldots, 1000$]. Noisy plasma concentration measurements were then simulated [$z(t_1)_i$, $z(t_2)_i$, $i = 1, \ldots, 1000$] according to the output error model defined previously. From this set of inputs and outputs, the backpropagation algorithm (Rumelhart et al. 1986) was used to train the network as follows. A set of 50 vectors was selected from the full training set, which included the vectors containing the five smallest and five largest values of V and CL. After the vectors had been learned, the performance of the network was evaluated on the full training set. Next, 20 more vectors were added to the original 50 vectors and the network was retrained. This procedure was repeated until addition of 20 new training vectors did not produce appreciable improvement in the ability of the network to estimate parameters in the full training set. The final network was the result of training on a set of 170 vectors, each vector being presented to the network approximately 32,000 times. As trained, the network approximates the minimum expected (over the space of parameters and observations) mean squared error estimate for $\alpha$, $y(t_1)$, and $y(t_2)$. [See Asoh and Otsu (1989) for discussion of the relation between nonlinear data analysis problems and neural networks.]
4 Results
The performance of the three estimators (ML, MAP, NN) was evaluated using a test set (1000 elements) simulated in the same manner as the training set. Figures 1 and 2 show plots of the estimates of V and CL, respectively, versus their true values from the test set data, using each of the three estimators. Also shown in each graph are the lines of regression (solid line) and identity (dashed line). To better quantify the performance of each estimator, the mean and root mean squared prediction error (Mpe and RMSpe, respectively) were determined for each of the two parameters and each of the two plasma concentrations. For example, the prediction error (percent) for the NN volume estimate was calculated as $pe_i = (V_i^{NN} - V_i) \, 100 / V_i$, where $V_i$ is the true value of volume for the ith sample from the test set and $V_i^{NN}$ is the corresponding NN estimate. Table 1 summarizes the resulting values of the Mpe for each of the three estimators. From inspection of Table 1 we conclude that the biases associated with each estimator, as measured by the Mpe for each quantity, are relatively small and comparable. As a single measure of both the bias and variability of the estimators, the RMSpe values given in Table 2 indicate that, with respect to the parameters V and CL, the precision of the NN and MAP estimators is similar and significantly better than that of the ML estimator in the example considered here. For both the nonlinear maximum likelihood and Bayesian estimators, an asymptotic error analysis could be employed to provide approximate errors for given parameter estimates. In an effort to supply some type of
Estimator    V      CL     y(t1)   y(t2)
ML           2.5    6.1    -1.1    -3.0
MAP          3.4    4.7     0.8     1.5
NN           1.0    3.8     0.6     7.3
Table 1: Mean Prediction Errors (Mpe, %) for the Parameters (V and CL) and Plasma Concentrations [$y(t_1)$ and $y(t_2)$] as Calculated, for Each of the Three Estimators, from the Simulated Test Set.
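The two summary statistics reported in Tables 1 and 2 follow directly from the percent prediction error defined in the text. A minimal sketch is given below; the function names are hypothetical, and the numbers in the usage note are made-up illustration values, not data from the study.

```python
import math

def prediction_errors(estimates, true_values):
    """Percent prediction errors pe_i = (estimate_i - true_i) * 100 / true_i."""
    return [(est - true) * 100.0 / true
            for est, true in zip(estimates, true_values)]

def mpe(estimates, true_values):
    """Mean prediction error (percent): a measure of bias."""
    pe = prediction_errors(estimates, true_values)
    return sum(pe) / len(pe)

def rmspe(estimates, true_values):
    """Root mean squared prediction error (percent): bias and variability combined."""
    pe = prediction_errors(estimates, true_values)
    return math.sqrt(sum(e * e for e in pe) / len(pe))
```

For instance, with hypothetical true volumes `[0.4, 0.5, 0.8]` and estimates `[0.44, 0.45, 0.8]`, the individual errors are +10%, -10%, and 0%, so the Mpe is close to zero while the RMSpe is about 8.2%. This is why the text treats Mpe as a bias measure and RMSpe as a single measure of both bias and variability.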
Figure 1: Estimates of V for the ML, MAP, and NN procedures (top to bottom), plotted versus the true value of V for each of the 1000 elements of the test set. The corresponding regression lines are as follows: $V^{ML} = 1.0V + 0.004$, $r^2 = 0.74$; $V^{MAP} = 0.80V + 0.094$, $r^2 = 0.81$; $V^{NN} = 0.95V + 0.044$, $r^2 = 0.80$.
[Figure 2 plot; horizontal axis: CL (L/kg/hr)]
Figure 2: Estimates of CL for the ML, MAP, and NN procedures (top to bottom), versus their true values as obtained from the test set data. The corresponding regression lines are as follows: $CL^{ML} = 0.96CL + 0.002$, $r^2 = 0.61$; $CL^{MAP} = 0.73CL + 0.010$, $r^2 = 0.72$; $CL^{NN} = 0.69CL + 0.010$, $r^2 = 0.69$.
Estimator    V      CL     y(t1)   y(t2)
ML           21     44     16      16
MAP          14     30     12      13
NN           16     31     13      14
Table 2: Root Mean Square Prediction Errors (RMSpe, %) for Each Estimator.
error analysis for the NN estimator, Figure 3 was constructed from the test set data and estimation results. The upper panel shows the mean and standard deviation of the prediction error associated with the NN estimates of V in each of the indicated intervals. The corresponding results for CL are shown in the lower panel of Figure 3. These results could then be used to provide approximate error information corresponding to a particular point estimate ($V^{NN}$ and $CL^{NN}$) from the neural network.
5 Discussion
These results demonstrate the feasibility of using a backpropagation trained neural network to perform nonlinear estimation from sparse data. In the example presented herein, the estimation performance of the network was shown to be similar to a Bayesian estimator (maximum a posteriori probability estimator). The performance of the trained network in this example is especially noteworthy in light of the considerable difficulty in resolving parameters due to the uncertainty in the mapping model inherent in this estimation problem, which is analogous to intersection of class distributions in classification problems. While the particular example examined in this paper represents a realistic scenario involving the drug theophylline, to have practical utility the resulting network would need to be generalized to accommodate different dose infusion rates, dose times, observation times, and number of observations. Using an appropriately constructed training set, simulated to reflect the above, it may be possible to produce such a sufficiently generalized neural network estimator that could be applied to drug therapy problems in the clinical environment. It is of further interest to note that the network can be trained on simulations from a more complete model for the underlying process (e.g., physiologically based model as opposed to the compartment type model used herein), while still producing estimates of parameters that will be of primary clinical interest (e.g., systemic drug clearance, volume of distribution). Such an approach has the important advantage over traditional statistical estimators of building into the estimation procedure robustness to model simplification errors.
[Figure 3 plot; upper panel horizontal axis: $V^{NN}$ (L/kg)]
Figure 3: Distribution of prediction errors of volume (upper) and clearance (lower) for the NN estimator as obtained from the test set data. Prediction errors are displayed as mean (●) plus one standard deviation above the mean.
Acknowledgments
This work was supported in part by NIH Grant P41-RR01861. R.S. was supported by an IBM fellowship in Computer Science.
References
Asoh, H., and Otsu, N. 1989. Nonlinear data analysis and multilayer perceptrons. IEEE Int. Joint Conf. Neural Networks II, 411-415.
Burr, D. J. 1988. Experiments on neural net recognition of spoken and written text. IEEE Trans. Acoustics, Speech, Signal Processing 36, 1162-1168.
D'Argenio, D. Z., and Schumitzky, A. 1988. ADAPT II User's Guide. Biomedical Simulations Resource, University of Southern California, Los Angeles.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Kohonen, T., Barna, G., and Chrisley, R. 1988. Statistical pattern recognition with neural networks: Benchmarking studies. IEEE Int. Conf. Neural Networks I, 61-68.
Powell, J. R., Vozeh, S., Hopewell, P., Costello, J., Sheiner, L. B., and Riegelman, S. 1978. Theophylline disposition in acutely ill hospitalized patients: The effect of smoking, heart failure, severe airway obstruction, and pneumonia. Am. Rev. Resp. Dis. 118, 229-238.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Sawchuk, R. J., Zaske, D. E., Cipolle, R. J., Wargin, W. A., and Strate, R. G. 1977. Kinetic model for gentamicin dosing with the use of individual patient parameters. Clin. Pharmacol. Ther. 21, 362-369.
Sheiner, L. B., Halkin, H., Peck, C., Rosenberg, B., and Melmon, K. L. 1975. Improved computer-assisted digoxin therapy: A method using feedback of measured serum digoxin concentrations. Ann. Intern. Med. 82, 619-627.
Vozeh, S., and Steimer, J.-L. 1985. Feedback control methods for drug dosage optimisation: Concepts, classification and clinical application. Clin. Pharmacokinet. 10, 457-476.
Weideman, W. E., Manry, M. T., and Yau, H. C. 1989. A comparison of a nearest neighbor classifier and a neural network for numeric hand print character recognition. IEEE Int. Joint Conf. Neural Networks I, 117-120.
Received 16 November 1989; accepted 6 February 1990.
Communicated by Jack Cowan
Analysis of Neural Networks with Redundancy
Yoshio Izui*
Alex Pentland
Vision Science Group, E15-387, The Media Laboratory, Massachusetts Institute of Technology, 20 Ames Street, Cambridge, MA 02139 USA
Biological systems have a large degree of redundancy, a fact that is usually thought to have little effect beyond providing reliable function despite the death of individual neurons. We have discovered, however, that redundancy can qualitatively change the computations carried out by a network. We prove that for both feedforward and feedback networks the simple duplication of nodes and connections results in more accurate, faster, and more stable computation. 1 Introduction
It has been estimated that the human brain has more than $10^{11}$ neurons in all of its many functional subdivisions, and that each neuron is connected to around $10^4$ other neurons (Amari 1978; DARPA 1988). Furthermore, these neurons and connections are redundant. Artificial systems, in contrast, have been very much smaller and have normally had no redundancy at all. This lack of redundancy in artificial systems is due both to cost and to the generally held notion that redundancy in biological systems serves primarily to overcome the problems caused by the death of individual neurons. While it is true that redundant neural networks are more resistant to such damage (Tanaka et al. 1988), we will show that there are other, perhaps more important computational effects associated with network redundancy. In this paper we mathematically analyze the functional effects of neuron duplication, the simplest form of redundancy, and prove that duplicated neural networks converge faster than unduplicated networks, require less accuracy in interneuron communication, and converge to more accurate solutions. These results are obtained by showing that each duplicated network is equivalent to an unduplicated one with sharpened nonlinearities and initial values that are normally distributed with a smaller variance.
*Current address: Industrial Systems Lab., Mitsubishi Electric Corp., 8-1-1, Tsukaguchi, Amagasaki, Hyogo 661 Japan.
Neural Computation 2, 226-238 (1990) © 1990 Massachusetts Institute of Technology
Further, we prove that the asynchronous operation of such networks produces faster and more stable convergence than the synchronous operation of the same network. These results are obtained by showing that each asynchronous network is approximately equivalent to a synchronous network that uses the Hessian of the energy function to update the network weights. 2 Feedforward Neural Networks
2.1 Network Duplication. For simplicity we start by considering three-layer feedforward neural networks (Rumelhart et al. 1986), which are duplicated L times at the input layer and M times at the hidden layer. A duplication factor of M means that M neurons have exactly the same inputs, that is, each input forks into M identical signals that are fed into the corresponding neurons. To produce a duplicated network from an unduplicated one, we start by copying each input layer neuron and its input-output connections L times, setting the initial weights between input and hidden layers to uniformly distributed random values. We then duplicate the hidden layer M times by simply copying each neuron and its associated weights and connections M times. Although we will not mathematically analyze the case of hidden-layer weights that are randomly distributed rather than simply copied, it is known experimentally that randomly distributed weights produce better convergence. The energy function or error function of these neural networks is defined as:
where $I_i^p$ and $O_k^p$ are the ith input and kth output of the pth training data, $W^{(1)}_{j_m i_l}$ is the weight between the ith neuron with lth duplication at the input layer and the jth neuron with mth duplication at the hidden layer, $W^{(2)}_{k j_m}$ is the weight between the jth neuron with mth duplication at the hidden layer and the kth neuron at the output layer, and $g(x) = 1/(1 + e^{-x})$ is the sigmoid function. For $n = 1, 2$ and $r = j_m i_l, \; k j_m$, learning can be conducted by either the simple gradient method,
or by the momentum method,
where $\eta$ and $\beta$ may be thought of as the network gain, and $(1 - \alpha)$ as the network damping.
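The duplication scheme can be made concrete with a small forward-pass sketch. It is an illustration under stated assumptions, not the authors' code: a single duplication factor d is used for both layers, all copies of a hidden neuron share their incoming weights (the "simply copied" case), and the array layout and function names are hypothetical. The sketch also checks numerically that summing over duplicates gives the same output as an unduplicated network that uses the duplicate-averaged weights with the sigmoid argument multiplied by d.

```python
import math
import random

def g(x):
    """Sigmoid transfer function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def duplicated_forward(x, w1, w2):
    """Forward pass of the duplicated network.

    w1[j][i][l]: weight from the l-th copy of input i to hidden neuron j
                 (assumed shared by all copies of hidden neuron j).
    w2[k][j][m]: weight from the m-th copy of hidden neuron j to output k.
    """
    hidden = [
        g(sum(w * x[i] for i, w_ji in enumerate(w_j) for w in w_ji))
        for w_j in w1
    ]
    return [
        g(sum(w * hidden[j] for j, w_kj in enumerate(w_k) for w in w_kj))
        for w_k in w2
    ]

def equivalent_forward(x, w1, w2, d):
    """Unduplicated network with averaged weights and sigmoid g(d * x)."""
    w1_bar = [[sum(w_ji) / d for w_ji in w_j] for w_j in w1]
    w2_bar = [[sum(w_kj) / d for w_kj in w_k] for w_k in w2]
    hidden = [g(d * sum(w * xi for w, xi in zip(row, x))) for row in w1_bar]
    return [g(d * sum(w * h for w, h in zip(row, hidden))) for row in w2_bar]

def random_weights(rows, cols, d, rng, scale=0.5):
    """Uniform random weights in [-scale, scale], with d duplicates per connection."""
    return [[[rng.uniform(-scale, scale) for _ in range(d)]
             for _ in range(cols)] for _ in range(rows)]
```

With `rng = random.Random(1)`, `w1 = random_weights(3, 2, 5, rng)`, and `w2 = random_weights(1, 3, 5, rng)`, the two functions return the same output for any input up to floating-point rounding. The agreement is exact here only because of the shared-incoming-weights assumption; with independently drawn weights for each hidden copy it would hold only approximately.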
2.2 Equivalent Neural Networks. By employing the average of the duplicated weights
which is normally distributed we can derive an unduplicated network that is equivalent to the above duplicated network. The energy function of this unduplicated network is
where we now assume a single network duplication factor D = L = M for simplicity of exposition. By rewriting equations 2.2 and 2.3 using the averaged weights we can derive learning equations for this equivalent, unduplicated network. For the gradient update rule we have (2.6)
for the momentum update rule we obtain
Comparison of the original unduplicated network's update equations with the above update rules for the duplicated network's equivalent shows that duplicated networks have their sigmoid function sharpened by the network duplication factor D, but that the "force" causing the weights to evolve is reduced by $1/D$. As a consequence both the duplicated and unduplicated networks follow the same path in weight space as they evolve. The factor of D will, however, cause the duplicated network to converge much faster than the unduplicated network, as will be shown in the next section.
2.3 Convergence Speed.
2.3.1 The Gradient Descent Update Method. We first consider the gradient descent method of updating the weights. First, let us define $\tilde W^{(1)}_{ji} = D \bar W^{(1)}_{ji}$, $\tilde W^{(2)}_{kj} = D \bar W^{(2)}_{kj}$, and $\tilde v^{(1)}_{ji} = d\tilde W^{(1)}_{ji}/dt$, $\tilde v^{(2)}_{kj} = d\tilde W^{(2)}_{kj}/dt$. The network convergence time $T_{gradient}$ can be obtained by first rewriting equation 2.6 to obtain an expression for $dt$, and then by integrating $dt$:
where
For given training data and initial values the integral part of equation 2.8 is a constant. Thus we obtain the result that a network's convergence time $T_{gradient}$ is proportional to $1/D$, where D is the network duplication factor.¹ The dramatic speed-up in convergence caused by duplication may be a major reason biological systems employ such a high level of redundancy. Note that the same D-fold speed-up can be achieved in an unduplicated network by simply sharpening the sigmoid function by a factor of D. Readers should be cautioned, however, that these equations describe a continuous system, whereas computer simulations employ a finite difference scheme that uses discrete time steps. As D becomes large the finite difference approximation can break down, at which point use of a D-sharpened sigmoid will no longer result in faster convergence. The value of D at which breakdown occurs is a function of the network's maximum weight velocities.
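The caution about finite difference schemes can be illustrated with a deliberately simple toy. The sketch below is not the network of this paper: it assumes a one-dimensional quadratic energy $E = w^2/2$ whose gradient flow is sped up by a factor d, and integrates it with forward-Euler steps. The function name and default parameters are hypothetical.

```python
def euler_steps_to_converge(d, eta=1.0, dt=0.01, w0=1.0, tol=1e-3,
                            max_steps=100_000):
    """Count Euler steps of dw/dt = -eta * d * w until |w| < tol.

    Toy energy E = w**2 / 2 (an assumption for illustration only).
    Returns None if the iteration blows up or fails to converge.
    """
    w = w0
    for step in range(1, max_steps + 1):
        w -= dt * eta * d * w          # one finite-difference update
        if abs(w) < tol:
            return step
        if abs(w) > 1e6:               # discrete scheme has become unstable
            return None
    return None
```

For small d the step count falls roughly in proportion to $1/d$, mirroring the continuous-time result: going from `d=1` to `d=10` cuts the count by about a factor of ten. Once `eta * d * dt` exceeds 2 (for example `d=250` with the defaults), each update overshoots by more than it corrects and the function returns `None`, which is the kind of breakdown the text warns about.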
2.3.2 The Momentum Update Method. The more complex momentum update method may be similarly treated. The convergence time $T_{momentum}$ is (2.10)
(2.11) and for $n = 1, 2$, (2.12)
(2.13)
If we assume that $\beta D \gg (1 - \alpha)$, that is, that the gain of the system times the sensitivity of the sigmoid function is much larger than the amount of damping in the system, then the damping may be ignored to achieve the following approximation: (2.14) The solution of (2.14) is (2.15) where $C_{1ji}$ is a constant. When D is large equation 2.10 may be simplified as follows:
where $ds$ is as in equation 2.9. We may simplify equation 2.16 still further by noting that
(2.17) We first use this relation to obtain $E_{initial}$, the energy at the initial state, (2.18) assuming the standard initial values $\tilde v^{(1)}_{ji} = \tilde v^{(2)}_{kj} = 0$. We can then use equations 2.18 and 2.17 to reduce our expression for $T_{momentum}$ to the following:
$$T_{momentum} = \frac{1}{\sqrt{D}} \int \frac{ds}{\sqrt{2\beta(E_{initial} - E)}} \qquad (2.19)$$
Thus we obtain the result that for the momentum method the network convergence time is proportional to $1/\sqrt{D}$, where D is the network duplication factor. Thus, as with the gradient descent method, dramatic speed-ups are available by employing network redundancy. Again, the same effect may be obtained for unduplicated networks by simply using a D-sharpened sigmoid function; however, with the momentum method great care must be taken to keep D small enough that the finite difference approximation still holds.
¹ where the $\bar W^{(1)}_{ji}$ have a normal distribution $p_D = N(0, W_a^2/3D)$. For small $W_a$ and large D, $p_D$ is approximately a delta function located at the average zero, so that the convergence time $T_{gradient}$ is not much affected by the distribution of initial values.
[Figure 1 plot; horizontal axis: $\log_{10} D$; legend: Theory]
Figure 1: The relationship between convergence epochs and D.
2.4 Experimental Results. Figure 1 shows experimental results illustrating how convergence speed is increased by network redundancy. This example shows the number of training set presentations ("epochs") required to learn an XOR problem as a function of the network duplication factor D. Learning was conducted using a momentum update method with $\alpha = 0.9$, $\beta = 0.1$, $W_a = 0.5$, and a convergence criterion of 10% error; each data point is based on the average of 10 to 100 different trials. The above theoretical result predicts a slope of -0.5 for this graph; the best fit to the data has a slope of -0.43, which is within experimental sampling error.
3 Feedback Neural Networks
3.1 Network Duplication and Equivalence. The energy function for feedback neural networks (Hopfield and Tank 1985) with D duplication is defined in a manner similar to that for feedforward networks:
where $T_{ij} = T_{ji}$ is the weight between neurons i and j in their original index, $I_i$ is the forced signal from the environment to neuron i in its original index, and $V_i^{(l)}$ is the output signal at the lth duplicate of the ith neuron as defined in the following two equations: (3.2) (3.3) where $u_i^{(l)}$ is the internal signal and $\tau$ is the decay factor. Given large $\tau$ and random, zero-average initial values of $u_i^{(l)}$, the duplicated network governing equations 3.1, 3.2, and 3.3 can be rewritten to obtain an equivalent unduplicated network as follows:
(3.5) (3.6) (3.7) (3.8) The equivalent unduplicated energy function is defined by equation 3.4, the dynamic behavior of the neurons is described by equation 3.5, and the transfer function at each neuron is given by equation 3.6. Examination of equation 3.5 reveals that this equivalent network has an updating "force" that is D times larger than the original unduplicated network, so we may expect that the duplicated network will have a faster rate of convergence.
3.2 Convergence Speed. The convergence time $T_{feedback}$ can be obtained as in the feedforward case: (3.9)
Given large $\tau$ and D, $T_{feedback}$ can be approximated by (3.10)
As with feedforward neural networks, the time integral is a constant given initial values, so that we obtain the result that network convergence time $T_{feedback}$ is proportional to $1/D$.
Figure 2: The relationship between convergence iterations and D.
3.3 Experimental Results. Figure 2 illustrates how network convergence is sped up by redundancy. In this figure the number of iterations required to obtain convergence is plotted for a traveling salesman problem with 10 cities as a function of network duplication D. In these examples $\tau = 10.0$, $dt = 0.01$, $T_{ij}$ and $I_i$ are randomly distributed over $\pm 1.0$, and $u_i^{(l)}$ are randomly distributed over $\pm 1/m$. Each data point is the average of 100 trials. The above theoretical result predicts a slope of -1.0; the best fit to the data has a slope of -0.8, which is within experimental sampling error.
3.4 Solution Accuracy. As the duplication factor D increases, the distribution of initial values $\bar u_i$ in the equivalent unduplicated network becomes progressively more narrow, as $\bar u_i$ is normally distributed with mean zero and variance $u_0^2/3D$, where $-u_0 < u_i^{(l)} < u_0$. Experimentally, it is known (Uesaka 1988; Abe 1989) that if initial values are concentrated near the center of the $u_i$'s range then better solutions will be obtained. Thus the fact that increasing D produces a narrowing of the distribution of $\bar u_i$ indicates that we may expect that increasing D will also produce increased solution accuracy. We have experimentally verified this expectation in the case of the traveling salesman problem.
3.5 Communication Accuracy. One major problem for analog implementations of neural networks is that great accuracy in interneuron communication (i.e., accuracy in specifying the weights) is required to reach an accurate solution. Network duplication reduces this problem by allowing statistical averaging of signals to achieve great overall accuracy with only low-accuracy communication links. For example, if the $u_i^{(l)}$ in a feedback network have a range of $\pm 128$ and uniform noise with a range of $\pm 4$, giving a communication accuracy of five bits, then for the averaged $\bar u_i$ the noise will have a standard deviation of only $\pm 2.3/\sqrt{D}$, giving roughly 10 bits of communication accuracy when D = 100.
4 Operation Mode
Given the advantages of redundancy demonstrated above it seems very desirable to employ large numbers of neurons; to accomplish this in a practical and timely manner requires a large degree of hardware parallelism, and the difficulty of synchronizing large numbers of processors makes asynchronous operation very attractive. It seems, therefore, that one consequence of a large degree of redundancy is the sort of asynchronous operation seen in biological systems. The computational effects of choosing a synchronous or asynchronous mode of operation have generally been regarded as negligible, although there are experimental reports of better performance using asynchronous update rules. In the following we mathematically analyze the performance of synchronous and asynchronous operation networks by proving that the operation of an asynchronous network is equivalent to that of a particular type of synchronous network whose update rule considers the Hessian of the energy function. We can therefore show that asynchronous operation will generally result in faster and more stable network convergence.
4.1 Equivalent Operation Mode. In the preceding discussion we assumed synchronous operation where the network state $W(T_j)$ is updated at each time $T_{j+1} = T_j + \Delta T$:
(4.1)
where
and E is the energy or error function. To describe asynchronous operation (and for simplicity we will consider only the gradient descent update rule) we further divide the time interval $\Delta T$ into K smaller steps $t_l$, such that $t_{l+1} = t_l + \Delta t$ where $\Delta t = \Delta T/K$, $t_0 = 0$ and K is large, thus obtaining the following update equations:
is now the time averaged update equation, which is related to the detailed behavior of the network by the relations
(4.4)
and
Equations 4.3 to 4.6 describe "microstate" updates that are conducted throughout each interval $T_j$ whenever the gradient at subinterval $t_l$ is available, that is, whenever one of the network's neurons fires.
4.2 Synchronous Equivalent to Asynchronous Operation. We first define the gradient and Laplacian of E at asynchronous times $t_l$ to be
and define that all subscripts of A, B, and t are taken to be modulo K. We will next note that the gradient at time $t_{l+1}$ can be obtained by using the gradient at time $t_l$ and Laplacian at times $t_1, \ldots, t_l$ as below: (4.8) (4.9)
(4.10)
Assuming that $\Delta T$ is small, and thus that $\Delta t$ is also small, then at time $T_j$ the K-time-step time-averaged gradient is (4.11) (4.12) (4.13) (4.14) Thus the time-averaged state update equation for an asynchronous network is (4.15) This update function is reminiscent of second-order update functions which take into account the curvature of the energy surface by employing the Hessian of the energy function (Becker and Cun 1988):
$$\frac{dw}{dT} = -\eta \left(I + \mu \nabla_w^2 E\right)^{-1} \nabla_w E \qquad (4.16)$$
where the identity matrix I is a "stabilizer" that improves performance when the Laplacian is small. Taking the first-order Taylor expansion of equation 4.16 about $\nabla_w^2 E = 0$, we obtain
$$\frac{dw}{dT} = -\eta \left(I - \mu \nabla_w^2 E\right) \nabla_w E \qquad (4.17)$$
and setting $\mu = \eta \Delta T/2$ we see that equations 4.15 and 4.16 are equivalent (given small $\Delta T$ so that the Taylor expansion is accurate). Thus equation 4.17 is a synchronous second-order update rule that is identical to the time-averaged asynchronous update rule of equation 4.15. The only assumption required to obtain this equivalence is that the time step $\Delta T$ is small enough that the approximations of equations 4.13 and 4.17 are valid. Investigating equation 4.17 reveals the source of the advantages enjoyed by asynchronous update rules. In the first stages of the convergence process (where the energy surface is normally concave upward) $\nabla_w^2 E$ is negative and thus larger updating steps are taken, speeding up the overall convergence rate. On the other hand, during the last stages of convergence (where the energy surface is concave downward) $\nabla_w^2 E$ is positive and thus smaller updating steps are taken, preventing undesired oscillations.
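The claimed correspondence between many small gradient sub-steps and a single second-order step with $\mu = \eta \Delta T/2$ can be checked numerically on a toy problem. The sketch below makes simplifying assumptions that are not in the text: the weight is a scalar, every sub-step uses the full current gradient (rather than the gradient of whichever neuron has just fired), and the function names are hypothetical.

```python
def substepped_update(w0, grad, eta, big_dt, k):
    """Apply k gradient sub-steps of length big_dt / k, re-evaluating the gradient each time."""
    w = w0
    dt = big_dt / k
    for _ in range(k):
        w -= eta * dt * grad(w)
    return w

def first_order_update(w0, grad, eta, big_dt):
    """One plain synchronous gradient step of length big_dt."""
    return w0 - eta * big_dt * grad(w0)

def second_order_update(w0, grad, hess, eta, big_dt):
    """One step of the form -eta * (1 - mu * hess) * grad with mu = eta * big_dt / 2."""
    mu = eta * big_dt / 2.0
    return w0 - eta * big_dt * (1.0 - mu * hess(w0)) * grad(w0)
```

Taking the quadratic energy $E = w^2$ (gradient `lambda w: 2.0 * w`, second derivative `lambda w: 2.0`), `eta = 1.0`, `big_dt = 0.1`, and `w0 = 1.0`, the sub-stepped result with `k = 1000` is much closer to the second-order step than to the plain first-order step. This is the behavior the Taylor-expansion argument predicts when $\Delta T$ is small.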
5 Conclusion
We have analyzed the effects of duplication, the simplest form of redundancy, on the performance of feedforward and feedback neural networks. We have been able to prove that D-duplicated networks are equivalent to unduplicated networks that have (1) a D-sharpened sigmoid as the transfer function, and (2) normally distributed initial weights. Further, we have been able to prove that the asynchronous operation of such networks using a gradient descent update rule is equivalent to using a synchronous second-order update rule that considers the Hessian of the energy function. By considering the properties of these equivalent unduplicated networks we have shown that the effects of increasing network redundancy are increased speed of convergence, increased solution accuracy, and the ability to use limited accuracy interneuron communication. We have also shown that the effects of asynchronous operation are faster and more stable network convergence. In light of these results it now appears that the asynchronous, highly redundant nature of biological systems is computationally important and not merely a side-effect of limited neuronal transmission speed and lifetime. One practical consequence of these results is that in computer simulations one can obtain most of the computational advantages of a duplicated network by simply sharpening the transfer function, initializing the weights with normally distributed values, and employing a second-order update rule.
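The averaging argument behind the limited-accuracy communication result (Section 3.5) lends itself to a quick numerical check. The sketch below is an illustration only: it assumes independent uniform noise on each of the D duplicated links, and the function names, as well as the use of a simple signal-to-noise ratio to count bits, are assumptions rather than definitions from the paper.

```python
import math
import random

def uniform_noise_std(half_range):
    """Standard deviation of noise uniformly distributed on [-half_range, half_range]."""
    return half_range / math.sqrt(3.0)

def averaged_noise_std(half_range, d):
    """Standard deviation after averaging d independent noisy copies."""
    return uniform_noise_std(half_range) / math.sqrt(d)

def simulated_averaged_noise_std(half_range, d, trials, rng):
    """Monte Carlo estimate of the same quantity."""
    means = [sum(rng.uniform(-half_range, half_range) for _ in range(d)) / d
             for _ in range(trials)]
    centre = sum(means) / trials
    return math.sqrt(sum((m - centre) ** 2 for m in means) / (trials - 1))

def effective_bits(signal_half_range, noise_level):
    """log2 of the ratio of signal range to noise level (a rough accuracy measure)."""
    return math.log2(signal_half_range / noise_level)
```

`uniform_noise_std(4.0)` is about 2.3, which matches the constant in the $2.3/\sqrt{D}$ expression, and `effective_bits(128.0, 4.0)` gives the five bits quoted for a single link. Applying the same ratio to the averaged standard deviation at D = 100 gives a little over nine bits, the same order as the rough figure quoted in Section 3.5; the exact count depends on how noise level is converted to bits.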
References
Abe, S. 1989. Theories on the Hopfield neural networks. Int. Joint Conf. Neural Networks, Washington, D.C., I557-I564.
Amari, S. 1978. Mathematics of Neural Networks, 1. Sangyo Tosyo, Tokyo, Japan (in Japanese).
Becker, S., and Cun, Y. L. 1988. Improving the convergence of back-propagation learning with second order methods. Proc. Connectionist Models Summer School, CMU, Pittsburgh, 29-37.
DARPA. 1988. Neural Network Study 31. AFCEA International Press, Fairfax, VA.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA.
Tanaka, H., Matsuda, S., Ogi, H., Izui, Y., Taoka, H., and Sakaguchi, T. 1988. Redundant coding for fault tolerant computing on Hopfield network. Abstr. First Annu. INNS Meeting, Boston, 141.
Uesaka, Y. 1988. On the Stability of Neural Network With the Energy Function Induced from a Real-Valued Function of Binary Variables. IEICE of Japan, Tech. Rep. PRU-88-6, 7-14 (in Japanese).
Received 26 December 1989; accepted 9 February 1990.
Communicated by Jack Cowan
Stability of the Random Neural Network Model
Erol Gelenbe
Ecole des Hautes Etudes en Informatique, Université René Descartes (Paris V), 45 rue des Saints-Pères, 75006 Paris, France
In a recent paper (Gelenbe 1989) we introduced a new neural network model, called the Random Network, in which "negative" or "positive" signals circulate, modeling inhibitory and excitatory signals. These signals can arrive either from other neurons or from the outside world: they are summed at the input of each neuron and constitute its signal potential. The state of each neuron in this model is its signal potential, while the network state is the vector of signal potentials at each neuron. If its potential is positive, a neuron fires, and sends out signals to the other neurons of the network or to the outside world. As it does so its signal potential is depleted. We have shown (Gelenbe 1989) that in the Markovian case, this model has product form, that is, the steady-state probability distribution of its potential vector is the product of the marginal probabilities of the potential at each neuron. The signal flow equations of the network, which describe the rate at which positive or negative signals arrive at each neuron, are nonlinear, so that the existence and uniqueness of their solution are not easily established except for the case of feedforward (or backpropagation) networks (Gelenbe 1989). In this paper we show that whenever the solution to these signal flow equations exists, it is unique. We then examine two subclasses of networks - balanced and damped networks - and obtain stability conditions in each case. In practical terms, these stability conditions guarantee that the unique solution can be found to the signal flow equations and therefore that the network has a well-defined steady-state behavior.
1 Introduction
We consider a network of $n$ neurons in which positive and negative signals circulate. Each neuron accumulates signals as they arrive, and can fire if its total signal count at a given instant of time is positive. Firing then occurs at random according to an exponential distribution of constant rate, and the neuron sends signals out to other neurons or to the outside of the network. In this model, each neuron $i$ of the network is represented at any time $t$ by its input signal potential $k_i(t)$, which we shall simply call the potential.
Neural Computation 2, 239-247 (1990) © 1990 Massachusetts Institute of Technology
Positive and negative signals have different roles in the network; positive signals represent excitation, while negative signals represent inhibition. A negative signal reduces by 1 the potential of the neuron at which it arrives (i.e., it "cancels" an existing signal) or has no effect on the signal potential if it is already zero, while an arriving positive signal adds 1 to the neuron potential. The potential at a neuron is constituted only by positive signals that have accumulated, have not yet been cancelled by negative signals, and have not yet been sent out by the neuron as it fires. Signals can either arrive at a neuron from the outside of the network (exogenous signals) or from other neurons. Each time a neuron fires, a signal leaves it, depleting the total input potential of the neuron. A signal that leaves neuron $i$ heads for neuron $j$ with probability $p^+(i,j)$ as a positive (or normal) signal, or as a negative signal with probability $p^-(i,j)$, or it departs from the network with probability $d(i)$. Let $p(i,j) = p^+(i,j) + p^-(i,j)$; it is the transition probability of a Markov chain representing the movement of signals between neurons. We shall assume that $p^+(i,i) = 0$ and $p^-(i,i) = 0$; though the former assumption is not essential, the latter is essential to our model. This assumption excludes the possibility of a neuron sending a signal directly to itself. Clearly we shall have
$\sum_j p(i,j) + d(i) = 1$ for $1 \le i \le n$
A neuron is capable of firing and emitting signals if its potential is strictly positive. We assume that exogenous signals arrive at each neuron in Poisson streams of positive or negative signals. In Gelenbe (1989) we show that the purely Markovian version of this network, with positive signals that arrive at the $i$th neuron according to a Poisson process of rate $\Lambda(i)$, negative signals that arrive at the $i$th neuron according to a Poisson process of rate $\lambda(i)$, iid exponential neuron firing times with rates $r(1), \ldots, r(n)$, and Markovian movements of signals between neurons, has a product form solution. That is, the network's stationary probability distribution can be written as the product of the marginal probabilities of the state of each neuron. Thus in steady state the network's neurons are seemingly independent, though they are in fact coupled via the signals that move from one neuron to the other in the network. The model we propose has associative memory capabilities, as we shall see in Example 1. It also has a certain number of interesting features:
1. It appears to represent more closely the manner in which signals are transmitted in a biophysical neural network where they travel as voltage spikes rather than as fixed signal levels.
2. It is computationally efficient in the feedforward case, and whenever network stability can be shown as for balanced and damped networks.
3. It is closely related to the connexionist model (Gelenbe 1989) and it is possible to go from one model to the other.
4. It represents neuron potential and therefore the level of excitation as an integer, rather than as a binary variable, which leads to more detailed information on system state.
As one may expect from previous models of neural networks (Kandel and Schwartz 1985), the signal flow equations that yield the rate of signal arrival and hence the rate of firing of each neuron in steady state are nonlinear. Thus in Gelenbe (1989) we were able to establish the existence of their solution (and also a method for computing it) only in the case of feedforward networks, that is, in networks where a signal cannot return eventually to a neuron that it has already visited either in negative or positive form. This of course covers the case of backpropagation networks (Kandel and Schwartz 1985). In this paper we deal with networks with feedback. We are able to establish uniqueness of solutions whenever existence can be shown. Then we show existence for two classes of networks: "balanced" and "damped" networks.
2 General Properties of the Random Network Model
The following property proved in Gelenbe (1989) states that the steady-state probability distribution of network state can always be expressed as the product of the probabilities of the states of each neuron. Thus the network in steady state is seemingly composed of independent neurons, though this is obviously not the case. Let $k(t)$ be the vector of signal potentials at time $t$, and $k = (k_1, \ldots, k_n)$ be a particular value of the vector; we are obviously interested in the quantity $p(k, t) = \mathrm{Prob}[k(t) = k]$. Let $p(k)$ denote the stationary probability distribution
$p(k) = \lim_{t \to \infty} \mathrm{Prob}[k(t) = k]$
if it exists.
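Theorem 1. (Gelenbe 1989) Let
$q_i = \lambda^+(i) / [r(i) + \lambda^-(i)]$ (2.1)
where the $\lambda^+(i)$, $\lambda^-(i)$ for $i = 1, \ldots, n$ satisfy the following system of nonlinear simultaneous equations:
$\lambda^+(i) = \sum_j q_j r(j) p^+(j,i) + \Lambda(i)$, $\lambda^-(i) = \sum_j q_j r(j) p^-(j,i) + \lambda(i)$ (2.2)
If a unique nonnegative solution $\{\lambda^+(i), \lambda^-(i)\}$ exists to equations 2.1 and 2.2 such that each $q_i < 1$, then:
$p(k) = \prod_{i=1}^{n} [1 - q_i] q_i^{k_i}$ (2.3)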
Corollary 1.1. The probability that neuron $i$ is firing in steady state is simply given by $q_i$ and the average neuron potential in steady state is simply $A_i = q_i/[1 - q_i]$.
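As a rough numerical illustration of Theorem 1 and Corollary 1.1, the sketch below iterates the signal flow equations in the form $q_i = [\sum_j q_j r(j) p^+(j,i) + \Lambda(i)] / [\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)]$ until the $q_i$ stop changing. The function names, the plain fixed-point iteration, and the two-neuron test network with mutual inhibition and unit rates are assumptions made for illustration; the iteration is not claimed to converge for every network.

```python
import math


def solve_signal_flow(Lam, lam, r, p_plus, p_minus, tol=1e-12, max_iter=10_000):
    """Plain fixed-point iteration on the signal flow equations (illustrative only).

    Lam[i], lam[i]: exogenous positive / negative arrival rates at neuron i.
    r[i]: firing rate of neuron i.
    p_plus[j][i], p_minus[j][i]: probability that a signal leaving neuron j
    reaches neuron i as a positive / negative signal.
    """
    n = len(r)
    q = [0.0] * n
    for _ in range(max_iter):
        q_new = []
        for i in range(n):
            lam_plus = Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(n))
            lam_minus = lam[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(n))
            q_new.append(lam_plus / (r[i] + lam_minus))
        if max(abs(a - b) for a, b in zip(q, q_new)) < tol:
            return q_new
        q = q_new
    raise RuntimeError("fixed-point iteration did not converge")


def mean_potential(q_i):
    """Average potential q_i / (1 - q_i) of Corollary 1.1; requires q_i < 1."""
    if not 0.0 <= q_i < 1.0:
        raise ValueError("the product form requires 0 <= q_i < 1")
    return q_i / (1.0 - q_i)


# Hypothetical test network: two neurons that only inhibit each other.
q = solve_signal_flow(
    Lam=[1.0, 1.0], lam=[0.0, 0.0], r=[1.0, 1.0],
    p_plus=[[0.0, 0.0], [0.0, 0.0]],
    p_minus=[[0.0, 1.0], [1.0, 0.0]],
)
potentials = [mean_potential(qi) for qi in q]
```

For this symmetric network the fixed point satisfies $q = 1/(1 + q)$, so both $q_i$ should approach $(\sqrt{5} - 1)/2$.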
By Theorem 1 we are guaranteed a stationary solution of product form provided the nonlinear signal flow equations have a nonnegative solution. The following result guarantees uniqueness in general.
Theorem 2. If the solutions to equations 2.1 and 2.2 exist with $q_i < 1$, then they are unique.
Proof. Since $\{k(t) : t \ge 0\}$ is an irreducible and aperiodic Markov chain (Gelenbe and Pujolle 1986), if a positive stationary solution $p(k)$ exists, then it is unique. By Theorem 1, if the $0 < q_i < 1$ solution to equations 2.1 and 2.2 exists for $i = 1, \ldots, n$, then $p(k)$ is given by equation 2.3 and is clearly positive for all $k$. Suppose now that for some $i$ there are two different $q_i$, $q_i'$ satisfying equations 2.1 and 2.2. But this implies that $\lim_{t \to \infty} P[k_i(t) = 0]$ has two different values $[1 - q_i]$ and $[1 - q_i']$, which contradicts the uniqueness of $p(k)$; hence the result.
We say that a network is feedforward if for any sequence $i_1, \ldots, i_s, \ldots, i_r, \ldots, i_m$ of neurons, $i_s = i_r$ for $r > s$ implies
$\prod_{v=1}^{m-1} p(i_v, i_{v+1}) = 0$
Theorem 3. (Gelenbe 1989) If the network is feedforward, then the solutions $\lambda^+(i)$, $\lambda^-(i)$ to equations 2.1 and 2.2 exist and are unique.
The main purpose of this paper is to extend the class of networks for which existence of solutions to equations 2.1 and 2.2 is established. We shall deal with balanced networks and with damped networks.
Example 1. [A feedback network with associative memory for (0,0), (1,0), (0,1)] The system is composed of two neurons shown in Figure 1. Each neuron receives flows of positive signals of rates $\Lambda(1)$, $\Lambda(2)$ into neurons 1 and 2. A signal leaving neuron 1 enters neuron 2 as a negative signal and a signal leaving neuron 2 enters neuron 1 as a negative signal, that is, $p^-(1,2) = p^-(2,1) = 1$. The network is an example of a "damped" network discussed in Section 4.
Figure 1: The neural network with positive and negative signals examined in Example 1.
where vectors $k$ with negative elements are to be ignored and $1[X]$ takes the value 1 if $X$ is true and 0 otherwise. According to Theorems 1 and 2 the unique solution to these equations, if it exists, is
$p(k) = (1 - u)(1 - v) u^{k_1} v^{k_2}$
if $u < 1$, $v < 1$, where $u = \Lambda(1)/[r(1) + \lambda^-(1)]$, $v = \Lambda(2)/[r(2) + \lambda^-(2)]$, with $\lambda^-(2) = u r(1)$, $\lambda^-(1) = v r(2)$. Since $u$, $v$ are solutions to two simultaneous second degree equations, existence problems for this example are simple. For instance when $\Lambda(1) = \Lambda(2) = r(1) = r(2) = 1$ we obtain $\lambda^-(1) = \lambda^-(2) = 0.5[5^{1/2} - 1]$, so that $u$, $v$ in the expression for $p(k)$ become $u = v = 2/[1 + 5^{1/2}] = 0.618$, so that the average potential at each neuron is $A_1 = A_2 = 1.618$. If we set $\Lambda(1) = 1$, $\Lambda(2) = 0$, with the same values of $r(1) = r(2) = 1$, we see that neuron 1 saturates (its average potential becomes infinite), while the second neuron's input potential is zero, and vice versa if we set $\Lambda(1) = 0$, $\Lambda(2) = 1$. Thus this network recognizes the inputs (0,1) and (1,0).
3 Balanced Neural Networks
We now consider a class of networks whose signal flow equations have a particularly simple solution. We shall say that a network with negative and positive signals is balanced if the ratio
$S_i = [\sum_j q_j r(j) p^+(j,i) + \Lambda(i)] / [\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)]$
is identical for any $i = 1, \ldots, n$. This in effect means that all the $q_i$ are identical.
Theorem 4. The signal flow equations 2.1 and 2.2 have a (unique) solution if the network is balanced.
Proof. From equations 2.1 and 2.2 we write
$q_i = [\sum_j q_j r(j) p^+(j,i) + \Lambda(i)] / [\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)]$ (3.1)
If the system is balanced, $q_i = q_j$ for all $i, j$. From equation 3.1 we then have that the common $q = q_i$ satisfies the quadratic equation:
$q^2 R^-(i) + q[\lambda(i) + r(i) - R^+(i)] - \Lambda(i) = 0$ (3.2)
where $R^-(i) = \sum_j r(j) p^-(j,i)$, $R^+(i) = \sum_j r(j) p^+(j,i)$. The positive root of this quadratic equation, which will be independent of $i$, is the solution of interest:
$q = \{(R^+(i) - \lambda(i) - r(i)) + [(R^+(i) - \lambda(i) - r(i))^2 + 4 R^-(i) \Lambda(i)]^{1/2}\} / 2R^-(i)$
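The positive root of the quadratic in $q$ can be evaluated directly. The following is a minimal sketch, assuming per-neuron quantities $\Lambda(i)$, $\lambda(i)$, $r(i)$, $R^+(i)$, $R^-(i)$ are already known; the function name `balanced_q`, its argument names, and the handling of the degenerate case $R^-(i) = 0$ (where the quadratic collapses to a linear equation) are illustrative choices, not part of the original derivation.

```python
import math


def balanced_q(Lam_i, lam_i, r_i, R_plus_i, R_minus_i):
    """Positive root of q^2 R^- + q (lam + r - R^+) - Lam = 0 (illustrative)."""
    b = lam_i + r_i - R_plus_i
    if R_minus_i == 0.0:
        # No inhibitory feedback: the equation is linear, q * b = Lam.
        if b <= 0.0:
            raise ValueError("no positive solution in the linear case")
        return Lam_i / b
    disc = b * b + 4.0 * R_minus_i * Lam_i
    return (-b + math.sqrt(disc)) / (2.0 * R_minus_i)


def quadratic_residual(q, Lam_i, lam_i, r_i, R_plus_i, R_minus_i):
    """Left-hand side of the quadratic, which should vanish at the root."""
    return q * q * R_minus_i + q * (lam_i + r_i - R_plus_i) - Lam_i


# Hypothetical symmetric case: unit exogenous excitation, unit firing rate,
# purely inhibitory internal signals (R^+ = 0, R^- = 1).
q_bal = balanced_q(Lam_i=1.0, lam_i=0.0, r_i=1.0, R_plus_i=0.0, R_minus_i=1.0)
res = quadratic_residual(q_bal, 1.0, 0.0, 1.0, 0.0, 1.0)
```

With these assumed values the root is $(\sqrt{5} - 1)/2$, which is below 1 as Theorem 1 requires.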
4 Damped Networks
We shall say that a random neural network is damped if $p^+(j,i) \ge 0$, $p^-(j,i) \ge 0$ with the following property:
$r(i) + \lambda(i) > \Lambda(i) + \sum_j r(j) p^+(j,i)$, for all $i = 1, \ldots, n$
Though this may seem to be a strong condition, Example 1 shows that it is of interest. A special class of damped networks that seems to crop up in various examples is those in which all internal signals are inhibitive (such as Example 1) so that p + ( j ,i ) = 0.
Theorem 5. If the network is damped then the signal flow equations 2.1 and 2.2 always have a solution with $q_i < 1$, which is unique by Theorem 2.
Proof. The proof uses a method developed for nonlinear equilibrium equations. It is based on the construction of an $n$-dimensional vector homotopy function $H(q, x)$ for a real number $0 \le x < 1$. Let us define the following $n$-vectors:
$q = (q_1, \ldots, q_n)$, $F(q) = [F_1(q), \ldots, F_n(q)]$
where
$F_i(q) = [\sum_j q_j r(j) p^+(j,i) + \Lambda(i)] / [\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)]$
Clearly, the equation we are interested in is $q = F(q)$, which, when it has a solution in $D = [0, 1]^n$, yields the values $q_i$ of Theorem 1. Notice that $F(q): R^n \to R^n$. Notice also that $F(q) \in C^2$. Consider the mappings $F(q): D \to R^n$. We are interested in the interior points of $D$ since we seek solutions $0 < q_i < 1$. Write $D = D^0 \cup \delta D$ where $\delta D$ is the boundary of $D$, and $D^0$ is the set of interior points. Let $y = (y_1, \ldots, y_n)$ where
$y_i = [\sum_j r(j) p^+(j,i) + \Lambda(i)] / [\lambda(i) + r(i)]$
By assumption $y_i < 1$ for all $i = 1, \ldots, n$. Now define
$H(q, x) = (1 - x)(q - y) + x[q - F(q)]$, $0 \le x < 1$.
Clearly $H(q, 0) = q - y$ and $H(q, 1) = q - F(q)$. Consider
$H^{-1} = \{q : q \in D, H(q, x) = 0 \text{ and } 0 \le x < 1\}$
We can show that $H^{-1}$ and $\delta D$ have an empty intersection, that is, as $x$ varies from 0 to 1 the solution of $H(q, x) = 0$, if it exists, does not touch the boundary of $D$. To prove this assume the contrary; this implies that for some $x = x^*$ there exists some $q = q^*$ for which $H(q^*, x^*) = 0$ and such that $q_i^* = 0$ or 1 for some $i$. If $q_i^* = 0$ we can write
$-(1 - x^*) y_i - x^* F_i(q^*) = 0$, or $x^*/(1 - x^*) = -y_i/F_i(q^*) < 0 \Rightarrow x^* < 0$
which contradicts the assumption about $x$. If on the other hand $q_i^* = 1$, then we can write
$-(1 - x^*)(1 - y_i) - x^*[1 - F_i(q^*)] = 0$, or $x^*/(1 - x^*) = -(1 - y_i)/[1 - F_i(q^*)] < 0 \Rightarrow x^* < 0$
because $(1 - y_i) > 0$ and $0 < F_i(q^*) < y_i$ so that $[1 - F_i(q^*)] > 0$, contradicting again the assumption about $x$. Thus $H(q, x) = 0$ cannot have a solution on the boundary $\delta D$ for any $0 \le x < 1$. As a consequence, applying Theorem 3.2.1 of Garcia and Zangwill (1981) (which is a Leray-Schauder form of the fixed-point theorem), it follows that $F(q) = q$ has at least one solution in $D^0$; it is unique as a consequence of Theorem 2.
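The homotopy used in the proof of Theorem 5 can also be followed numerically. The sketch below checks the damped inequality, starts at $q = y$ (the zero of $H(q, 0)$), and moves $x$ toward 1 in small steps, re-solving $H(q, x) = 0$ at each step through the equivalent form $q = (1 - x) y + x F(q)$. The inner fixed-point iteration, the step counts, and the two-neuron example with exogenous rates of 0.5 are assumptions for illustration; this is not the path-following procedure of Garcia and Zangwill, and the inner iteration is not guaranteed to converge for every damped network.

```python
import math


def F(q, Lam, lam, r, p_plus, p_minus):
    """Right-hand side of the signal flow equations, q -> F(q)."""
    n = len(q)
    return [
        (Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(n)))
        / (lam[i] + r[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(n)))
        for i in range(n)
    ]


def is_damped(Lam, lam, r, p_plus):
    """Check r(i) + lam(i) > Lam(i) + sum_j r(j) p^+(j, i) for every neuron i."""
    n = len(r)
    return all(
        r[i] + lam[i] > Lam[i] + sum(r[j] * p_plus[j][i] for j in range(n))
        for i in range(n)
    )


def follow_homotopy(Lam, lam, r, p_plus, p_minus, steps=100, inner=200):
    """Track the zero of H(q, x) = (1 - x)(q - y) + x (q - F(q)) from x = 0 to 1."""
    if not is_damped(Lam, lam, r, p_plus):
        raise ValueError("network does not satisfy the damped condition")
    n = len(r)
    y = [
        (Lam[i] + sum(r[j] * p_plus[j][i] for j in range(n))) / (lam[i] + r[i])
        for i in range(n)
    ]
    q = list(y)  # H(q, 0) = q - y vanishes at q = y
    for s in range(1, steps + 1):
        x = s / steps
        for _ in range(inner):  # assumed inner solver: simple fixed-point sweeps
            Fq = F(q, Lam, lam, r, p_plus, p_minus)
            q = [(1.0 - x) * y[i] + x * Fq[i] for i in range(n)]
    return q


# Hypothetical damped network: two mutually inhibiting neurons.
zeros = [[0.0, 0.0], [0.0, 0.0]]
inhib = [[0.0, 1.0], [1.0, 0.0]]
q_damped = follow_homotopy([0.5, 0.5], [0.0, 0.0], [1.0, 1.0], zeros, inhib)
```

For this assumed example the limit satisfies $q = 0.5/(1 + q)$, so each component should approach $(\sqrt{3} - 1)/2$, an interior point of $D$.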
5 Conclusions
We pursue the study of a new type of neural network model, which we had introduced previously, and which had been shown to have a product form steady-state solution (Gelenbe 1989). The nonlinear signal flow equations of the model concerning the negative and positive signals that circulate in the network have been shown to have a unique solution when the network has a feedforward structure (Gelenbe 1989) that is equivalent to backpropagation networks (Rumelhart et al. 1986). In this paper we present new results for networks with feedback, and begin with a simple example that exhibits associative memory capabilities. The key theoretical issue that has to be resolved each time the random neural network with feedback is used is the existence of a solution to the signal flow equations; we therefore first show that these equations have a unique solution whenever the solution exists. We then study two classes of networks that have feedback, balanced networks and damped networks, and obtain conditions for the existence of the solution in each case.
Acknowledgments
The author gratefully acknowledges the hospitality of the Operations Research Department at Stanford University, where this work was carried out during July and August of 1989, and the friendly atmosphere of Dave Rumelhart's seminar in the Psychology Department at Stanford where this research was presented and discussed during the summer of 1989. The work was sponsored by C3-CNRS.
References
Garcia, C. D., and Zangwill, W. I. 1981. Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall, Englewood Cliffs, NJ.
Gelenbe, E. 1989. Random neural networks with negative and positive signals and product form solution. Neural Comp. 1, 502-510.
Gelenbe, E., and Pujolle, G. 1986. Introduction to Networks of Queues. Wiley, Chichester and New York.
Kandel, E. R., and Schwartz, J. H. 1985. Principles of Neural Science. Elsevier, Amsterdam.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Vols. I and II. Bradford Books and MIT Press, Cambridge, MA.
Received 2 November 1989; accepted 9 February 1990.
Communicated by Les Valiant
The Perceptron Algorithm is Fast for Nonmalicious Distributions Eric B. Baum NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA
Within the context of Valiant's protocol for learning, the perceptron algorithm is shown to learn an arbitrary half-space in time $O(n^2/\epsilon^3)$ if $D$, the probability distribution of examples, is taken uniform over the unit sphere $S^n$. Here $\epsilon$ is the accuracy parameter. This is surprisingly fast, as "standard" approaches involve solution of a linear programming problem involving $O(n/\epsilon)$ constraints in $n$ dimensions. A modification of Valiant's distribution-independent protocol for learning is proposed in which the distribution and the function to be learned may be chosen by adversaries; however, these adversaries may not communicate. It is argued that this definition is more reasonable and applicable to real world learning than Valiant's. Under this definition, the perceptron algorithm is shown to be a distribution-independent learning algorithm. In an appendix we show that, for uniform distributions, some classes of infinite V-C dimension including convex sets and a class of nested differences of convex sets are learnable.
1 Introduction
The perceptron algorithm was proved in the early 1960s (Rosenblatt 1962) to converge and yield a half space separating any set of linearly separable classified examples. Interest in this algorithm waned in the 1970s after it was emphasized (Minsky and Papert 1969) (1) that the class of problems solvable by a single half space was limited, and (2) that the perceptron algorithm, although converging in finite time, did not converge in polynomial time. In the 1980s, however, it has become evident that there is no hope of providing a learning algorithm that can learn arbitrary functions in polynomial time and much research has thus been restricted to algorithms that learn a function drawn from a particular class of functions. Moreover, learning theory has focused on protocols like that of Valiant (1984), where we seek to classify, not a fixed set of examples, but examples drawn from a probability distribution. This allows a natural notion of "generalization." There are very few classes that have yet been proven learnable in polynomial time, and one of these is the class of half spaces. Thus, there is considerable theoretical interest now in studying the problem of learning a single half space, and so it is natural to reexamine the perceptron algorithm within the formalism of Valiant.
In Valiant's protocol, a class of functions is called learnable if there is a learning algorithm that works in polynomial time independent of the distribution $D$ generating the examples. Under this definition the perceptron learning algorithm is not a polynomial time learning algorithm. However we will argue in Section 2 that this definition is too restrictive. We will consider in Section 3 the behavior of the perceptron algorithm if $D$ is taken to be the uniform distribution on the unit sphere $S^n$. In this case, we will see that the perceptron algorithm converges remarkably rapidly. Indeed we will give a time bound that is faster than any bound known to us for any algorithm solving this problem. Then, in Section 4, we will present what we believe to be a more natural definition of distribution-independent learning in this context, which we will call nonmalicious distribution-independent learning. We will see that the perceptron algorithm is indeed a polynomial time nonmalicious distribution-independent learning algorithm. In Appendix A, we sketch proofs that, if one restricts attention to the uniform distribution, some classes with infinite Vapnik-Chervonenkis dimension such as the class of convex sets and the class of nested differences of convex sets (which we define) are learnable. These results support our assertion that distribution independence is too much to ask for, and may also be of independent interest.
Neural Computation 2, 248-260 (1990) © 1990 Massachusetts Institute of Technology
2 Distribution-Independent Learning
In Valiant's protocol (Valiant 1984), a class $F$ of Boolean functions on $R^n$ is called learnable if a learning algorithm $A$ exists that satisfies the following conditions. Pick some probability distribution $D$ on $R^n$. $A$ is allowed to call examples, which are pairs $[x, f(x)]$, where $x$ is drawn according to the distribution $D$. $A$ is a valid learning algorithm for $F$ if for any probability distribution $D$ on $R^n$, for any $0 < \delta, \epsilon < 1$, for any $f \in F$, $A$ calls examples and, with probability at least $1 - \delta$, outputs in time bounded by a polynomial in $n$, $\delta^{-1}$, and $\epsilon^{-1}$ a hypothesis $g$ such that the probability that $f(x) \ne g(x)$ is less than $\epsilon$ for $x$ drawn according to $D$. This protocol includes a natural formalization of "generalization" as prediction. For more discussion see Valiant (1984). The definition is restrictive in demanding that $A$ work for an arbitrary probability distribution $D$. This demand is suggested by results on uniform convergence of the empirical distribution to the actual distribution. In particular, if $F$ has Vapnik-Chervonenkis (V-C) dimension¹ $d$, then it has been proved (Blumer et al. 1987) that all $A$ needs to do to be a valid learning algorithm is to call
$M_0(\epsilon, \delta, d) = \max\left(\frac{4}{\epsilon} \log \frac{2}{\delta}, \frac{8d}{\epsilon} \log \frac{13}{\epsilon}\right)$
examples and to find in polynomial time a function $g \in F$ that correctly classifies these. Thus, for example, it is simple to show that the class $H$ of half spaces is Valiant learnable (Blumer et al. 1987). The V-C dimension of $H$ is $n + 1$. All we need to do to learn $H$ is to call $M_0(\epsilon, \delta, n + 1)$ examples and find a separating half space using Karmarkar's algorithm (Karmarkar 1984). Note that the perceptron algorithm would not work here, since one can readily find distributions for which the perceptron algorithm would be expected to take arbitrarily long times to find a separating half space.
¹ We say a set $S \subset R^n$ is shattered by a class $F$ of Boolean functions if $F$ induces all Boolean functions on $S$. The V-C dimension of $F$ is the cardinality of the largest set $S$ that $F$ shatters.
Now, however, it seems from four points of view that the distribution-independent definition is too strong. First, although the results of Blumer et al. (1987) tell us we can gather enough information for learning in polynomial time, they say nothing about when we can actually find an algorithm $A$ that learns in polynomial time. So far, such algorithms have been found in only a few cases, and (see, e.g., Baum 1990) these cases may be argued to be trivial.
Second, a few classes of functions have been proved (modulo strong but plausible complexity theoretic hypotheses) unlearnable by construction of cryptographically secure subclasses. Thus, for example, Kearns and Valiant (1988) show that the class of feedforward networks of threshold gates of some constant depth, or of Boolean gates of logarithmic depth, is not learnable by construction of a cryptographically secure subclass. The relevance of such results to learning in the natural world is unclear to us. For example, these results do not rule out a learning algorithm that would learn almost any log depth net. We would thus prefer a less restrictive definition of learnability, so that if a class were proved unlearnable, it would provide a meaningful limit on pragmatic learning.
Third, the results of Blumer et al. (1987) imply that we can expect to learn a class of functions $F$ only if $F$ has finite V-C dimension. Thus, we are in the position of assuming an enormous amount of information about the class of functions to be learned - namely that it be some specific class of finite V-C dimension - but nothing whatever about the distribution of examples. In the real world, by contrast, we are likely to know at least as much about the distribution $D$ as we know about the class of functions $F$. If we relax the distribution-independence criterion, then it can be shown that classes of infinite Vapnik-Chervonenkis dimension are learnable. For example, for the uniform distribution, the class of convex sets and a class of nested differences of convex sets (both of which trivially have infinite V-C dimension) are shown to be learnable in Appendix A.
Fourth, Schapire (1989) has recently given a procedure by which any algorithm $A$ that can be proved to learn well enough to correctly classify better than 50% of test examples in the distribution-free framework can be transformed to an algorithm $B$ achieving arbitrary accuracy $\epsilon$. The procedure works by applying $A$ to successively harder distributions composed of higher fractions of the examples $A$ had previously gotten wrong. Since $A$ by hypothesis works for all distributions, it is still able to achieve better than 50% accuracy, even when confronted with many examples it had previously found difficult, and thus rapidly corrects its mistakes. Schapire's theorem has rightly been hailed as ingenious, and his methods may plausibly find application. I would argue, however, that by exploiting the distribution-independent strength $A$ is hypothesized to have, Schapire's method has also produced a reductio ad absurdum that highlights the fact that distribution-independent learning requires more power than seems likely in real-world arenas. [This paragraph added in proof.]
3 The Perceptron Algorithm and Uniform Distributions
The perceptron algorithm yields, in finite time, a half-space $(w_H, \theta_H)$ that correctly classifies any given set of linearly separable examples (Rosenblatt 1962). That is, given a set of classified examples $\{x_\pm^\mu\}$ such that, for some $(w_t, \theta_t)$, $w_t \cdot x_+^\mu > \theta_t$ and $w_t \cdot x_-^\mu < \theta_t$ for all $\mu$, the algorithm converges in finite time to output a $(w_H, \theta_H)$ such that $w_H \cdot x_+^\mu \ge \theta_H$ and $w_H \cdot x_-^\mu < \theta_H$. We will normalize so that $w_t \cdot w_t = 1$. Note that $|w_t \cdot x - \theta_t|$ is the Euclidean distance from $x$ to the separating hyperplane $\{y : w_t \cdot y = \theta_t\}$. The algorithm is the following. Start with some initial candidate $(w_0, \theta_0)$, which we will take to be $(0, 0)$. Cycle through the examples. For each example, test whether that example is correctly classified. If so, proceed to the next example. If not, modify the candidate by
$(w_{k+1} = w_k \pm x_\pm^\mu, \; \theta_{k+1} = \theta_k \mp 1)$ (3.1)
where the sign of the modification is determined by the classification of the misclassified example. In this section we will apply the perceptron algorithm to the problem of learning in the probabilistic context described in Section 2, where, however, the distribution $D$ generating examples is uniform on the unit sphere $S^n$. Rather than have a fixed set of examples, we apply the algorithm in a slightly novel way: we call an example, perform a perceptron update step, discard the example, and iterate until we converge to accuracy $\epsilon$.² If we applied the perceptron algorithm in the standard way, it seemingly would not converge as rapidly. We will return to this point at the end of this section.
² We say that our candidate half space has accuracy $\epsilon$ when the probability that it misclassifies an example drawn from $D$ is no greater than $\epsilon$.
Now the number of updates the perceptron algorithm must make to learn a given set of examples is well known to be $O(1/l^2)$, where $l$ is the minimum distance from an example to the classifying hyperplane (see e.g., Minsky and Papert 1969). In order to learn to $\epsilon$ accuracy in the sense of Valiant, we will observe that for the uniform distribution we do not need to correctly classify examples closer to the target separating hyperplane than $\Omega(\epsilon/\sqrt{n})$. Thus we will prove that the perceptron algorithm will converge (with probability $1 - \delta$) after $O(n/\epsilon^2)$ updates, which will occur after $O(n/\epsilon^3)$ presentations of examples.
Indeed take $\theta_t = 0$ so the target hyperplane passes through the origin. Parallel hyperplanes a distance $\kappa/2$ above and below the target hyperplane bound a band $B$ of probability measure
$\frac{A_{n-1}}{A_n} \int_{-\kappa/2}^{\kappa/2} (\sqrt{1 - z^2})^{n-2} \, dz$ (3.2)
(for $n \ge 2$), where $A_n = 2\pi^{(n+1)/2}/\Gamma[(n+1)/2]$ is the area of $S^n$ (see Fig. 1). Using the readily obtainable (e.g., by Stirling's formula) bound that $A_{n-1}/A_n < \sqrt{n}$, and the fact that the integrand is nowhere greater than 1, we find that for $\kappa = \epsilon/2\sqrt{n}$, the band has measure less than $\epsilon/2$. If $\theta_t \ne 0$, a band of width $\kappa$ will have less measure than it would for $\theta_t = 0$. We will thus continue to argue (without loss of generality) by assuming the worst case condition that $\theta_t = 0$.
Since $B$ has measure less than $\epsilon/2$, if we have not yet converged to accuracy $\epsilon$, there is no more than probability 1/2 that the next example on which we update will be in $B$. We will show that once we have made
$m_0 = \max\left(144 \ln \frac{2}{\delta}, \frac{48}{\kappa^2}\right)$
updates, we have converged unless more than 7/12 of the updates are in $B$. The probability of making this fraction of the updates in $B$, however, is less than $\delta/2$ if the probability of each update lying in $B$ is not more than 1/2. We conclude with confidence $1 - \delta/2$ that the probability our next update will be in $B$ is greater than 1/2 and thus that we have converged to $\epsilon$-accuracy.
Indeed, consider the change in the quantity
$N(\alpha) = \|\alpha w_t - w_k\|^2 + \|\alpha \theta_t - \theta_k\|^2$ (3.3)
when we update.
$\Delta N = \|\alpha w_t - w_{k+1}\|^2 + \|\alpha \theta_t - \theta_{k+1}\|^2 - \|\alpha w_t - w_k\|^2 - \|\alpha \theta_t - \theta_k\|^2 = \mp 2\alpha w_t \cdot x_\pm \pm 2\alpha \theta_t + \|x\|^2 \pm 2 w_k \cdot x_\pm \mp 2\theta_k + 1$ (3.4)
Now note that $\pm(w_k \cdot x_\pm - \theta_k) < 0$ since $x$ was misclassified by $(w_k, \theta_k)$ (else we would not update). Let $A = [\mp(w_t \cdot x_\pm - \theta_t)]$. If $x \in B$, then $A \le 0$. If $x \notin B$, then $A \le -\kappa/2$. Recalling $x^2 = 1$, we see that $\Delta N < 2$ for $x \in B$ and $\Delta N < -\alpha\kappa + 2$ for $x \notin B$. If we choose $\alpha = 8/\kappa$, we find that $\Delta N \le -6$ for $x \notin B$. Recall that, for $k = 0$, with $(w_0, \theta_0) = (0, 0)$, we have $N = \alpha^2 = 64/\kappa^2$. Thus we see that if we have made $O$ updates on points outside $B$, and $I$ updates on points in $B$, $N < 0$ if $6O - 2I > 64/\kappa^2$. But $N$ is positive semidefinite. Once we have made $48/\kappa^2$ total updates, at least 7/12 of the updates must thus have been on examples in $B$.
Figure 1: The target hyperplane intersects the sphere $S^n$ along its equator (if $\theta_t = 0$) shown as the central line. Points in (say) the upper hemisphere are classified as positive examples and those in the lower as negative examples. The band $B$ is formed by intersecting the sphere with two planes parallel to the target hyperplane and a distance $\kappa/2$ above and below it.
If we assume that the probability of updates falling in $B$ is less than 1/2 (and thus that our hypothesis half space is not yet at $\epsilon$-accuracy), then the probability that more than 7/12 of
$m_0 = \max\left(144 \ln \frac{2}{\delta}, \frac{48}{\kappa^2}\right)$
updates fall in $B$ is less than $\delta/2$. To see this define $LE(p, m, r)$ as the probability of having at most $r$ successes in $m$ independent Bernoulli trials with probability of success $p$ and recall (Angluin and Valiant 1979), for $0 \le \beta \le 1$, that
$LE[p, m, (1 - \beta)mp] \le e^{-\beta^2 mp/2}$ (3.5)
Applying this formula with $m = m_0$, $p = 1/2$, $\beta = 1/6$ shows the desired result, since $e^{-\beta^2 m_0 p/2} = e^{-m_0/144} \le \delta/2$ whenever $m_0 \ge 144 \ln(2/\delta)$. We conclude that the probability of making $m_0$ updates without converging to $\epsilon$-accuracy is less than $\delta/2$. However, as it approaches $1 - \epsilon$ accuracy, the algorithm will update only on a fraction $\epsilon$ of the examples. To get, with confidence $1 - \delta/2$, $m_0$ updates, it suffices to call $M = 2m_0/\epsilon$ examples. Thus, we see that the perceptron algorithm converges, with confidence $1 - \delta$, after we have called
$M = \frac{2}{\epsilon} \max\left(144 \ln \frac{2}{\delta}, \frac{48n}{\epsilon^2}\right)$ (3.6)
examples. Each example could be processed in time of order 1 on a "neuron," which computes $w_k \cdot x$ in time 1 and updates each of its "synaptic weights" in parallel. On a serial computer, however, processing each example will take time of order $n$, so that we have a time of order $O(n^2/\epsilon^3)$ for convergence on a serial computer.
This is remarkably fast. The general learning procedure, described in Section 2, is to call $M_0(\epsilon, \delta, n + 1)$ examples and find a separating halfspace, by some polynomial time algorithm for linear programming such as Karmarkar's algorithm. This linear programming problem thus contains $\Omega(n/\epsilon)$ constraints in $n$ dimensions. Even to write down the problem thus takes time $\Omega(n^2/\epsilon)$. The upper time bound to solve this given by Karmarkar (1984) is $O(n^{5.5}\epsilon^{-2})$. For large $n$ the perceptron algorithm is faster by a factor of $n^{3.5}$. Of course it is likely that Karmarkar's algorithm could be proved to work faster than $\Omega(n^{5.5})$ for the particular distribution of examples of interest. If, however, Karmarkar's algorithm requires a number of iterations depending even logarithmically on $n$, it will scale worse (for large $n$) than the perceptron algorithm.³
Notice also that if we simply called $M_0(\epsilon, \delta, n + 1)$ examples and used the perceptron algorithm, in the traditional way, to find a linear separator for this set of examples, our time performance would not be nearly as good. In fact, equation 3.2 tells us that we would expect one of these examples to be a distance $O(\epsilon/n^{1.5})$ from the target hyperplane, since we are calling $\Omega(n/\epsilon)$ examples and a band of width $O(\epsilon/n^{1.5})$ has measure $\Omega(\epsilon/n)$. Thus, this approach would take time $\Omega(n^4/\epsilon^3)$, or a factor of $n^2$ worse than the one we have proposed. An alternative approach to learning using only $O(n/\epsilon)$ examples would be to call $M_0(\epsilon/4, \delta, n + 1)$ examples and apply the perceptron algorithm to these until a fraction $1 - \epsilon/2$ had been correctly classified. This would suffice to assure that the hypothesis half space so generated would (with confidence $1 - \delta$) have error less than $\epsilon$, as is seen from Blumer et al. (1987, Theorem A3.3). It is unclear to us what time performance this procedure would yield.
³ We thank P. Vaidya for a discussion on this point.
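The call-update-discard procedure analysed in this section can be sketched in a few lines. The code below is a minimal simulation under stated assumptions: examples are drawn uniformly from the unit sphere by normalizing Gaussian vectors, the target half space passes through the origin ($\theta_t = 0$), and the names `sample_sphere`, `online_perceptron`, and `error_rate`, the ambient dimension 5, the sample sizes, and the random seed are all illustrative choices rather than anything fixed by the text. The error is only estimated on fresh samples, so the printed-style quantities are empirical, not the bounds derived above.

```python
import math
import random


def sample_sphere(dim, rng):
    """Draw a point uniformly from the unit sphere in R^dim."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def classify(w, theta, x):
    """Return +1 if w.x >= theta, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1


def online_perceptron(w_t, theta_t, n_examples, rng):
    """Call an example, update on a mistake, discard the example, repeat."""
    dim = len(w_t)
    w, theta, updates = [0.0] * dim, 0.0, 0
    for _ in range(n_examples):
        x = sample_sphere(dim, rng)
        label = classify(w_t, theta_t, x)
        if classify(w, theta, x) != label:
            # Perceptron step: w <- w +/- x, theta <- theta -/+ 1.
            w = [wi + label * xi for wi, xi in zip(w, x)]
            theta -= label
            updates += 1
    return w, theta, updates


def error_rate(w, theta, w_t, theta_t, trials, rng):
    """Monte Carlo estimate of the misclassification probability."""
    wrong = 0
    for _ in range(trials):
        x = sample_sphere(len(w_t), rng)
        if classify(w, theta, x) != classify(w_t, theta_t, x):
            wrong += 1
    return wrong / trials


rng = random.Random(0)
w_target = sample_sphere(5, rng)  # hypothetical target direction
w, theta, updates = online_perceptron(w_target, 0.0, 20_000, rng)
err = error_rate(w, theta, w_target, 0.0, 4_000, rng)
```

Because each example is discarded after use, memory stays constant, and the count `updates` can be compared informally with the $O(n/\epsilon^2)$ update bound for the accuracy actually reached.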
The Perceptron Algorithm is Fast
4 Nonmalicious Distribution-Independent Learning
Next we propose a modification of the distribution-independence assumption, which we have argued is too strong to apply to real world learning. We begin with an informal description. We allow an adversary (adversary 1) to choose the function f in the class F to present to the learning algorithm A. We allow a second adversary (adversary 2) to choose the distribution D arbitrarily. We demand that (with probability $1 - \delta$) A converge to produce an $\epsilon$-accurate hypothesis g. Thus far we have not changed Valiant's definition. Our restriction is simply that before their choice of distribution and function, adversaries 1 and 2 are not allowed to exchange information. Thus, they must work independently. This seems to us an entirely natural and reasonable restriction in the real world. Now if we pick any distribution and any hyperplane independently, it is highly unlikely that the probability measure will be concentrated close to the hyperplane. Thus, we expect to see that under our restriction, the perceptron algorithm is a distribution-independent learning algorithm for H and converges in time $O(n^2/\epsilon^3\delta^2)$ on a serial computer. If adversary 1 and adversary 2 do not exchange information, the least we can expect is that they have no notion of a preferred direction on the sphere. Thus, our informal demand that these two adversaries do not exchange information should imply, at least, that adversary 1 is equally likely to choose any $w_t$ (relative, e.g., to whatever direction adversary 2 takes as his z axis). This formalizes, sufficiently for our current purposes, the notion of nonmalicious distribution independence.
Theorem 1. Let U be the uniform probability measure on $S^n$ and D any other probability distribution on $S^n$. Let R be any region on $S^n$ of U-measure $\epsilon\delta$ and let x label some point in R. Choose a point y on $S^n$ randomly according to U. Consider the region R' formed by translating R rigidly so that x is mapped to y. Then the probability that the measure $D(R') > \epsilon$ is less than $\delta$.

Proof. Fix any point $z \in S^n$. Now choose y and thus R'. The probability that $z \in R'$ is $\epsilon\delta$. Thus, in particular, if we choose a point p according to D and then choose R', the probability that $p \in R'$ is $\epsilon\delta$. Now assume that there is probability greater than $\delta$ that $D(R') > \epsilon$. Then we arrive immediately at a contradiction, since we discover that the probability that $p \in R'$ is greater than $\epsilon\delta$.
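The counting step in this proof can also be written as an averaging argument. The following is only a sketch in the notation of Theorem 1; the indicator $\mathbf{1}[\cdot]$ and the expectation symbols are introduced here for illustration and do not appear in the original.

```latex
\begin{align*}
\mathbb{E}_{y \sim U}\bigl[D(R')\bigr]
  &= \mathbb{E}_{y \sim U}\,\mathbb{E}_{p \sim D}\,\mathbf{1}[p \in R'] \\
  &= \mathbb{E}_{p \sim D}\,\Pr_{y \sim U}[p \in R'] = \epsilon\delta, \\
\Pr_{y \sim U}\bigl[D(R') > \epsilon\bigr]
  &\le \frac{\mathbb{E}_{y \sim U}[D(R')]}{\epsilon} = \delta .
\end{align*}
```

The last line is Markov's inequality, which restates the contradiction used in the proof: if $D(R')$ exceeded $\epsilon$ with probability greater than $\delta$, its average over y would exceed $\epsilon\delta$.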
Corollary 2. The perceptron algorithm is a nonmalicious distribution-independent learning algorithm for half spaces on the unit sphere that converges, with confidence $1 - \delta$ to accuracy $1 - \epsilon$, in time of order $O(n^2/\epsilon^3\delta^2)$ on a serial computer.

Proof Sketch. Let $\kappa' = \epsilon\delta/2\sqrt{n}$. Apply Theorem 1 to show that a band formed by hyperplanes a distance $\kappa'/2$ on either side of the target
Eric B. Baum
hyperplane has probability less than $\delta$ of having measure for examples greater than $\epsilon/2$. Then apply the arguments of the last section, with $\kappa'$ in place of $\kappa$.
5 Summary

We have argued that the distribution independence condition, although tempting theoretically because of elegant results that show one can rapidly gather enough information for learning, is too restrictive for practical investigations. Very few classes of functions are known to be distribution-independent learnable in polynomial time (and arguably these are trivial cases). Moreover, some classes of functions have been shown not to be learnable by construction of small, cryptographically secure subclasses. These results seem to tell us little about learning in the natural world, and we would thus prefer a less restrictive definition of learnable. Finally we argued that distribution-independent learning requires enormous and unreasonable knowledge of the function to be learned, namely that it come from some specific class of finite V-C dimension. We show in Appendix A that, for uniform distributions, at least some classes of infinite V-C dimension are learnable.

Motivated by these arguments we computed the speed of convergence of the perceptron algorithm for a simple, natural distribution, uniform on $S^n$. We found it converges with high confidence to accuracy $\epsilon$ in time $O(n^2/\epsilon^3)$ on a serial computer. This is substantially faster, for large n, than the bounds known to us for any other learning algorithm. This speed is obtained, in part, because we used a variant of the perceptron algorithm that, rather than cycling through a fixed set of examples, called a new example for each update step.

Finally we proposed what we feel is a more natural definition of learnability, nonmalicious distribution-independent learnability, where although the distribution of examples D and the target concept may both be chosen by adversaries, these adversaries may not collude. We showed that the perceptron learning algorithm is a nonmalicious, distribution-independent polynomial time learning algorithm on $S^n$.
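The perceptron variant summarized above, which calls a new example for each step instead of cycling through a fixed set, can be illustrated with a minimal Python sketch. The function names, the dimension n = 10, the sample sizes, and the Monte Carlo error estimate are all assumptions made for this illustration; they are not taken from the paper.

```python
import math
import random


def random_unit_vector(n, rng):
    """Draw a point uniformly from the unit sphere in n dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def online_perceptron(target, n, num_examples, rng):
    """Perceptron that draws a fresh uniform example for every step.

    Returns the hypothesis weight vector and the number of updates made.
    """
    w = [0.0] * n
    updates = 0
    for _ in range(num_examples):
        x = random_unit_vector(n, rng)
        label = 1 if dot(target, x) >= 0 else -1
        guess = 1 if dot(w, x) >= 0 else -1
        if guess != label:
            # Standard perceptron update on a misclassified example.
            w = [wi + label * xi for wi, xi in zip(w, x)]
            updates += 1
    return w, updates


def estimate_error(w, target, n, trials, rng):
    """Monte Carlo estimate of how often w and the target disagree."""
    wrong = 0
    for _ in range(trials):
        x = random_unit_vector(n, rng)
        if (dot(w, x) >= 0) != (dot(target, x) >= 0):
            wrong += 1
    return wrong / trials


rng = random.Random(0)
n = 10                                   # illustrative dimension
target = random_unit_vector(n, rng)      # hypothetical target half space
w, updates = online_perceptron(target, n, 5000, rng)
error = estimate_error(w, target, n, 2000, rng)
```

Each step costs time of order n on a serial machine (two dot products and at most one vector update), which is the per-example cost used in the time bound above.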
Appendix A Convex Polyhedral Sets Are Learnable for Uniform Distribution

In this appendix we sketch proofs that two classes of functions with infinite V-C dimension are learnable. These classes are the class of convex sets and a class of nested differences of convex sets which we define. These results support our conjecture that full distribution independence is too restrictive a criterion to ask for if we want our results to have interesting applications. We believe these results are also of independent interest.
Theorem 3. The class C of convex sets is learnable in time polynomial in $\epsilon^{-1}$ and $\delta^{-1}$ if the distribution of examples is uniform on the unit square in d dimensions.
Remarks. (1) C is well known to have infinite V-C dimension. (2) So far as we know, C is not learnable in time polynomial in d as well.

Proof Sketch.⁴ We work, for simplicity, in two dimensions. Our arguments can readily be extended to d dimensions. The learning algorithm is to call M examples (where M will be specified). The positive examples are by definition within the convex set to be learned. Let $M_+$ be the set of positive examples. We classify examples as negative if they are linearly separable from $M_+$, i.e., outside of $c_+$, the convex hull of $M_+$. Clearly this approach will never misclassify a negative example, but may misclassify positive examples which are outside $c_+$ and inside $c_t$. To show $\epsilon$-accuracy, we must choose M large enough so that, with confidence $1 - \delta$, the symmetric difference of the target set $c_t$ and $c_+$ has area less than $\epsilon$. Divide the unit square into $k^2$ equal subsquares (see Fig. 2). Call the set of subsquares that the boundary of $c_t$ intersects $I_1$. It is easy to see that the cardinality of $I_1$ is no greater than 4k. The set $I_2$ of subsquares just inside $I_1$ also has cardinality no greater than 4k, and likewise for the set $I_3$ of subsquares just inside $I_2$. If we have an example in each of the squares in $I_2$, then $c_t$ and $c_+$ clearly have symmetric difference at most equal to the area of $I_1 \cup I_2 \cup I_3 \le 12k \times k^{-2} = 12/k$. Thus, take $k = 12/\epsilon$. Now choose M sufficiently large so that after M trials there is less than $\delta$ probability we have not got an example in each of the 4k squares in $I_2$. Thus, we need $LE(k^{-1}, M, 4k) < \delta$. Using equation 3.5, we see that $M = 500/\epsilon^2 \ln \delta$ will suffice.

Actually, one can learn (for uniform distributions) a more complex class of functions formed out of nested convex regions. For any set $\{c_1, c_2, \ldots, c_l\}$ of l convex regions in $R^d$, let $R_1 = c_1$ and for $j = 2, \ldots, l$ let $R_j = R_{j-1} \cap c_j$. Then define a concept $f = R_1 - R_2 + R_3 - \cdots R_l$. The class C of concepts so formed we call nested convex sets (see Fig. 3).
This class can be learned by an iterative procedure that peels the onion. Call a sufficient number of examples. (One can easily see that a number polynomial in l, $\epsilon$, and $\delta$ but of course exponential in d will suffice.) Let the set of examples so obtained be called S. Those negative examples that are linearly separable from all positive examples are in the outermost layer. Class these in set $S_1$. Those positive examples that are linearly separable from all negative examples in $S - S_1$ lie in the next layer; call this set of positive examples $S_2$. Those negative examples in $S - S_1$ linearly separable from all positive examples in $S - S_2$ lie in the next layer, $S_3$. In this way one builds up l + 1 sets of examples. (Some of these sets may be empty.) One can then apply the methods of Theorem 3 to build a classifying function from the outside in. If the innermost layer $S_{l+1}$ is (say) negative examples, then any future example is called negative if it is not linearly separable from $S_{l+1}$, or is linearly separable from $S_l$ and not linearly separable from $S_{l-1}$, or is linearly separable from $S_{l-2}$ but not linearly separable from $S_{l-3}$, etc.

⁴ This proof is inspired by arguments presented in Pollard (1984, pp. 22-24). After this proof was completed, the author heard D. Haussler present related, unpublished results at the 1989 Snowbird meeting on Neural Networks for Computing.

Figure 2: The boundary of the target concept $c_t$ is shown. The set $I_1$ of little squares intersecting the boundary of $c_t$ is hatched vertically. The set $I_2$ of squares just inside $I_1$ is hatched horizontally. The set $I_3$ of squares just inside $I_2$ is hatched diagonally. If we have an example in each square in $I_2$, the convex hull of these examples contains all points inside $c_t$ except possibly those in $I_1$, $I_2$, or $I_3$.
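The hull-based learner used in the proof sketch of Theorem 3 (classify a point as positive exactly when it lies in the convex hull of the positive examples) can be sketched in two dimensions as follows. This is an illustrative sketch only: the disc-shaped target, the sample sizes, and all function names are assumptions, and the hull is computed with Andrew's monotone chain rather than anything specified in the paper.

```python
import random


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies in the closed convex polygon with CCW vertices."""
    if len(hull) < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % len(hull)], q) >= 0
               for i in range(len(hull)))


def in_target(p):
    """Hypothetical convex target concept: a disc inside the unit square."""
    return (p[0] - 0.5) ** 2 + (p[1] - 0.5) ** 2 <= 0.16


rng = random.Random(1)
# Training sample drawn uniformly from the unit square.
sample = [(rng.random(), rng.random()) for _ in range(4000)]
hull = convex_hull([p for p in sample if in_target(p)])

# Fresh points to estimate the two kinds of error.
test_points = [(rng.random(), rng.random()) for _ in range(5000)]
false_pos = sum(1 for p in test_points
                if inside_hull(hull, p) and not in_target(p))
false_neg = sum(1 for p in test_points
                if in_target(p) and not inside_hull(hull, p))
```

As the proof notes, the hypothesis lies inside the target, so errors are one-sided: only positive points near the boundary of the target can be misclassified.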
Figure 3: $c_1$ is the five-sided region, $c_2$ is the triangular region, and $c_3$ is the square. The positive region $c_1 - c_2 \cap c_1 + c_3 \cap c_2 \cap c_1$ is shaded.
Acknowledgments

I would like to thank L. E. Baum for conversations and L. G. Valiant for comments on a draft. Portions of the work reported here were performed while the author was an employee of Princeton University and of the Jet Propulsion Laboratory, California Institute of Technology, and were supported by NSF Grant DMR-8518163 and agencies of the U.S. Department of Defense including the Innovative Science and Technology Office of the Strategic Defense Initiative Organization.
References

Angluin, D., and Valiant, L. G. 1979. Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comp. Systems Sci. 18, 155-193.
Baum, E. B. 1990. On learning a union of half spaces. J. Complex. 5(4).
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1987. Learnability and the Vapnik-Chervonenkis Dimension. U.C.S.C. Tech. Rep. UCSC-CRL-8720, and J. ACM, to appear.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Kearns, M., and Valiant, L. 1989. Cryptographic limitations on learning Boolean formulae and finite automata. Proc. 21st ACM Symp. Theory of Computing, 433-444.
Minsky, M., and Papert, S. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schapire, R. 1989. The strength of weak learnability. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, Research Triangle Park, NC, 28-33.
Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27(11), 1134-1142.

Received 28 July 1989; accepted 7 December 1989.
REVIEW
Communicated by Scott Kirkpatrick
Parallel Distributed Approaches to Combinatorial Optimization: Benchmark Studies on Traveling Salesman Problem

Carsten Peterson
Department of Theoretical Physics, University of Lund, Solvegatan 14A, S-22362 Lund, Sweden

We present and summarize the results from 50-, 100-, and 200-city TSP benchmarks presented at the 1989 Neural Information Processing Systems (NIPS) postconference workshop using neural network, elastic net, genetic algorithm, and simulated annealing approaches. These results are also compared with a state-of-the-art hybrid approach consisting of greedy solutions, exhaustive search, and simulated annealing.

1 Background
Using neural networks to find approximate solutions to difficult optimization problems is a very attractive prospect. In the original paper (Hopfield and Tank 1985) 10- and 30-city traveling salesman problems (TSP) were studied with very good results for the N = 10 case. For N = 30 the authors report on difficulties in finding optimal parameters. In Wilson and Pawley (1988) further studies of the Hopfield-Tank approach were made with respect to refinements and extension to larger problem sizes. Wilson and Pawley (1988) find the results discouraging. These and other similar findings have created a negative opinion in the community about the entire concept of using neural network algorithms for optimization problems.

Recently a novel scheme for mapping optimization problems onto neural networks was developed (Peterson and Söderberg 1989a). The key new ingredient in this method is the reduction of solution space by one dimension by using multistate neurons [Potts spin (Wu 1983)], thereby avoiding the destructive redundancy that plagues the approach of the original work by Hopfield and Tank (1985). The idea of using Potts glass for optimization problems was first introduced by Kanter and Sompolinsky (1987). This encoding was also employed by Van den Bout and Miller (1988). Very encouraging results were found when exploring this technique numerically (Peterson and Söderberg 1989a).

An alternative approach to solve TSP in brain-style computing was developed by Durbin and Willshaw (1987) and Durbin et al. (1989), where a feature map algorithm is used. Basically, an elastic "rubber band" is allowed to expand to touch all cities. The dynamic variables are the coordinates on the band, which vary with a gradient descent prescription on a cleverly chosen energy function. It has recently been demonstrated that there is a strong correspondence between this elastic net algorithm and the Potts approach (Simic 1990; Yuille 1990; Peterson and Söderberg 1989b).

Parallel to these developments genetic algorithms have been developed for solving these kinds of problems (Holland 1975; Mühlenbein et al. 1988) with extremely high quality results.

Given the above mentioned scepticism toward the neural network approach and the relatively unknown success of the genetic approach we found it worthwhile to test these three different parallel distributed approaches on a common set of problems and compare the results with "standard" simulated annealing. The simulations of the different algorithms were done completely independently at different geographic locations and presented at the 1989 NIPS postconference workshop (Keystone, Colorado). To further increase the value of this minireport we have also included comparisons with a hybrid approach consisting of greedy solutions, exhaustive search, and simulated annealing (Kirkpatrick and Toulouse 1985).

The testbeds consisted of 50-, 100-, and 200-city TSP with randomly chosen city coordinates within a unit square. All approaches used an identical set of such city coordinates. The reason for choosing TSP is its wide acceptance as an NP-complete benchmark problem. The problem sizes were selected to be large enough to challenge the algorithms and at the same time feasible with limited CPU availability. Since the neural network approaches are known to have a harder time with random (due to the mean field theory averaging involved) than structured problems (Peterson and Anderson 1988) we chose the former.

Neural Computation 2, 261-269 (1990) © 1990 Massachusetts Institute of Technology

2 The Algorithms
Before comparing and discussing the results we briefly list the key ingredients and parameter choices for each of the algorithms.

2.1 The Potts Neural Network (Peterson and Söderberg 1989a). This algorithm is based on an energy function similar to the one used in the original work by Hopfield and Tank (1985).
(2.1)
In equation 2.1 the first term minimizes the tour length ($D_{ij}$ is the intercity distance matrix), and the second and third terms ensure that each city is visited exactly once. A major novel property is that the condition

$\sum_a S_{ia} = 1$   (2.2)

is always satisfied; the dynamics is confined to a hyperplane rather than a hypercube. Consequently the corresponding mean field equations read

(2.3)

where $V_{ia} = \langle S_{ia} \rangle_T$ and the local fields $U_{ia}$ are given by

$U_{ia} = -\frac{1}{T} \frac{\partial E}{\partial V_{ia}}$   (2.4)

The mean field equations (equation 2.3) are minimizing the free energy ($F = E - TS$) corresponding to E in equation 2.1. A crucial parameter when solving equations 2.3 and 2.4 is the temperature T. It should be chosen in the vicinity of the critical temperature $T_c$. Peterson and Söderberg (1989a) give a method for estimating $T_c$ in advance by estimating the eigenvalue distribution of the linearized version of equation 2.3. This turns out to be very important for obtaining good solutions. For the details of annealing schedule, choice of $\alpha$, $\beta$, etc. used in this benchmark study we refer to the "black box" prescription in Peterson and Söderberg (1989a, Sect. 7).
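Because equations 2.1 and 2.3 did not survive in this copy, the following Python sketch rests on stated assumptions rather than on the author's exact formulas. It assumes the energy $E = \sum_{ij} D_{ij} \sum_a V_{ia} V_{j(a+1)} - \frac{\beta}{2} \sum_{ia} V_{ia}^2 + \frac{\alpha}{2} \sum_a (\sum_i V_{ia})^2$ and the Potts mean field (softmax) form $V_{ia} = e^{U_{ia}} / \sum_b e^{U_{ib}}$, with $U_{ia}$ as in equation 2.4. The function name, the four-city example, and the parameter values are hypothetical.

```python
import math


def potts_mean_field_sweep(V, D, T, alpha, beta):
    """One sequential sweep of V_ia = exp(U_ia) / sum_b exp(U_ib).

    V[i][a] is the mean field probability that city i sits at tour
    position a. U_ia = -(1/T) dE/dV_ia for the ASSUMED energy described
    in the accompanying text (not confirmed by the source).
    """
    n = len(V)
    for i in range(n):
        U = []
        for a in range(n):
            nxt, prv = (a + 1) % n, (a - 1) % n
            # Tour-length term: cities at the neighbouring tour positions.
            grad = sum(D[i][j] * (V[j][nxt] + V[j][prv]) for j in range(n))
            # Assumed penalty terms weighted by alpha and beta.
            grad += alpha * sum(V[j][a] for j in range(n)) - beta * V[i][a]
            U.append(-grad / T)
        # Softmax over positions keeps sum_a V_ia = 1 (equation 2.2).
        m = max(U)
        exp_u = [math.exp(u - m) for u in U]
        z = sum(exp_u)
        V[i] = [e / z for e in exp_u]
    return V


# Four cities on the corners of a unit square (illustrative data).
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
D = [[math.dist(p, q) for q in cities] for p in cities]
n = len(cities)
# Start near the symmetric point V_ia = 1/N with a small deterministic tilt.
V = [[1.0 / n + 0.01 * ((i + a) % n - 1.5) / n for a in range(n)]
     for i in range(n)]
for _ in range(50):
    V = potts_mean_field_sweep(V, D, T=0.5, alpha=1.0, beta=0.5)
row_sums = [sum(row) for row in V]
```

The point of the sketch is the normalization: each row of V is a probability distribution over tour positions after every update, which is the hyperplane constraint of equation 2.2, so no penalty term is needed to enforce it.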
2.2 The Elastic Net (Durbin and Willshaw 1987). This approach is more geometric. It is a mapping from a plane to a circle such that each city on the plane is mapped onto a point on the circle (path). The N city coordinates are denoted $x_i$. Points on the path are denoted $y_a$, where $a = 1, \ldots, M$. Note that M can in principle be larger than N. The algorithm works as follows: Start with a small radius circle containing the M $y_a$ coordinates with an origin slightly displaced from the center of gravity for the N cities. Let the y coordinates be the dynamic variables and change them such that the energy

(2.5)

is minimized. Gradient descent on equation 2.5 causes the initial circle to expand in order to minimize the distances between y and x coordinates in the first term of equation 2.5 at the same time as the total length is minimized by the second term. Good numerical results were obtained with this method with M > N (Durbin and Willshaw 1987). The parameter K in equation 2.5 has the role of a temperature and, as in the case of the neural network approach above, a critical value $K_0$ can be computed from a linear expansion (Durbin et al. 1989). The values of the parameters used in this benchmark (Durbin and Yuille, private communication) can be found in Table 1.

N     M     α     β
50    125   0.2   2.0   0.29   300
100   250   0.2   2.0   0.26   300
200   500   0.2   4.0   0.27   182

Table 1: Parameters Used for the Elastic Net Algorithm. The parameters $\alpha$ and $\beta$ are chosen to satisfy conditions for valid tours (Durbin et al. 1989).

This algorithm is closely related to the Potts neural network (Simic 1990; Yuille 1990; Peterson and Söderberg 1989b). Loosely speaking this connection goes as follows. Since the mean field variables $V_{ia}$ are probabilities (cf. equation 2.2) the average distances between tour positions $t_{ab} = |y_a - y_b|$ and average distances between cities and tour positions $d_{ia} = |x_i - y_a|$ can be expressed in terms of the distance matrix $D_{ij}$. The second term in equation 2.5 can then be identified with the tour length term in equation 2.1 if the metric is chosen to be $D_{ij}^2$ rather than $D_{ij}$. The first term in equation 2.5 corresponds to the entropy of the Potts system (equation 2.1); gradient descent on equation 2.5 corresponds to minimizing the free energy of the Potts system, which is exactly what the MFT equations are doing.

There is a difference between the two approaches, which has consequences on the simulation level. Each decision element $S_{ia}$ in the Potts neural network approach consists of two binary variables, which in the mean field theory treatment becomes two analog variables; N cities require $N^2$ analog variables. In the elastic net case N cities only require 2M (M > N) analog variables; it is a more economical way of representing the problem.

2.3 The Genetic Algorithm (Mühlenbein et al. 1988; Gorges-Schleuter 1989). For details we refer the reader to Mühlenbein et al. (1988) and Gorges-Schleuter (1989). Here we briefly list the main steps and the parameters used.

1. Give the problem to M individuals.
2. Let each individual compute a local minimum (2-quick¹).

3. Let each individual choose a partner for mating. In contrast to earlier genetic algorithms (Holland 1975) global ranking of all individuals is not used. Rather, local neighborhoods of size D - 1 were used in which the selection is done with weights $(r_1, r_2, \ldots, r_D)$. The global best gets weight $r_1$ and the remaining local neighbors get $r_2, \ldots, r_D$, respectively.

4. Crossover and mutation. A random string of "genes" is copied from the parent to the offspring. The string size is randomly chosen in the interval $[c_1, c_2]$. Mutation rate = m.

5. If not converged return to point 2.

The parameters used for the benchmarks (Mühlenbein, private communication) are shown in Table 2.

¹ This is the 2-opt of Lin (1965) with no checkout.

N     M    D   r_1 : r_2 : ... : r_D                  [c_1, c_2]    m      N_gen
50    64   8   0.25:0.20:0.15:0.10:0.10:0.10:0.05   [N/4, N/2]    0.01   30
100                                                                     23
200                                                                     562

Table 2: Parameters Used for the Genetic Algorithm. The choice of M = 64 was motivated by the available 64 T800 transputer configuration of Mühlenbein (private communication).

2.4 Simulated Annealing (Kirkpatrick et al. 1983). The parameters of this algorithm are the initial temperature $T_0$, and the annealing schedule, which determines the next value of T and the length of time L spent at each T. The annealing schedule used is very generous (see Fig. 1) and is based on a requirement that the temperature be high enough such that the variance of the energy is less than T:

$(\langle E^2 \rangle - \langle E \rangle^2) / T \ll 1$   (2.6)

1. $T_0 = 10$; $L_0 = N$.
2. Until variance of cost function < 0.05, update according to T = T/0.8; L = N (heating up).
3. While percentage of accepted moves > 50%, update according to T = 0.95 × T; L = N (cooling).
4. Until number of uphill moves = 0, update according to T = 0.95 × T; L = 16N (slow cooling).

Figure 1: Annealing schedule.
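The cooling part of such a schedule can be sketched in Python as below. This is a simplified illustration, not the benchmark code: it keeps only the T = 0.95 × T cooling with L = N proposed moves per temperature and the "no uphill moves" stopping rule, and it omits the heating-up phase and the L = 16N slow-cooling phase. The segment-reversal move, the `max_stages` safety cap, the 20-city instance, and all names are assumptions.

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed tour visiting cities in the given order."""
    return sum(math.dist(cities[tour[k]], cities[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))


def anneal(cities, rng, t0=10.0, cooling=0.95, max_stages=500):
    """Metropolis annealing with geometric cooling (simplified schedule)."""
    n = len(cities)
    tour = list(range(n))
    length = tour_length(tour, cities)
    best, best_length = tour[:], length
    temperature = t0
    for _ in range(max_stages):          # safety cap, not in the source
        uphill = 0
        for _ in range(n):               # L = N proposed moves per stage
            i, j = sorted(rng.sample(range(n), 2))
            # Assumed move: reverse the segment between positions i and j.
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            new_length = tour_length(candidate, cities)
            delta = new_length - length
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                if delta > 0:
                    uphill += 1
                tour, length = candidate, new_length
                if length < best_length:
                    best, best_length = tour[:], length
        if uphill == 0:                  # stop once no uphill move is accepted
            break
        temperature *= cooling
    return best, best_length


rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(20)]
initial_length = tour_length(list(range(20)), cities)
best, best_length = anneal(cities, rng)
```

Returning the best tour seen, rather than the final one, is a convenience of this sketch; it guarantees the result is never longer than the starting tour.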
2.5 A Hybrid Approach² (Kirkpatrick and Toulouse 1985). The above parallel distributed and simulated annealing approaches are clean in the sense that each of them relies on a single algorithm with no optimization of initial states etc. It would not be surprising if a hybrid approach were most rewarding when one wants to push the quality to the limits. Indeed this turns out to be the case for a combination of greedy, simulated annealing, and exhaustive search approaches (Kirkpatrick and Toulouse 1985). The procedure is as follows:

1. A greedy tour is obtained by starting at city #1, then proceeding to the nearest remaining city, then the nearest remaining city, and so on until all cities have been visited once.

2. Simulated annealing, using two-bond and restricted three-bond rearrangements as "moves," is used to equilibrate the tours at a temperature chosen to yield approximately the same length as found by the greedy step 1. Subsequent annealing lowers the temperature to shrink the tours until accepted moves are scarce.

3. Exhaustive search is performed on the result of step 2, using two-bond rearrangements until no more improvements are found, then restricted three-bond rearrangements. If a three-bond improvement is found, the exhaustive search resumes with two-bond moves, halting when no improvements are found using either type of move.

3 Results
In Table 3 we compare the averaged performance of the neural network (5 trials), elastic net (1 trial), genetic algorithm (1 trial), simulated annealing (5 trials), and hybrid (5 trials) algorithms. For comparison we have also included the results from random distributions based on 1000 trials. Results of the greedy solutions (step 1 in the hybrid approach) are approximately equal to the figures reported for NN and SA in Table 3.
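The two baselines mentioned here, the greedy tour of step 1 and a random tour, are simple enough to sketch. The Python below is illustrative only; the city coordinates are freshly generated rather than the benchmark set, city index 0 stands in for "city #1", and the function names are assumptions.

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed tour visiting cities in the given order."""
    return sum(math.dist(cities[tour[k]], cities[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))


def greedy_tour(cities):
    """Start at city 0 and repeatedly move to the nearest unvisited city."""
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        here = cities[tour[-1]]
        nearest = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        unvisited.remove(nearest)
        tour.append(nearest)
    return tour


rng = random.Random(0)
# 50 cities drawn uniformly from the unit square (not the benchmark data).
cities = [(rng.random(), rng.random()) for _ in range(50)]

greedy = greedy_tour(cities)
random_tour = list(range(50))
rng.shuffle(random_tour)

greedy_length = tour_length(greedy, cities)
random_length = tour_length(random_tour, cities)
```

On instances of this kind the greedy tour is several times shorter than a random one, which is the gap between the RD column and the other columns of Table 3.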
4 Summary and Comments

The overall performance of the three different parallel distributed approaches is impressive.

• They are all better than or equal to "standard" simulated annealing. These methods are here to stay!

• The genetic algorithm is the winner, closely followed by the hybrid approach of Kirkpatrick and Toulouse (1985) and Kirkpatrick (private communication) and the elastic net method.

• Even though the neural network approach does not fully match the performance of the other two, the results invalidate the common saying: "Neural networks optimization algorithms do not work for N > 30 problems." This algorithm has also been tested on larger problem sizes (C. Peterson, unpublished) than presented in this report with no sign of quality deterioration.³ It is somewhat surprising that the performance of the neural network algorithm is of less quality than that of the elastic net given the close connection discussed above. We believe this is due to the fact that the former is more sensitive to the position of $T_c$. Indeed, subsequent modifications of the annealing procedure and choice of $T_c$ (C. Peterson, unpublished) have led to better results.³ Also, the performance of the NN algorithm can be substantially improved when starting out from a greedy solution, heating the system up, and letting it relax with the MFT equations.³

• Another important issue is computing time. It splits up into two parts: the number of operations (in serial execution) per iteration for a given problem size N and the number of iterations needed for convergence. In Table 4 we compare these numbers for the different algorithms. The convergence times in this table are all empirical. The numbers in Table 4 have limited value since the real strength in the distributed parallel approaches is their inherent parallelism.

² The author is responsible for denoting this method "hybrid."

³ In order to strictly adhere to the NIPS presentations we did not include these extensions and improvements in this report.

N     NN      EN      GA      SA      HA      RD
50    6.61    5.62    5.58    6.80    -       26.95
100   8.58    7.69    7.43    8.68    7.48    52.57
200   12.66   11.14   10.49   12.79   10.53   106.42

Table 3: Performance Averages (Tour Lengths) of the Neural Network (NN), Elastic Net (EN), Genetic Algorithm (GA), Simulated Annealing (SA), Hybrid Approach (HA), and Random Distribution (RD).

      No. of operations   No. of iterations   Total
NN    N^3                 Const.              N^3
EN    N^2                 Const.              N^2
GA    MN                  N (?)               MN^2 (?)
HA    N                   N                   N^2

Table 4: Time Consumption for the Different Algorithms When Executed Serially.
Acknowledgments

I would like to thank Richard Durbin, Alan Yuille, Heinz Mühlenbein, and Scott Kirkpatrick for providing simulation results from the elastic net (Durbin and Yuille), genetic algorithm (Mühlenbein), and a hybrid approach (Kirkpatrick) for these comparisons. Also the stimulating atmosphere provided by the organizers of the 1989 NIPS postconference workshop is very much appreciated.
References

Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the traveling salesman problem. Neural Comp. 1, 348.
Gorges-Schleuter, M. 1989. ASPARAGOS - An asynchronous parallel genetic optimization strategy. Proceedings of the Third International Conference on Genetic Algorithms, D. Schaffer, ed., p. 422. Morgan Kaufmann, San Mateo.
Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141.
Kanter, I., and Sompolinsky, H. 1987. Graph optimization problems and the Potts glass. J. Phys. A 20, L673.
Kirkpatrick, S., and Toulouse, G. 1985. Configuration space analysis of the travelling salesman problem. J. Phys. 46, 1277.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671.
Lin, S. 1965. Computer solutions of the traveling salesman problem. Bell Syst. Techn. J. 44, 2245.
Mühlenbein, H., Gorges-Schleuter, M., and Krämer, O. 1988. Evolution algorithms in combinatorial optimization. Parallel Comp. 7, 65.
Peterson, C., and Anderson, J. R. 1988. Applicability of mean field theory neural network methods to graph partitioning. Tech. Rep. MCC-ACA-ST-064-88.
Peterson, C., and Söderberg, B. 1989a. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3.
Peterson, C., and Söderberg, B. 1989b. The elastic net as a mean field theory neural network. Tech. Rep. LU TP 80-18.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Tech. Reps. CALT-68-1556 and C3P-787. Network: Comp. in Neural Syst. (in press).
Van den Bout, D. E., and Miller, T. K., III. 1988. A traveling salesman objective function that works. Proceedings of the IEEE International Conference on Neural Networks, p. 299.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63.
Wu, F. Y. 1983. The Potts model. Rev. Modern Phys. 54, 235.
Yuille, A. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Received 1 February 90; accepted 22 May 90.
Communicated by Fernando Pineda
NOTE
Faster Learning for Dynamic Recurrent Backpropagation

Yan Fang
Terrence J. Sejnowski
The Salk Institute, Computational Neurobiology Laboratory, 10010 N. Torrey Pines Road, La Jolla, CA 92037 USA
Neural Computation 2, 270-273 (1990) © 1990 Massachusetts Institute of Technology

The backpropagation learning algorithm for feedforward networks (Rumelhart et al. 1986) has recently been generalized to recurrent networks (Pineda 1989). The algorithm has been further generalized by Pearlmutter (1989) to recurrent networks that produce time-dependent trajectories. The latter method requires much more training time than the feedforward or static recurrent algorithms. Furthermore, the learning can be unstable and the asymptotic accuracy unacceptable for some problems. In this note, we report a modification of the delta weight update rule that significantly improves both the performance and the speed of the original Pearlmutter learning algorithm.

Our modified updating rule, a variation on that originally proposed by Jacobs (1988), allows adaptable independent learning rates for individual parameters in the algorithm. The update rule for the ith weight, $w_i$, is given by the delta-bar-delta rule:

$\Delta w_i(t) = -\epsilon_i(t)\,\delta_i(t)$   (1.1)

with the change in learning rate $\epsilon_i(t)$ on each epoch given by

$\Delta\epsilon_i(t) = \kappa_i$ if $\bar{\delta}_i(t-1)\,\delta_i(t) > 0$; $\;-\phi_i\,\epsilon_i(t)$ if $\bar{\delta}_i(t-1)\,\delta_i(t) < 0$; $\;0$ otherwise   (1.2)

where $\kappa_i$ are parameters for an additive increase, and $\phi_i$ are parameters for a multiplicative decrease in the learning rates $\epsilon_i$, and

$\delta_i(t) = \partial E(t) / \partial w_i(t)$   (1.3)

where E(t) is the total error for epoch t, and

$\bar{\delta}_i(t) = (1 - \theta_i)\,\delta_i(t) + \theta_i\,\bar{\delta}_i(t-1)$   (1.4)

where $\theta_i$ are momentum parameters. Unlike the traditional delta rule that performs steepest descent on the local error surface, the error gradient vector $\{\delta_i(t)\}$ and the weight update vector $\{\Delta w_i\}$ have different directions. This learning rule assures that the learning rate $\epsilon_i$ will be incremented by $\kappa_i$ if the error derivatives of consecutive epochs have the same sign, which generally means a smooth local error surface. On the other hand, if the error derivatives keep on changing sign, the algorithm decreases the learning rates. This scheme achieves fast parameter estimation while avoiding most cases of catastrophic divergences. In addition to learning the weights, the time constants in dynamic algorithms can also be learned by applying the same procedure.

One problem with the above adaptational method is that the learning rate increments, $\kappa_i$, were too large during the late stages of learning when fine adjustments should be made. Scaling the increments to the squared error was found to give good performance:
This introduces a global parameter, $\lambda$, but one that could be broadcast to all weights in a parallel implementation. We simulated the figure "eight" presented in Pearlmutter (1989) using the modified delta-bar-delta updating rule, the result of which is shown in Figure 1a. This is a task for which hidden units are necessary because the trajectory crosses itself. According to the learning curve in Figure 1b, the error decreased rapidly and the trajectory converged within 2000 epochs to values that were better than those reported by Pearlmutter (1989) after 20,000 epochs.¹ We also solved the same problem using a standard conjugate gradient algorithm to update the weights (Press et al. 1988). The conjugate gradient method converged very quickly, but always to local minima (Figure 1c). It has the additional disadvantage in a parallel implementation of requiring global information for the weight updates. We have successfully applied the above adaptational algorithm to other problems for which the original method was unstable and did not produce acceptable solutions. In most of these cases both the speed of learning and the final convergence were significantly improved (Lockery et al. 1990a,b).

¹ We replicated this result, but the original algorithm was very sensitive to the choice of parameters and initial conditions.
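The delta-bar-delta update of equations 1.1-1.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight step is taken to be $\Delta w_i = -\epsilon_i \delta_i$, the learning rate is adapted before the weight is moved, the increment $\kappa$ is held fixed rather than scaled by the error, and the quadratic error surface (`toy_error`, `toy_gradient`) is a hypothetical stand-in for the recurrent network error. None of the function names come from the original note.

```python
def delta_bar_delta_step(w, eps, delta_bar, grad, kappa=0.01, phi=0.5, theta=0.1):
    """Apply one epoch of the delta-bar-delta rule to lists of per-weight values."""
    new_w, new_eps, new_bar = [], [], []
    for w_i, eps_i, bar_prev, d_i in zip(w, eps, delta_bar, grad):
        agreement = bar_prev * d_i            # sign test of equation 1.2
        if agreement > 0:
            eps_i += kappa                    # additive increase
        elif agreement < 0:
            eps_i -= phi * eps_i              # multiplicative decrease
        new_w.append(w_i - eps_i * d_i)       # assumed form of the weight step
        new_bar.append((1 - theta) * d_i + theta * bar_prev)  # equation 1.4
        new_eps.append(eps_i)
    return new_w, new_eps, new_bar


def toy_error(w):
    """Hypothetical quadratic error with unequal curvature per weight."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def toy_gradient(w):
    """Gradient of toy_error with respect to each weight."""
    return [w[0], 10.0 * w[1]]


def run_toy(epochs=200):
    """Run the rule on the toy surface from a fixed starting point."""
    w, eps, bar = [1.0, 1.0], [0.01, 0.01], [0.0, 0.0]
    for _ in range(epochs):
        w, eps, bar = delta_bar_delta_step(w, eps, bar, toy_gradient(w))
    return w, eps


final_w, final_eps = run_toy()
```

Because each weight carries its own learning rate, the shallow direction can keep accelerating while the steep direction is cut back as soon as its gradient changes sign, which is the behavior the text attributes to the rule.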
Yan Fang and Terrence J. Sejnowski
[Figure 1, panels (b) and (c): learning curves; horizontal axis: Epoch.]
Figure 1: (a) Output from a trained network (solid) plotted against the desired figure (markers) after 1672 learning epochs. Initial weights were randomly sampled from -1.0 to 1.0 and initial time constants from 1.0 to 3.0. An upper limit of 10 and a lower limit of 0.01 were put on the range of the time constants to reduce instabilities. About 75% of the simulation runs produced stable solutions and this example had better than average performance. (b) Learning curve for the same simulation as in (a). Parameters used: φ = 0.5, θ = 0.1, λ = 0.01, time step size Δt = 0.25. Final error E = 0.005. Average CPU time per epoch (on a MIPS M/120) was 0.07 sec. Notice the dramatic spiking after the first plateau. (c) Learning curve using a conjugate gradient method started with the same initial weights and time constants. Final error E = 1.7. Average CPU time per epoch was 2 sec.
References

Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1(4), 295-307.
Lockery, S. R., Fang, Y., and Sejnowski, T. J. 1990a. Neural network analysis of distributed representations of sensorimotor transformations in the leech. In Neural Information Processing Systems 1989, D. Touretzky, ed. Morgan Kaufmann, Los Altos.
Lockery, S. R., Fang, Y., and Sejnowski, T. J. 1990b. A dynamic neural network model of sensorimotor transformations in the leech. Neural Comp. 2, 274-282.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1(2), 263-269.
Pineda, F. J. 1989. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59(19), 2229-2232.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C, Chapter 10. Cambridge University Press, Cambridge.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by backpropagating errors. Nature (London) 323, 533-536.
Received 8 January 90; accepted 23 May 90.
Communicated by Richard Andersen
A Dynamic Neural Network Model of Sensorimotor Transformations in the Leech

Shawn R. Lockery, Yan Fang, Terrence J. Sejnowski
Computational Neurobiological Laboratory, Salk Institute for Biological Studies, Box 85800, San Diego, CA 92138 USA

Interneurons in leech ganglia receive multiple sensory inputs and make synaptic contacts with many motor neurons. These "hidden" units coordinate several different behaviors. We used physiological and anatomical constraints to construct a model of the local bending reflex. Dynamic networks were trained on experimentally derived input-output patterns using recurrent backpropagation. Units in the model were modified to include electrical synapses and multiple synaptic time constants. The properties of the hidden units that emerged in the simulations matched those in the leech. The model and data support distributed rather than localist representations in the local bending reflex. These results also explain counterintuitive aspects of the local bending circuitry.

1 Introduction

Neural network modeling techniques have recently been used to predict and analyze the connectivity of biological neural circuits (Zipser and Andersen 1988; Lehky and Sejnowski 1988; Anastasio and Robinson 1989). Neurons are represented as simplified processing units and arranged into model networks that are then trained to reproduce the input-output function of the reflex or brain region of interest. After training, the receptive and projective fields of hidden units in the network often bear striking similarities to actual neurons and can suggest functional roles of neurons with inputs and outputs that are hard to grasp intuitively. We applied this approach to the local bending reflex of the leech, a three-layered, feedforward network comprising a small number of identifiable neurons whose connectivity and input-output function have been determined physiologically.
We found that model local bending networks trained using recurrent backpropagation (Pineda 1987; Pearlmutter 1989) to reproduce a physiologically determined input-output function contained hidden units whose connectivity and temporal response properties closely resembled those of identified neurons in the biological network.
Neural Computation 2, 274-282 (1990) © 1990 Massachusetts Institute of Technology
Dynamic Neural Network Model in the Leech
The similarity between model and actual neurons suggested that local bending is produced by distributed representations of sensory and motor information.

2 The Local Bending Reflex
In response to a mechanical stimulus, the leech withdraws from the site of contact (Fig. 1a). This is accomplished by contraction of longitudinal muscles beneath the stimulus and relaxation of longitudinal muscles on the opposite side of the body, resulting in a U-shaped local bend (Kristan 1982). The form of the response is independent of the site of stimulation: dorsal, ventral, and lateral stimuli produce an appropriately oriented withdrawal. Major input to the local bending reflex is provided by four pressure-sensitive mechanoreceptors called P cells, each with a receptive field confined to a single quadrant of the body wall (Fig. 1b). Output to the muscles is provided by eight types of longitudinal muscle motor neurons, one to four excitatory and inhibitory motor neurons for each body wall quadrant (Stuart 1970; Ort et al. 1974). Motor neurons are connected by chemical and electrical synapses that introduce the possibility of feedback among the motor neurons. Dorsal, ventral, and lateral stimuli each produce a pattern of P cell activation that results in a unique pattern of activation and inhibition of the motor neurons (Lockery and Kristan 1990a).

Connections between sensory and motor neurons are mediated by a layer of interneurons (Kristan 1982). Nine types of local bending interneurons have been identified (Lockery and Kristan 1990b). These comprise the subset of the local bending interneurons that contributes to dorsal local bending because they are excited by the dorsal P cell and in turn excite the dorsal excitatory motor neuron. There appear to be no functional connections between interneurons. Other interneurons remain to be identified, such as those that inhibit the dorsal excitatory motor neurons. Interneuron input connections were determined by recording the amplitude of the postsynaptic potential in an interneuron while each of the P cells was stimulated with a standard train of impulses (Lockery and Kristan 1990b).
Output connections were determined by recording the amplitude of the postsynaptic potential in each motor neuron when an interneuron was stimulated with a standard current pulse. Most interneurons received substantial input from three or four P cells, indicating that the local bending network forms a distributed representation of sensory input (Fig. 1c).

3 Neural Network Model
Because sensory input is represented in a distributed fashion, most interneurons are active in all forms of local bending. Thus, in addition
S. R. Lockery, Y. Fang, and T. J. Sejnowski
[Figure 1, panels (a)-(c). Legible labels: Left, Right; Dorsal, Ventral, Lateral; excitatory, inhibitory.]
Figure 1: (a) Local bending behavior. Partial view of a leech in response to dorsal, ventral, and lateral stimuli. (b) Local bending circuit. The main input to the reflex is provided by the dorsal and ventral P cells (PD and PV). Control of local bending is largely provided by motor neurons whose field of innervation is restricted to single left-right, dorsal-ventral quadrants of the body; dorsal and ventral quadrants are innervated by both excitatory (DE and VE) and inhibitory (DI and VI) motor neurons. Motor neurons are connected by electrical synapses (resistor symbol) and excitatory (triangle) and inhibitory (filled circle) chemical synapses. Sensory input to motor neurons is mediated by a layer of interneurons. Interneurons that were excited by PD and that in turn excite DE have been identified (hatched); other types of interneurons remain to be identified (open). (c) Input and output connections of the nine types of dorsal local bending interneurons. Within each gray box, the upper panel shows input connections from sensory neurons, the middle panel shows output connections to inhibitory motor neurons, and the lower panel shows output connections to excitatory motor neurons. Box area is proportional to the amplitude of the connection determined from intracellular recordings of interneurons or motor neurons. White boxes indicate excitatory connections and black boxes indicate inhibitory connections. Blank spaces denote connections whose strength has not been determined.
to contributing to dorsal local bending, most interneurons are also active during ventral and lateral bending when some or all of their output effects are inappropriate to the observed behavioral response. This suggests that the inappropriate effects of the dorsal bending interneurons must be offset by other as yet unidentified interneurons and raises the possibility that local bending is the result of simultaneous activation of a population of interneurons with multiple sensory inputs and both appropriate and inappropriate effects on many motor neurons. It was not obvious, however, that such a population was sufficient, given the constraints imposed by the input-output function and connections known to exist in the network. The possibility remained that interneurons specific for each form of the behavior were required to produce each output pattern.

To address this issue, we used recurrent backpropagation (Pearlmutter 1989) to train a dynamic network of model neurons (Fig. 2a). The network had four input units representing the four P cells, and eight output units representing the eight motor neuron types. Between input and output units was a single layer of 10 hidden units representing the interneurons. Neurons were represented as single electrical compartments with an input resistance and time constant. The membrane potential ($V_i$) of each neuron was given by
$$T_i \frac{dV_i}{dt} = -V_i + R_i\,(I_e + I_c)$$
where $T_i$ and $R_i$ are the time constant and input resistance of the neuron and $I_e$ and $I_c$ are the sum of the electrical and chemical synaptic currents from presynaptic neurons. Current due to electrical synapses was given by
$$I_e = \sum_j g_{ij}\,(V_j - V_i)$$
where $g_{ij}$ is the coupling conductance between neurons $i$ and $j$. To implement the delay associated with chemical synapses, synapse units (s-units) were inserted between pairs of neurons connected by chemical synapses. The activation of each s-unit was given by
$$T_{ij} \frac{dS_{ij}}{dt} = -S_{ij} + f(V_j)$$
where $T_{ij}$ is the synaptic time constant and $f(V_j)$ was a physiologically determined sigmoidal function ($0 \le f \le 1$) relating pre- and postsynaptic membrane potential at an identified monosynaptic connection in the leech (Granzow et al. 1985). Current due to chemical synapses was given by
$$I_c = \sum_j w_{ij}\,S_{ij}$$
where $w_{ij}$ is the strength of the chemical synapse between units $i$ and $j$. Thus synaptic current is a graded function of presynaptic voltage, a common feature of neurons in the leech (Friesen 1985; Granzow et al. 1985;
[Figure 2, panels (a) and (b). Legible labels: Right; Sensory neurons.]
Figure 2: (a) The local bending network model. Four sensory neurons were connected to eight motor neurons via a layer of 10 interneurons. Neurons were represented as single electrical compartments whose voltage varied as a function of time (see text). Known electrical and chemical connections among motor neurons were assigned fixed connection strengths (g's and w's in the motor layer) determined from intracellular recordings. Interneuron input and output connections were adjusted by recurrent backpropagation. Chemical synaptic delays were implemented by inserting s-units between chemically connected pairs of neurons. S-units with different time constants were inserted between sensory and interneurons to account for fast and slow components of synaptic potentials recorded in interneurons (see Fig. 4). (b) Output of the model network in response to simultaneous activation of both PDs (stim). The response of each motor neuron (rows) is shown before and after training. The desired responses from the training set are shown on the right for comparison (target).
Thompson and Stent 1976) and other invertebrates (Katz and Miledi 1967; Burrows and Siegler 1978; Nagayama and Hisada 1987). Chemical and electrical synaptic strengths between motor neurons were determined by recording from pairs of motor neurons and were not adjusted by the training algorithm. Interneuron input and output connections were given small initial values that were randomly assigned and subsequently adjusted during training. During training, input connections were constrained to be positive to reflect the fact that only excitatory interneuron input connections were seen (Fig. 1c), but no constraints were placed on the number of input or output connections. Synaptic time constants were assigned fixed values. These were adjusted by hand to fit
the time course of motor neuron synaptic potentials (Lockery and Kristan 1990a), or determined from pairwise motor neuron recordings (Granzow et al. 1985).

4 Results
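Before turning to the results, the model dynamics of Section 3 can be sketched with forward-Euler integration. The sketch assumes the leaky-integrator forms $T_i\,dV_i/dt = -V_i + R_i(I_e + I_c)$, $I_e = \sum_j g_{ij}(V_j - V_i)$, $T_{ij}\,dS_{ij}/dt = -S_{ij} + f(V_j)$, and $I_c = \sum_j w_{ij}S_{ij}$. The logistic `f`, the external stimulus current `I_ext`, the arbitrary units, and all parameter values are hypothetical choices for illustration; the measured sigmoid and the trained weights are not reproduced here.

```python
import math


def f(v, v_half=0.0, slope=1.0):
    """Hypothetical logistic stand-in for the measured sigmoid, 0 <= f <= 1."""
    return 1.0 / (1.0 + math.exp(-(v - v_half) / slope))


def simulate(T, R, g, w, T_syn, I_ext, dt=0.01, steps=5000):
    """Forward-Euler integration of the network; returns final V and S.

    g[i][j]: electrical coupling between neurons i and j.
    w[i][j]: chemical synapse strength from neuron j onto neuron i.
    T_syn[i][j]: time constant of the s-unit on that chemical synapse.
    I_ext[i]: assumed constant stimulus current injected into neuron i.
    """
    n = len(T)
    V = [0.0] * n
    S = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        dV = []
        for i in range(n):
            I_e = sum(g[i][j] * (V[j] - V[i]) for j in range(n))
            I_c = sum(w[i][j] * S[i][j] for j in range(n))
            dV.append((-V[i] + R[i] * (I_e + I_c + I_ext[i])) / T[i])
        # s-units low-pass filter the presynaptic sigmoid output
        for i in range(n):
            for j in range(n):
                if w[i][j] != 0.0:
                    S[i][j] += dt * (-S[i][j] + f(V[j])) / T_syn[i][j]
        for i in range(n):
            V[i] += dt * dV[i]
    return V, S


zeros = [[0.0, 0.0], [0.0, 0.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]

# Two neurons joined only by a chemical synapse from neuron 0 onto neuron 1.
V_chem, S_chem = simulate(T=[1.0, 1.0], R=[1.0, 1.0], g=zeros,
                          w=[[0.0, 0.0], [1.0, 0.0]],
                          T_syn=[[1.0, 1.0], [0.5, 1.0]], I_ext=[2.0, 0.0])

# Two neurons joined only by an electrical synapse; only neuron 0 is driven.
V_elec, S_elec = simulate(T=[1.0, 1.0], R=[1.0, 1.0],
                          g=[[0.0, 1.0], [1.0, 0.0]], w=zeros,
                          T_syn=ones, I_ext=[1.0, 0.0])
```

In the chemical case the postsynaptic potential settles at $R_1 w_{10} f(V_0)$, a graded function of presynaptic voltage; in the electrical case the coupled pair shares the injected current, so the undriven neuron follows the driven one at reduced amplitude.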
Model networks were trained to produce the amplitude and time course of synaptic potentials recorded in all eight motor neurons in response to trains of P cell impulses (Lockery and Kristan 1990a). The training set included the response of all eight motor neurons when each P cell was stimulated alone and when P cells were stimulated in pairs. After 6,000-10,000 training epochs, the output of the model closely matched the desired output for all patterns in the training set (Fig. 2b).

To compare interneurons in the model network to actual interneurons, simulated physiological experiments were performed. Interneuron input connections were determined by recording the amplitude of the postsynaptic potential in a model interneuron while each of the P cells was stimulated with a standard current pulse. Output connections were determined by recording the amplitude of the postsynaptic potential in each motor neuron when an interneuron was stimulated with a standard current pulse. Model and actual interneurons were compared by counting the number of input and output connections that were sufficiently strong to produce a postsynaptic potential of 0.5 mV or more in response to a standard stimulus. Model interneurons (Fig. 3a), like those in the real network (Fig. 1c), received three or four substantial connections from P cells and had significant effects on most of the motor neurons (Fig. 3b and c). Most model interneurons were therefore active during each form of the behavior and the output connections of the interneurons were only partially consistent with each form of the local bending response. Thus the appropriate motor neuron responses were produced by the summation of many appropriate and inappropriate interneuron effects. This result demonstrates that the apparently appropriate and inappropriate effects of dorsal local bending interneurons reflect the contribution these interneurons make to other forms of the local bending reflex.
There was also agreement between the time course of the response of model and actual interneurons to P cell stimulation (Fig. 4). In the actual network, interneuron synaptic potentials in response to trains of P cell impulses had a fast and slow component. Some interneurons showed only the fast component, some only the slow, and some showed both components (mixed). Although no constraints were placed on the temporal response properties of interneurons, the same three types of interneuron were found in the model network. The three different types of interneuron temporal response were due to different relative connection strengths of fast and slow s-units impinging on a given interneuron (Fig. 2a).
[Figure 3, panels (a)-(c). Legible labels: dorsal, ventral; 0.5 mV scale; excitatory, inhibitory. Horizontal axes: Number of input connections (b), Number of output connections (c).]
Figure 3: (a) Input and output connections of model local bending interneurons. Model interneurons, like the actual interneurons, received substantial inputs from three or four sensory neurons and had significant effects on most of the motor neurons. Symbols as in Figure 1c. (b,c) Number of interneurons in model (hatched) and actual (solid) networks with the indicated number of significant input and output connections. Input connections (b) and output connections (c) were considered significant if the synaptic potential in the postsynaptic neuron was greater than 0.5 mV. Counts are given in percent of all interneurons because model and actual networks had different numbers of interneurons.
5 Discussion
Our results show that the network modeling approach can be adapted to models with more realistic neurons and synaptic connections, including electrical connections, which occur in both invertebrates and vertebrates. The qualitative similarity between model and actual interneurons demonstrates that a population of interneurons resembling the identified dorsal local bending interneurons could mediate local bending in a distributed
[Figure 4. Legible labels: Data, Model; Slow, Fast.]
Figure 4: Actual (data) and simulated (model) synaptic potentials recorded from three types of interneuron. Actual synaptic potentials were recorded in response to a train of P cell impulses (stim). Simulated synaptic potentials were recorded in response to a pulse of current in the P cell, which approximates a step change in P cell firing frequency.

processing system without additional interneurons specific for different forms of local bending. Interneurons in the model also displayed the diversity in temporal responses seen in interneurons in the leech. This represents an advance over our previous model in which temporal dynamics were not represented (Lockery et al. 1989). Clearly, the training algorithm did not produce exact matches between model and actual interneurons, but this was not surprising since the identified local bending interneurons represent only a subset of the interneurons in the reflex. More exact matches could be obtained by using two pools of model interneurons, one to represent identified neurons, the other to represent unidentified neurons. Model neurons in the latter pool would constitute testable physiological predictions of the connectivity of unidentified local bending interneurons.
Acknowledgments

Supported by the Bank of America-Giannini Foundation, the Drown Foundation, and the Mathers Foundation.
References

Anastasio, T., and Robinson, D. A. 1989. Distributed parallel processing in the vestibulo-oculomotor system. Neural Comp. 1, 230-241.
Burrows, M., and Siegler, M. V. S. 1978. Graded synaptic transmission between local interneurones and motor neurones in the metathoracic ganglion of the locust. J. Physiol. 285, 231-255.
Friesen, W. O. 1985. Neuronal control of leech swimming movements: Interactions between cell 60 and previously described oscillator neurons. J. Comp. Physiol. 156, 231-242.
Granzow, B., Friesen, W. O., and Kristan, W. B., Jr. 1985. Physiological and morphological analysis of synaptic transmission between leech motor neurons. J. Neurosci. 5, 2035-2050.
Katz, B., and Miledi, R. 1967. Synaptic transmission in the absence of nerve impulses. J. Physiol. 192, 407-436.
Kristan, W. B., Jr. 1982. Sensory and motor neurons responsible for local bending in the leech. J. Exp. Biol. 96, 161-180.
Lehky, S. R., and Sejnowski, T. J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Lockery, S. R., and Kristan, W. B., Jr. 1990a. Distributed processing of sensory information in the leech. I. Input-output relations of the local bending reflex. J. Neurosci. 10, 1811-1815.
Lockery, S. R., and Kristan, W. B., Jr. 1990b. Distributed processing of sensory information in the leech. II. Identification of interneurons contributing to the local bending reflex. J. Neurosci. 10, 1815-1829.
Lockery, S. R., Wittenberg, G., Kristan, W. B., Jr., and Cottrell, W. G. 1989. Function of identified interneurons in the leech elucidated using neural networks trained by back-propagation. Nature (London) 340, 468-471.
Nagayama, T., and Hisada, M. 1987. Opposing parallel connections through crayfish local nonspiking interneurons. J. Comp. Neurol. 257, 347-358.
Ort, C. A., Kristan, W. B., Jr., and Stent, G. S. 1974. Neuronal control of swimming in the medicinal leech. II. Identification and connections of motor neurones. J. Comp. Physiol. 94, 121-154.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. 1987. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Stuart, A. E. 1970. Physiological and morphological properties of motoneurones in the central nervous system of the leech. J. Physiol. 209, 627-646.
Thompson, W. J., and Stent, G. S. 1976. Neuronal control of heartbeat in the medicinal leech. J. Comp. Physiol. 111, 309-333.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679-684.

Received 25 January 90; accepted 4 April 90.
Communicated by Idan Segev
Control of Neuronal Output by Inhibition at the Axon Initial Segment

Rodney J. Douglas
Department of Physiology, UCT Medical School, Observatory 7925, South Africa

Kevan A. C. Martin
MRC Anatomical Neuropharmacology Unit, Department of Pharmacology, Oxford University, South Parks Road, Oxford OX1 3QT, UK

We examine the effect of inhibition on the axon initial segment (AIS) by the chandelier ("axoaxonic") cells, using a simplified compartmental model of actual pyramidal neurons from cat visual cortex. We show that within generally accepted ranges, inhibition at the AIS cannot completely prevent action potential discharge: only small amounts of excitatory synaptic current can be inhibited. Moderate amounts of excitatory current always result in action potential discharge, despite AIS inhibition. Inhibition of the somadendrite by basket cells enhances the effect of AIS inhibition and vice versa. Thus the axoaxonic cells may act synergistically with basket cells: the AIS inhibition increases the threshold for action potential discharge, the basket cells then control the suprathreshold discharge.

1 Introduction
The action potential output from neurons is generated at a specialized region of the axon called the axon initial segment (AIS). The frequency of the action potential discharge is directly proportional to the amount of inward synaptic current that arrives at the AIS (Fig. 1). Thus inhibition located at the AIS would seem to offer a very effective means of controlling the output of a neuron. In the cerebral cortex, as elsewhere in the brain (e.g., hippocampus, amygdala), a specialized GABAergic neuron (Freund et al. 1983) called the chandelier cell (Szentágothai and Arbib 1974), or axoaxonic cell, makes its synapses exclusively on the AIS of pyramidal neurons (Fig. 1; Somogyi et al. 1982). Another type of putative inhibitory neuron, the basket cell, targets exclusively the soma and dendrites (Fig. 1; see Somogyi et al. 1983; Martin 1988). Many have speculated as to the chandelier cell's exact function, but all agree it probably acts to inhibit the output of pyramidal neurons (see review by Peters 1984). The basket cells, which are also GABAergic (Freund et al. 1983), are thought to act
Neural Computation 2, 283-292 (1990) © 1990 Massachusetts Institute of Technology
Rodney J. Douglas and Kevan A. C. Martin
in a more graded manner to produce the stimulus selective properties of cortical neurons (see Martin 1988). Although chandelier cells have been recorded in the visual cortex and appear to have normal receptive fields (Martin and Whitteridge, unpublished), nothing is known of their synaptic action. Since experimental investigation of their action is not yet possible, we have used computer simulations of cortical neurons to study the inhibitory effects of AIS inhibition and its interaction with somadendritic inhibition by basket cells.
2 Model
We took two pyramidal neurons that had been completely filled by intracellular injection of horseradish peroxidase and reconstructed them in 3-D (TRAKA, CeZTek, UK). The detailed structure of the dendritic arbor and soma were transformed into a simple equivalent neuron that consisted of an ellipsoidal somatic compartment, and three to four cylindrical compartments (Fig. 1) (Douglas and Martin 1990). The dimensions of a typical AIS (approximately 50 × 1 µm) were obtained from electron microscopic studies (Fairén and Valverde 1980; Sloper and Powell 1978). The effects of inhibition of these model cells were investigated using a general neuronal network simulating program (CANON, written by R. J. D.). The program permits neurons to be specified as sets of interconnected compartments, each of which can contain a variety of conductance types. The surfaces of the compartments represented the neuronal membrane. The leak resistance of this membrane was 10 kΩ cm² (a leak conductance of 0.1 mS cm⁻²), and the capacitance $C_m$ was 1 µF cm⁻². The specific intracellular resistivity of the compartments was 0.1 kΩ cm. With these values the somatic input resistances of the two model pyramids were 20.9 and 95.7 MΩ, respectively. In addition to the leak conductances the membranes of the soma, AIS, and nodes of Ranvier (first two nodes) contained active sodium and potassium conductances that mediated Hodgkin-Huxley-like action potentials. The behavior of these compartments and their interaction were computed in the usual way (see, e.g., Getting 1989; Segev et al. 1989; Traub and Llinás 1979). In this study we were concerned only with the effect on the peak action potential discharge and so we did not incorporate the conductances associated with spike adaptation. Action potentials depended only on Hodgkin-Huxley-like sodium spike conductances and delayed potassium conductances.
We found that a maximum sodium spike conductance of 100 mS cm⁻² and a maximum delayed potassium conductance of 60 mS cm⁻² were required to generate normal looking action potentials in the layer 5 pyramidal cell. These values were also then used for the layer 2 pyramid. Similar values were reported by Traub and Llinás (1979) in their simulation of hippocampal pyramidal cells.
Inhibition at the Axon Initial Segment
[Figure 1, panels (a)-(c). Legible label: L5 Pyramid.]
Figure 1: (a) Schematic that summarizes anatomical data concerning the synaptic input to pyramidal neurons (filled shape) of putative inhibitory synapses (open triangles) derived from basket and chandelier (axoaxonic) cells. Excitatory synapses (filled triangles) are shown making contact with dendritic spines. IS, initial segment. (b) Montage of actual cortical pyramidal neurons from layers 2 and 5, and the idealized simplified model cells used in simulations. Full axon collateral network not shown. Each idealized cell consists of a cylindrical axon initial segment, an ellipsoidal soma (shown here as sphere), and a number of cylinders that represent the dendritic tree. Dimensions of ellipsoid and cylinders were obtained from detailed measurements of actual neurons shown; 100 µm scale bar refers to actual neurons and vertical axis of model neurons; 50 µm scale bar refers to horizontal axis of model neuron only. (c) Current-discharge relationships of model cells shown in (b). In this and the following figures, current-discharge plots are fitted with a power function. For clarity, only the fitted line is shown in the following figures.
Average inhibitory conductances ranged between 0.1 and 1 mS cm⁻² on the soma and proximal dendritic cylinder, and between 1 and 10 mS cm⁻² on the AIS. In our simulations the AIS was represented by two cylindrical compartments (each 25 µm long × 1 µm diameter) in series, which were interposed between the soma and the onset of the myelinated axon. Since the AIS of a superficial pyramidal neuron has an area of about 150 × 10⁻⁸ cm², and receives about 50 synapses (Fairén and Valverde 1980; Sloper and Powell 1978; Somogyi et al. 1982), individual inhibitory synapses applied to the AIS compartments had maximum conductances of about 0.3 nS. The average inhibitory conductance was held constant over the period of excitatory current injection in order to model the sustained inhibition that may prevail in the visual cortex during receptive field stimulation (Douglas et al. 1988; Koch et al. 1990). Program CANON was written in Turbo Pascal. It was executed on a 25-MHz 80386/80387 RM Nimbus VX, and simulated the pyramidal neuron's response to a 200 msec current injection in 10 min.
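The per-synapse conductance quoted above follows from simple arithmetic, reproduced here as a check. The sketch assumes the AIS membrane is the lateral surface of a 50 µm × 1 µm cylinder and that the total conductance is shared equally among 50 synapses; the variable names are illustrative only.

```python
import math

# AIS geometry from the text: two 25 um compartments in series, 1 um diameter
length_um = 50.0
diameter_um = 1.0
n_synapses = 50

# Lateral surface of a cylinder; 1 um^2 = 1e-8 cm^2
area_cm2 = math.pi * diameter_um * length_um * 1e-8


def per_synapse_nS(density_mS_per_cm2):
    """Conductance per synapse (nS) for a given average AIS conductance density."""
    total_S = density_mS_per_cm2 * 1e-3 * area_cm2   # mS cm^-2 -> S
    return total_S * 1e9 / n_synapses                # S -> nS, shared equally


low_nS = per_synapse_nS(1.0)     # lower end of the 1-10 mS cm^-2 range
high_nS = per_synapse_nS(10.0)   # upper end, the maximum quoted in the text
```

With the area rounded to 150 × 10⁻⁸ cm², the upper end of the range gives 15 nS over the whole AIS, or 0.3 nS per synapse, which matches the figure in the text.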
3 Results and Discussion
The response of the two pyramidal neurons to different values of current injected in the soma is shown in Figure 1. These current-discharge curves are similar to those seen in actual cortical neurons in vitro (Connors et al. 1988). In the model they could best be fitted with a power function (solid lines). The steeper response and higher peak rate for the layer 2 pyramidal cell is due to the decreased current load offered by the smaller dendritic arbor. When inhibition is applied to the AIS (Fig. 2), the threshold current increases, that is, more current is now required before the neuron begins to discharge. The reduction in discharge achieved by AIS inhibition is shown by the difference (dotted line) in the discharge rate between control (solid line) and the inhibited case (dashed line) for the layer 2 pyramid. Surprisingly, the maximum discharge rate is hardly altered at all. The relative effectiveness of the inhibition is conveniently expressed as a percent (difference/control), shown here for the layer 5 pyramid (dotted line). This clearly shows the dramatic fall-off in the effectiveness of the chandelier cell inhibition as more excitatory current reaches the AIS. Thus the chandelier inhibition acts to increase the current threshold and is effective only for small excitatory currents that produce low discharge rates; it is relatively ineffective for currents that produce high discharge rates (75+ spikes/sec). In the model, AIS inhibition achieves its effect by sinking the current that flows into the AIS, and preventing activation of the local sodium spike conductance. The effectiveness of the AIS inhibition could be
[Figure 2 plot: two panels, (a) and (b, labeled L5 Pyr); horizontal axis: Current (nA); vertical axis ticks 0 to 200.]
Figure 2: (a) Current-discharge relationship of the model layer 2 pyramidal neuron, and (b) of the model layer 5 neuron, before (solid) and during (dashed) inhibition of the AIS. Difference between control and inhibited case shown as dotted line for layer 2 pyramid; expressed as percent inhibition for layer 5 pyramid. Inhibitory conductance, 5 mS cm⁻²; inhibitory reversal potential, -80 mV.
[Figure 3 plot: four panels, each labeled with inhibitory conductance and reversal potential (legible labels: 10 mS, -60 mV; 10 mS, -80 mV); horizontal axis: Current (nA).]
Figure 3: (a-d) Effect of AIS inhibitory conductance and inhibitory reversal potential on current-discharge relationship of the model layer 2 pyramidal neuron, dashed; percent inhibition, dotted. Inhibitory conductance and reversal potential are given above each of the four cases shown.
increased by having a larger inhibitory conductance or a reversal potential for the inhibitory synapse that is much more negative than the resting membrane potential (Fig. 3). But even when these two factors were combined (Fig. 3d) action potential discharge could not be prevented. In this case (Fig. 3d) the current threshold was about 0.6 nA, which is well within the operating range of these neurons. The maximum inhibitory conductance in the AIS is a small fraction (≤ 10%) of the sodium spike conductance. So, although the inhibitory conductance may prevent activation of the spike current, it will not have much effect on the trajectory of the spike once it has been initiated. The AIS spike depolarizes the adjacent active soma to its threshold. The somatic spike, in turn, drives the depolarization of the relatively passive dendritic arbor. The somatic and dendritic trajectories lag behind the AIS
spike, and so are able to contribute excitatory current to the AIS during its redepolarization phase. The rather surprising result that postsynaptic inhibition is relatively ineffective for strong excitatory inputs arises as a consequence of the saturation of the current discharge curve: if sufficient current can be delivered to the AIS, the neuron will always be able to achieve its peak discharge rate, which is determined by the kinetics of the spike conductances, the membrane time constant, and the neuronal input resistance. In the case where the inhibitory synaptic conductances are small and act to hyperpolarize the neuron, the membrane time constant and neuronal input resistance are hardly altered. The current-discharge curve then shifts to the right on the X-axis (Fig. 3a), that is, the threshold increases, but the shape of the curve remains the same as the control. In the case of large inhibitory conductances, the neuronal input resistance is reduced and the membrane time constant shortens. The inhibitory conductance shunts the excitatory current and this has the effect of raising the threshold. However, if the remaining excitatory current is sufficient to drive the membrane to threshold, the shorter time constant permits the membrane to recharge more quickly after each action potential. This means that the neuron can fire faster for a given excitatory current, offsetting some of the effects of inhibition. Hence the current-discharge curve becomes steeper, but achieves approximately the same maximum discharge for the same current input as the control (Fig. 3b). If the axoaxonic inputs alone were required to inhibit the neuron completely via the mechanism modeled here, they could approximate complete blockade only by forcing the threshold to the upper end of the operational range of excitatory current (say 1.5-2 nA). To produce this threshold would require an inhibitory conductance of the same order as the sodium spike conductance. 
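The argument above (inhibition raises the current threshold, a shunting conductance shortens the time constant, and the peak rate is left almost unchanged) can be illustrated with a much simpler neuron than the compartmental model used here. The sketch below swaps in a single-compartment leaky integrate-and-fire neuron with an absolute refractory period. All parameter values and function names are hypothetical and are not taken from program CANON.

```python
import math

# Hypothetical point-neuron parameters (assumptions, not from the paper).
C_PF = 200.0         # membrane capacitance, pF
G_LEAK_NS = 10.0     # leak conductance, nS
E_LEAK_MV = -65.0    # resting potential, mV
V_THRESH_MV = -50.0  # spike threshold, mV
V_RESET_MV = -65.0   # reset potential after a spike, mV
T_REF_MS = 5.0       # absolute refractory period, ms


def rheobase_pA(g_inh_nS=0.0, e_inh_mV=-80.0):
    """Smallest steady current that brings the membrane to threshold."""
    return (G_LEAK_NS * (V_THRESH_MV - E_LEAK_MV)
            + g_inh_nS * (V_THRESH_MV - e_inh_mV))


def firing_rate_hz(i_pA, g_inh_nS=0.0, e_inh_mV=-80.0):
    """Steady discharge rate for a constant current and constant inhibition."""
    g_total = G_LEAK_NS + g_inh_nS
    tau_ms = C_PF / g_total  # the shunt shortens the time constant
    v_inf = (G_LEAK_NS * E_LEAK_MV + g_inh_nS * e_inh_mV + i_pA) / g_total
    if v_inf <= V_THRESH_MV:
        return 0.0
    period_ms = T_REF_MS + tau_ms * math.log(
        (v_inf - V_RESET_MV) / (v_inf - V_THRESH_MV))
    return 1000.0 / period_ms


def percent_inhibition(i_pA, g_inh_nS, e_inh_mV):
    """Percent inhibition, defined as 100 * (control - inhibited) / control."""
    control = firing_rate_hz(i_pA)
    if control == 0.0:
        return 0.0
    inhibited = firing_rate_hz(i_pA, g_inh_nS, e_inh_mV)
    return 100.0 * (control - inhibited) / control


if __name__ == "__main__":
    for current in (200.0, 400.0, 1000.0, 2000.0):
        print(current, "pA:", percent_inhibition(current, 20.0, -60.0), "% inhibition")
```

In this simplified setting the inhibited curve starts at a higher threshold current, yet both curves approach the same ceiling set by the refractory period, so the percent inhibition falls off as the injected current grows. The sketch reproduces the qualitative argument only and makes no claim about the quantitative results of the compartmental simulations.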
If excitatory input did then exceed the threshold, the neuron would immediately respond with a high discharge rate: a catastrophic failure of inhibition. Such high conductances have not been seen experimentally. A volley of excitation, which would activate nonspecifically all inhibitory neurons, elicits inhibitory conductances in cortical pyramidal cells in vivo that are less than 20% of their input conductance (see Martin 1988). Basket cell axons terminate on the soma and proximal dendrites of pyramidal cells (Somogyi et al. 1983; see Martin 1988). We modeled their inputs by applying a uniform conductance change to the soma and the proximal dendritic cylinder. The effect of basket cell inhibition was similar to axoaxonic inhibition (Fig. 4). The increase in threshold and steepened slope of the current-discharge curve seen in the model have also been observed in vitro when the GABAB agonist baclofen was applied to cortical neurons (Connors et al. 1988). We simulated the combined action of AIS and basket inhibition. When the basket inhibition is relatively hyperpolarizing, the AIS and basket mechanisms sum together (Fig. 4c). When the basket inhibition is relatively shunting, the two mechanisms
[Figure 4 plot: four panels; column labels 0.2 mS cm⁻², -80 mV and 0.5 mS cm⁻², -60 mV; legible trace label: s+i; horizontal axis: Current (nA).]
Figure 4: Interaction of chandelier and basket cell inhibition and their effects on the current-discharge relationship of the model layer 2 pyramidal neuron. (a, b) Current-discharge curve. (c, d) Percent inhibition. c, control; i, AIS inhibition alone; s, somadendrite inhibition; s+i, combined somadendrite and AIS inhibition; same conventions for all traces. AIS conductance and reversal potential: 5 mS cm⁻², -80 mV. Somadendrite inhibitory conductances and reversal potentials: 0.2 mS cm⁻², -80 mV for (a) and (c); 0.5 mS cm⁻², -60 mV for (b) and (d).
exhibit some synergy (Fig. 4d). This combination of effects would permit the basket cell inhibition to conserve its action for just that part of the signal that exceeds the noise threshold imposed by the AIS inhibition. However, the general failure of both basket and axoaxonic cells to inhibit strong excitation reinforces our previous proposal that the cortical inhibitory circuitry is designed to control small amounts of excitation (Douglas et al. 1989).
4 Conclusion
For reasonable conductances and thresholds within the operating range, the AIS inhibition cannot totally prevent action potential discharge. It does offer a mechanism for enhancing the signal-to-noise ratio in that small signals are blocked by AIS inhibition, while larger signals pass relatively unaffected. Control of such a global property as threshold is consistent with the strategic location of the chandelier cell input to the AIS. This allows the chandelier and basket cells to have complementary roles: the AIS inhibition could control the signal threshold, while the somatodendritic inhibition could act more selectively to shape the suprathreshold signal.
Acknowledgments
We thank John Anderson for technical assistance, Hermann Boeddinghaus for performing some preliminary computations, and the E. P. Abrahams Trust and the Wellcome Trust for additional support. R. J. D. acknowledges the support of the SA MRC.
References
Connors, B. W., Malenka, R. C., and Silva, L. R. 1988. Two inhibitory postsynaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406, 443-468.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1988. Selective responses of visual cortical cells do not depend on shunting inhibition. Nature (London) 332, 642-644.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1989. A canonical microcircuit for neocortex. Neural Comp. 1, 479437.
Douglas, R. J., and Martin, K. A. C. 1990. A functional microcircuit for cat visual cortex. Submitted.
Fairen, A., and Valverde, F. 1980. A specialized type of neuron in the visual cortex of cat: A Golgi and electron microscope study of chandelier cells. J. Comp. Neurol. 94, 761-779.
Freund, T. F., Martin, K. A. C., Smith, A. D., and Somogyi, P. 1983. Glutamate decarboxylase-immunoreactive terminals of Golgi-impregnated axoaxonic cells and of presumed basket cells in synaptic contact with pyramidal neurons of the cat's visual cortex. J. Comp. Neurol. 221, 263-278.
Getting, A. P. 1989. Reconstruction of small neural networks. In Methods in Neuronal Modelling, C. Koch and I. Segev, eds., pp. 171-194. Bradford Books, Cambridge, MA.
Koch, C., Douglas, R. J., and Wehmeier, U. 1990. Visibility of synaptically induced conductance changes: Theory and simulations of anatomically characterized cortical pyramidal cells. J. Neurosci. 10, 1728-1744.
Martin, K. A. C. 1988. From single cells to simple circuits in the cerebral cortex. Quart. J. Exp. Physiol. 73, 637-702.
Peters, A. 1984. Chandelier cells. In Cerebral Cortex. Vol. 1. Cellular Components of the Cerebral Cortex, pp. 361-380. Plenum Press, New York.
Segev, I., Fleshman, J. W., and Burke, R. E. 1989. Compartmental models of complex neurons. In Methods in Neuronal Modelling, C. Koch and I. Segev, eds., pp. 171-194. Bradford Books, Cambridge, MA.
Sloper, J. J., and Powell, T. P. S. 1978. A study of the axon initial segment and proximal axon of neurons in the primate motor and somatic sensory cortices. Phil. Trans. R. Soc. London Ser. B 285, 173-197.
Somogyi, P., Freund, T. F., and Cowey, A. 1982. The axo-axonic interneuron in the cerebral cortex of the rat, cat and monkey. Neuroscience 7, 2577-2608.
Somogyi, P., Kisvarday, Z. F., Martin, K. A. C., and Whitteridge, D. 1983. Synaptic connections of morphologically identified and physiologically characterized large basket cells in the striate cortex of the cat. Neuroscience 10, 261-294.
Szentágothai, J., and Arbib, M. B. 1974. Conceptual models of neural organization. Neurosci. Res. Prog. Bull. 12, 306-510.
Traub, R. D., and Llinás, R. 1979. Hippocampal pyramidal cells: Significance of dendritic ionic conductances for neuronal function and epileptogenesis. J. Neurophysiol. 42, 476-496.
Received 5 February 90; accepted 10 June 90.
Communicated by Christof Koch
Feature Linking via Synchronization among Distributed Assemblies: Simulations of Results from Cat Visual Cortex
R. Eckhorn, H. J. Reitboeck, M. Arndt, P. Dicke
Biophysics Department, Philipps-University, Renthof 7, D-3550 Marburg, Federal Republic of Germany
We recently discovered stimulus-specific interactions between cell assemblies in cat primary visual cortex that could constitute a global linking principle for feature associations in sensory and motor systems: stimulus-induced oscillatory activities (35-80 Hz) in remote cell assemblies of the same and of different visual cortex areas mutually synchronize, if common stimulus features drive the assemblies simultaneously. Based on our neurophysiological findings we simulated feature linking via synchronizations in networks of model neurons. The networks consisted of two one-dimensional layers of neurons, coupled in a forward direction via feeding connections and in lateral and backward directions via modulatory linking connections. The models' performance is demonstrated in examples of region linking with spatiotemporally varying inputs, where the rhythmic activities in response to an input, that initially are uncorrelated, become phase locked. We propose that synchronization is a general principle for the coding of associations in and among sensory systems and that at least two distinct types of synchronization do exist: stimulus-forced (event-locked) synchronizations support "crude instantaneous" associations and stimulus-induced (oscillatory) synchronizations support more complex iterative association processes. In order to bring neural linking mechanisms into correspondence with perceptual feature linking, we introduce the concept of the linking field (association field) of a local assembly of visual neurons. The linking field extends the concept of the invariant receptive field (RF) of single neurons to the flexible association of RFs in neural assemblies.
Neural Computation 2, 293-307 (1990) © 1990 Massachusetts Institute of Technology
1 Experimental Results from Cat Visual Cortex
Stimulus-related oscillations of neural activities were recently discovered in the primary visual cortex of monkey (Freeman and van Dijk 1987) and cat (Gray and Singer 1987; Eckhorn et al. 1988). These neurophysiological results, together with theoretical proposals (e.g., Grossberg 1983; Reitboeck 1983, 1989; von der Malsburg 1985; Damasio 1989), support the hypothesis that synchronization might be a mechanism subserving the transitory linking of local visual features into coherent global percepts (Eckhorn et al. 1988, 1989a,b; Eckhorn 1990; von der Malsburg and Singer 1988; Gray et al. 1989). In cat cortical areas 17 and 18 the stimulus-related tuning of the oscillatory components (35-80 Hz) of local mass activities [multiple unit spike activities (MUA) and local slow wave field potentials (LFPs)] often correlates with the tuning properties of single cells at the same recording position (Gray and Singer 1987; Eckhorn et al. 1988, 1989a,b). Large oscillation amplitudes of LFPs and MUAs are preferentially induced by sustained binocular stimuli that extend far beyond the limits of the classical single cell receptive fields (Eckhorn et al. 1988; Eckhorn 1990). Stimulus-induced oscillations of LFPs and MUA appear as oscillation spindles of about 80-250 msec duration, separated by intervals of stochastic activity, and their response latencies are significantly longer than the primary components of the stimulus-locked visually evoked cortical potentials (VECPs; Fig. 1A). These fast components of the VECPs survive event-locked averaging while the oscillation spindles are averaged out due to the variability of their response latencies and due to their variable frequencies (Fig. 1B). Furthermore, it has been shown that signal oscillations are generated in local cortical assemblies (Gray and Singer 1987; Eckhorn et al. 1988; Chagnac-Amitai and Connors 1989) and that assembly oscillations at spatially remote positions of the same cortex area can be synchronized by simultaneous stimulation of the assemblies' receptive fields (Eckhorn et al. 1988, 1989a,b; Eckhorn 1990; Gray et al. 1989).
We found, in addition, stimulus-specific synchronizations among assemblies in different cortex areas if the assemblies code common visual features (e.g., similar orientation or receptive field position; Eckhorn et al. 1988, 1989a,b; Eckhorn 1990).
2 Neural Network Model
Some of the synchronization effects we had observed in cat visual cortex were studied by us in computer simulations of neural network models. Our model neurons (Eckhorn et al. 1989a) have specific properties that are essential for the emergence of cooperative signal interactions in neural assemblies. The model neuron has two functionally different types of synapses: the feeding synapses are junctions in the main (directly stimulus-driven) signal path, whereas the linking synapses receive auxiliary (synchronization) signals that "modulate" the feeding inputs (Fig. 2A; see also Appendix). The model neurons have dynamic "synapses" that are represented by leaky integrators (Fig. 2A). During a synaptic input pulse the integrator
[Figure 1 plot: panel title "Stimulus-Specific Synchronizations in Cat Visual Cortex"; panels A and B; traces from A17 and A18; horizontal axis: time (s), -0.2 to 1.0; marker: start of stimulus movement.]
Figure 1: Two types of stimulus-related synchronizations among visual cortical areas A17 and A18 of the cat: (1) primary, visually evoked (event-locked) field potentials (VECP) and (2) stimulus-induced oscillatory potentials. (A) Single-sweep local field potential (LFP) responses (bandpass: 10-120 Hz). (B) Averages of 18 LFP responses to identical stimuli. Note that the oscillatory components (480-900 msec poststimulus) are averaged out, while averaging pronounces the primary evoked potentials (50-120 msec). Binocular stimulation: drifting grating; 0.7 cycles/deg swept at 4 deg/sec; movement starts at t = 0 and continues for 3.4 sec. Simultaneous recordings with linear array of seven microelectrodes from layers II/III; the RFs of A17 and A18 neurons overlapped partially.
is charged and its output amplitude rises steeply. This is followed by an exponential decay, determined by the leakage time constant. The decaying signal permits prolonged "postsynaptic" interactions, such as temporal and spatial integration and amplitude modulation (see below). The spike encoder with its adaptive properties is also realized via a leaky integrator, in combination with a differential amplitude discriminator and spike former. The amplitude discriminator triggers the spike former when its input, the "membrane voltage" Um(t), exceeds the variable threshold Θ(t). An output spike of the neuron immediately charges the leaky integrator to such a high value of Θ(t) that Um(t) cannot exceed Θ(t) during and immediately after the generation of an output spike. This transitory elevation of Θ(t) produces absolute and relative "refractory periods" in the spike generation. The spike encoder responds to a positive jump of Um with a sudden increase in its discharge rate. After the first burst of spikes, subsequent spikes appear at increasingly longer intervals, since the burst charged the threshold integrator to a high value
[Figure 2 diagram: panels (A) to (C); legible labels: output, layer 1.]
Figure 2: The model neuron and network. (A) Circuit diagram. Linking and feeding inputs with leaky integrators on a single "dendrite" interact multiplicatively. Signals from different "dendrites" are summed and fed to the "spike encoder" in the "soma." (B) Symbol for neuron in A. (C) Neural network as used in simulations Figures 3B,D and 4A,B. Thick lines: feeding connections; thin lines: linking connections. Full output connectivity is shown only for one layer 1 and one layer 2 neuron (hatched symbols).
of Θ(t). A negative jump of Um, especially after a burst of spikes, leads to an abrupt pause in the discharge, until the output of the threshold integrator has fallen to a suitably low value relative to Um. For the present simulations this "temporal contrast enhancement" is a desirable property of the spike encoder, because it supports the formation of "isolated bursts." Such bursts are efficient temporal "markers" for fast and strong synchronizations among connected neurons. In special types of real neurons the
burst-supporting properties might already be due to specific characteristics of their dendritic membrane (for a discussion see Chagnac-Amitai and Connors 1989). Another desirable effect of the spike encoder's adaptation is the "automatic scaling" of the output spike rate to submaximal values, even at maximal input amplitudes of Um (maximal rates occur only transiently). The strong negative feedback in the spike encoder is the main stabilizing factor in our neural networks. It prevents instability and saturation over a wide range of variations in the synaptic parameters, even when synaptic couplings are exclusively positive. Our model neuron has an important characteristic that assures linking between neurons without major degradation of their local coding properties (the local coding of a model neuron corresponds to receptive field properties of a real visual neuron). This preservation of local coding properties is due to the modulatory action the linking inputs exert on the feeding inputs: the integrated signals from the linking inputs, together with a constant offset term (+1), interact multiplicatively with the integrated signals from the feeding inputs. Without linking input signals, the output of the "neural multiplier" is identical to the (integrated) feeding signals (multiplication by +1). This interaction ensures fast and relatively unaffected signal transfer from the feeding synapses, which is an important requirement for fast "stimulus-forced synchronizations" and for the preservation of the "RF properties" (discussed below). With nonzero activity at the linking inputs, the integrated signal from the feeding inputs is modulated via the multiplier, and the threshold discriminator will now switch at different times, thereby shifting the phase of the output pulses.
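The model neuron described above can be sketched in discrete time as two leaky integrators (feeding and linking), a multiplier with the +1 offset, and a spike encoder whose threshold is itself a leaky integrator charged by each output spike. The class below is an illustrative reconstruction only: the class name, the update order, and all parameter values are assumptions, and the parameters actually used in the simulations are those of Table 1 in the Appendix.

```python
import math


class LinkingNeuron:
    """Discrete-time sketch of a model neuron with feeding and linking inputs."""

    def __init__(self, tau_feed=10.0, tau_link=1.0, tau_thresh=5.0,
                 theta_offset=0.5, theta_jump=5.0):
        # All values are hypothetical, in units of simulation time steps.
        self.decay_feed = math.exp(-1.0 / tau_feed)
        self.decay_link = math.exp(-1.0 / tau_link)
        self.decay_thresh = math.exp(-1.0 / tau_thresh)
        self.theta_offset = theta_offset
        self.theta_jump = theta_jump
        self.feed = 0.0       # output of the feeding leaky integrator
        self.link = 0.0       # output of the linking leaky integrator
        self.theta_dyn = 0.0  # output of the threshold leaky integrator

    def step(self, feeding_input, linking_input=0.0):
        """Advance one time step; return (spike, membrane_voltage)."""
        # Leaky integrators: exponential decay plus the new input.
        self.feed = self.feed * self.decay_feed + feeding_input
        self.link = self.link * self.decay_link + linking_input
        # Linking modulates feeding multiplicatively, with the +1 offset.
        u_m = self.feed * (1.0 + self.link)
        theta = self.theta_offset + self.theta_dyn
        spike = u_m > theta
        # The threshold integrator decays and is charged by an output spike.
        self.theta_dyn *= self.decay_thresh
        if spike:
            self.theta_dyn += self.theta_jump
        return spike, u_m


if __name__ == "__main__":
    neuron = LinkingNeuron()
    spikes = [neuron.step(0.3)[0] for _ in range(100)]
    print("spike times:", [t for t, s in enumerate(spikes) if s])
```

With the linking input at zero, the membrane voltage equals the integrated feeding signal, so the local coding is untouched. A nonzero linking input raises the product and lets the discriminator switch earlier, which is how the phase of the output pulses is shifted in this sketch.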
In network models of other groups that also use synchronization for feature linking, a possible degradation of the neuron's local coding properties by certain types of linking networks has not yet been considered (e.g., Kammen et al. 1989; Sporns et al. 1989). The modulatory synapses we used in our model neuron are neurophysiologically plausible: modulation in real neurons might be achieved by changes in the dendritic membrane potential due to specific types of synapses that act locally (or via electrotonic spread, also distantly) on voltage-dependent subsynaptic channels, thereby modulating the synapses' postsynaptic efficacy. In neocortical circuitry, it seems probable that mainly a subgroup of special "bursting neurons" is mutually coupled via linking connections (Chagnac-Amitai and Connors 1989) and only such a network has been modeled by us. The neural networks used in the present simulations consist of two one-dimensional layers of neurons (Fig. 2C). In order to simplify the simulations, each neuron of layer 1 has only one feeding input to which the "visual" input is fed as an analog signal. Prior to the simulations of dynamic network interactions, the amplitudes of these signals were derived from stimulus intensity distributions by application of local filters with appropriate spatiotemporal characteristics. Noise was added to the
feeding signals in order to mimic irregularities due to the superposition of spike inputs to many similar feeding synapses, and to simulate internal stochastic components. The local coding of layer 2 neurons is determined by the convergence of feeding inputs from four neighboring layer 1 neurons. Convergence causes enlarged "sensitive areas" due to the superposition of partially overlapping sensitive areas of layer 1 neurons (corresponding to a receptive field enlargement due to convergence from real visual neurons with smaller RFs). Within each layer, local coding properties of neurons are assumed to be identical, except for the positions of their sensitive areas that were chosen to be equidistantly aligned. In the present simulations, each neuron in both layers is connected to four neighbors at each side. The (positive) coupling strength of the linking synapses linearly declines with lateral distance. For simplicity, forward feeding and backward linking connections were chosen to be of equal strength. The model parameters (such as time constants, coupling strengths, and number of convergent and divergent connections) can be varied over a relatively wide range without changing the basic phaselock properties of the network (for simulation parameters see Table 1 in Appendix). Even the addition of spike propagation delays (proportional to distance) does not crucially deteriorate phase linking, as long as the delays (in the region of cells to be linked) are shorter than the time constant of the feeding synapses. The dynamic response of the network is shown in the following simulations of region linking (Figs. 3 and 4). In the first example, two stimulus "regions" (patches) of enhanced intensity are applied to the feeding inputs of layer 1 (Fig. 3).
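The connectivity just described (lateral linking to four neighbors on each side with linearly declining strength, and convergence of four neighboring layer 1 neurons onto each layer 2 neuron) can be written down as follows. This is a sketch under assumptions: the exact slope of the linear decline, the peak strength, and the spacing (stride) between the input windows of adjacent layer 2 neurons are not given in the text and are hypothetical here, as are the function names.

```python
def lateral_linking_weights(n_neurons, reach=4, peak=1.0):
    """weights[i][j]: linking strength from neuron j to neuron i in one layer.

    Strength falls linearly from `peak` for nearest neighbors to peak/reach
    at distance `reach` (one possible linear profile, assumed here).
    """
    weights = [[0.0] * n_neurons for _ in range(n_neurons)]
    for i in range(n_neurons):
        for j in range(max(0, i - reach), min(n_neurons, i + reach + 1)):
            distance = abs(i - j)
            if distance == 0:
                continue  # no self-connection in this sketch
            weights[i][j] = peak * (reach + 1 - distance) / reach
    return weights


def feeding_fan_in(n_layer1, fan_in=4, stride=2):
    """For each layer 2 neuron, the indices of the layer 1 neurons feeding it.

    The stride between adjacent input windows is an assumption.
    """
    return [list(range(start, start + fan_in))
            for start in range(0, n_layer1 - fan_in + 1, stride)]


if __name__ == "__main__":
    w = lateral_linking_weights(10)
    print("linking weights onto neuron 5:", w[5])
    print("feeding fan-in:", feeding_fan_in(10))
```

Neurons near the ends of the layer simply have fewer neighbors in this sketch; how the boundaries were treated in the original simulations is not stated.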
To demonstrate the model's robustness in generating stimulus-induced synchrony we introduced two impediments: (1) the stimulus amplitudes at the patches differed by a factor of two, which causes the burst rates of the driven neurons to differ appreciably, and (2) the stimuli were not switched on simultaneously, but in temporal succession. "Stimulus-forced" synchronizations due to phase locking on-transients were thus prevented. In the simulations in Figure 3A the feedback linking connections from layer 2 to layer 1 are blocked. Rhythmic bursts that are synchronized within each stimulus patch region are generated in both layers but the burst frequencies at the two stimulus patch positions differ markedly. Calculation of the cross-correlation between the activities at the centers of the stimulus patches (signals a and b, respectively, in Fig. 3) shows a flat correlogram, indicating uncorrelated activities (Fig. 3C, center). Synchronization, however, quickly develops if the layer 2 to 1 feedback linking connections are activated (Fig. 3B). The "interpatch" burst synchronization is quantified in the cross-correlogram (Fig. 3D) (for more details see legend). "Interpatch" synchronization occurs mainly because layer 2 neurons are affected by and affect large regions in layer 1: layer 2 neurons receive convergent inputs from four layer 1 neurons, they have interlayer connections to four neighbors at
Figure 3: Neural network simulation of stimulus-induced oscillatory synchronizations among two "stimulated patches" (a and b) of the lower (one-dimensional) layer 1 (50 neurons) mediated via feedback from the higher layer 2 (23 neurons). The neural network in Figure 2C is used. Occurrence times of "action potentials" are marked by dashes. Analog noise was continuously added to all layer 1 feeding inputs (statistically independent for all inputs). Black horizontal bars indicate stimulus duration for "patches" a and b, respectively. (A) Feedback linking connections from layer 2 to 1 are blocked. Note the "intrapatch" burst synchronization and the "interpatch" independence in layer 1 (stimulus subregions a and b). (B) Connectivity as in Figure 2C, including feedback. Note the "interpatch" burst synchronizations in layers 1 and 2, and the precise separation of synchronized activity patches in layer 1, while those in layer 2 bridge ("fill-in") the stimulus gap. (C) Correlograms with blocked feedback linking connections from layer 2 to 1: auto- and cross-correlograms of the spike activities in the centers of the "stimulus patches" a and b (of A), derived from simulation runs that were 20 times longer than the shown duration. The flat cross-correlogram indicates independence of the two rhythmic burst activities. (D) Correlograms with active feedback. Oscillatory cross-correlogram indicates synchronization among bursts in "patches" a and b. A time bin was equivalent to 1/600 of the plots' length in Figs. 3 and 4.
Figure 4: Facilitation and synchronization of spatially separate stimulus-evoked activities via lateral and feedback linking connections. (Network as in Figs. 2C and 3B,D). (A) Two stimulus impulses were presented in short succession at layer 1 feeding inputs in the positions of "patch 1," while a single impulse only was given at "patch 2" inputs. Note the synchronized bursts of spikes in both layers at the former "patch 2" position although the second stimulus was applied at "patch 1" position only. This effect of spatiotemporal "filling-in" is due to the facilitatory action of signals supplied via the lateral and feedback linking connections on the "response tails" at the feeding inputs' leaky integrators. (B) Phase lock among synchronized bursts of spikes in two subregions activated by a pair of moving "stimulus" subfields. Note the "filling-in" across the "stimulus gap" via layer 2 neurons.
each side, and they project back to four layer 1 neurons. Layer 2 neurons can thus span (fill in) the gap of nonstimulated neurons in layer 1. A further demonstration of interesting capabilities of our model network is given in the simulations in Figure 4. In Figure 4A, synchronization of the activities in two separate stimulus patches is mainly forced by applying two strong input impulses simultaneously. Shortly after the initial stimulation, a second impulse is given at patch 1 position only. Synchronized activity appears, however, also at patch 2. This effect of "filling-in" across spatial and temporal gaps is due to the facilitatory action of signals supplied via the lateral and feedback linking connections, interacting with the temporal "response tails" of the feeding inputs' leaky integrators. The simulation in Figure 4B demonstrates linking among separate regions when the stimulus patches move in parallel. Again, synchronization of the activated subregions is achieved via lateral and feedback linking connections of layer 2. Although a high level of noise was continuously applied to the feeding inputs of layer 1, the synchronized moving "activity patches" maintain almost constant extent and clear separation. In such dynamic spatiotemporal input situations, neurons at the region boundaries must either join or leave the synchronized assembly. Region linking is thus accomplished by our model, even though the synchronized "patches" are moving, and are separated by a gap of spontaneously active neurons. Such synchronization from a higher to a lower level might also be responsible for the stimulus-induced synchronizations we observed in cat visual cortex (Eckhorn et al. 1988). Stimulation with coarse gratings not only induced synchronized oscillations within the stimulated region of a single stripe in area 17, but also induced synchronized activities among A17 positions that were stimulated by other, neighboring stripes. Parallel recordings from A17 and A18 showed that assemblies in A17 and A18 could be synchronized if the A18 neurons had overlapping receptive fields (RFs) and common coding properties with the synchronized A17 assemblies (Eckhorn et al. 1988; Eckhorn 1990).
3 Stimulus-Forced and Stimulus-Induced Synchronization
We proposed that two different types of synchronization support perceptual feature linking in visual cortex: stimulus-forced and stimulus-induced synchronizations (Eckhorn et al. 1989b). Stimulus-forced synchronizations are directly driven by fast and sufficiently strong stimulus transients, that is, they are generally not oscillatory, but follow the time course of the stimulus transients. Oscillatory stimulus-locked "wavelets," however, are evoked by strong transient stimuli (like whole field flashes). Such "wavelets" appear in retinal, geniculate, and even in cortical neural assemblies (for a review of "visual wavelets" see Basar 1988 and Kraut et al. 1985). In a recent neural network simulation of our group we showed that near "neural synchrony," initially generated by a stimulus transient that was applied to several neighboring feeding inputs in parallel, is enhanced via lateral and/or feedback linking connections of the same type as those mediating the stimulus-induced oscillatory synchronizations (Pabst et al. 1989; see also Fig. 4A). Stimulus-forced synchronizations are likely to play a major role not only at primary processing stages of the visual system (retina and lateral geniculate body), but in all areas of the visual cortex. Stimulus-forced synchronizations have been observed in the visual cortex since the early beginnings of cortical electrophysiology: their overall manifestations are the averaged visually evoked potentials. Detailed results about such stimulus-locked synchrony of intracortical signals were recently obtained with current source density analyses (CSD; e.g., Mitzdorf 1987). Most interesting with respect to the "linking concept" is the fact that
synchronized CSD responses are evoked not only via direct RF-activation at the recording positions, but also by large-area stimuli that are far away from the RFs of the neurons at the recording position. Another important finding is that a characteristic activation sequence of synchronized CSD signals traverses the cortical layers. Both effects are largely independent of specific stimulus features (Mitzdorf 1987). We suspect that this type of event-locked synchrony is mediated via linking interactions similar to those in our models. Stimulus-induced synchronizations (recorded as oscillatory mass activities) are assumed to be produced internally via a self-organizing process among stimulus-driven local "oscillators" that are mutually connected. The results of our simulations support this assumption. Feeding signals with moderate transients cause our model neurons to respond initially with rather irregular repetitive discharges, uncorrelated among different neurons, but subsequently the neurons mutually synchronize their activities via the linking connections. In our simulations, the same linking network supports phase locking of both stimulus-forced and stimulus-induced synchronizations. It is therefore reasonable to assume that this is also the case in visual cortex. We have experimental evidence that stimulus-forced (event-locked) and stimulus-induced synchronization are complementary mechanisms. In our studies of single epoch responses in cat visual cortex we found that oscillatory activities were never present during and immediately after strong stimulus-locked visual evoked cortical potentials (VECPs). From this observation we conclude that evoked potentials of sufficient amplitudes can suppress stimulus-induced oscillations in cortical assemblies.
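Whether two spike trains are uncorrelated or mutually synchronized, as in the simulated correlograms of Figure 3C and D, can be assessed with a cross-correlogram. The following is a minimal sketch for binned spike trains represented as lists of 0 and 1; the function name and the absence of any normalization are assumptions and do not describe the analysis procedure used for the figures.

```python
def cross_correlogram(train_a, train_b, max_lag):
    """Count coincidences of a spike in train_a at bin t with a spike in
    train_b at bin t + lag, for every lag from -max_lag to +max_lag."""
    n = min(len(train_a), len(train_b))
    counts = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0
        for t in range(n):
            u = t + lag
            if 0 <= u < n and train_a[t] and train_b[u]:
                total += 1
        counts[lag] = total
    return counts


if __name__ == "__main__":
    # Two rhythmic trains with the same period; b lags a by one bin.
    a = [1, 0, 0, 0] * 5
    b = [0, 1, 0, 0] * 5
    print(cross_correlogram(a, b, max_lag=4))
```

For two phase-locked rhythmic trains the counts peak periodically as a function of lag, giving an oscillatory correlogram; for independent trains the counts are roughly equal at all lags, giving a flat one.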
We assume that event-locked (stimulus-forced) synchronizations serve to define crude instantaneous "preattentive percepts," and that stimulus-induced oscillatory synchronizations mainly support the formation of more complex, "attentive percepts" that require iterative interactions among different processing levels and memory (see also the models of Grossberg 1983; Damasio 1989; Sporns et al. 1989). The latter statement is supported by the finding that during states of focused attention oscillatory signals in the 35-60 Hz range were observed in sensory and motor cortices of humans and of a variety of mammals (review: Sheer 1989). In what visual situations might either of the two types of synchronization support perceptual feature linking? During sudden changes in the visual field, stimulus-forced synchronizations are generated that dominate the activities of visual neurons, thereby suppressing spontaneous and visually induced oscillations. During periods of fixation or slow eye drifts (and perhaps also during mental visualization and "attentive drive"), retinal signals do not provide strong synchronous transients. In such "temporally ill-defined" visual situations, cortical cell assemblies themselves can generate the synchronous signal transients necessary for the generation of temporal linking, by generating sequences of synchronized bursts at 35-80 Hz. Sensory systems that predominantly receive well-timed signals, like the auditory system, might be dominated by stimulus-forced synchronization, whereas those with sluggish temporal responses of their receptors, like the olfactory system, would employ mechanisms for the self-generation of repetitive synchronized activities in order to form temporal association codes (for oscillations and coding in the olfactory system see, e.g., Freeman and Skarda 1985).
4 Linking Fields
In order to bring neuronal mechanisms of feature linking into correspondence with perceptual functions, we introduce the concept of the linking field (association field) of a local neural assembly. The linking field extends the concept of single cell receptive fields to neural ensemble coding. We define the linking field of a local assembly of visual neurons to be that area in visual space where appropriate local stimulus features can initiate (stimulus-forced or stimulus-induced) synchronizations in the activities of that assembly. During synchronized states linking fields are "constructed" by the constituent neurons, according to their receptive field and network properties. Linking fields of local assemblies in cat primary visual cortex (A17 and A18), as derived during periods of stimulus-induced oscillations, were generally found to be much broader and less stimulus specific than the receptive fields of the assemblies' neurons (and even less specific than the superposition of their RFs). The wide extent of the linking fields in visual space is most certainly due to the large extent of the divergent linking connections in visual cortex. Such broad connectivities within and among cortical areas, the so-called "association fiber systems," have been known for many years from anatomical studies. They were often designated as a "mismatch" of "excitatory" connections (e.g., Salin et al. 1989), since they extend far beyond the range required for the "composition" of the respective receptive fields.
Acknowledgments We thank our colleagues Dr. R. Bauer, H. Baumgarten, M. Brosch, W. Jordan, W. Kruse, M. Munk, and T. Schanze for their help in experiments and data processing, and our technicians W. Lenz, U. Thomas, and J. H. Wagner for their expert support. This project was sponsored by Deutsche Forschungsgemeinschaft Ba 636/4-1,2, Re 547/2-1, and Ec 53/4-1 and by Stiftung Volkswagenwerk I/35695 and I/64605.
Appendix A Mathematical Description of the Model-Neurons
The membrane voltage $U_{m,k}(t)$ of the kth neuron is given by

$$U_{m,k}(t) = F_k(t) \cdot [1 + L_k(t)] \tag{A.1}$$

where $F_k(t)$ is the contribution via the feeding-inputs and $L_k(t)$ is the contribution via the linking-inputs. $F_k(t)$ is calculated from

$$F_k(t) = \left[ \sum_{i=1}^{N} w^{f}_{ki} Y_i(t) + S_k(t) + N_k(t) \right] * I(V^a, \tau^a, t)$$

$N$: number of neurons;
$w^{f}_{ki}$: synaptic weight of feeding input from ith to kth neuron;
$Y_i(t)$: spike output of ith neuron;
$S_k(t)$: analog stimulus input to kth neuron (only for layer 1);
$N_k(t)$: analog "noise" input; identical bandwidth and standard deviation for each neuron; statistically independent in each neuron;
$*$: convolution operator;
$I(V^a, \tau^a, t)$: impulse response of leaky-integrators at the feeding inputs.

Convolutions, e.g. $X(t) = Z(t) * I(V, \tau, t)$, are computed via the digital filter equation

$$X[n] = X[n-1] \cdot \exp(-1/\tau) + V \cdot Z[n] \tag{A.4}$$

where $t = n \cdot \Delta$ increases in discrete time steps of duration $\Delta = 1$. The contribution via linking-inputs (index "l") is

$$L_k(t) = \sum_{i=1}^{N} w^{l}_{ki} Y_i(t) * I(V^l, \tau^l, t)$$

The output of the spike-encoder of the kth neuron is given by:

$$Y_k(t) = \begin{cases} 1 & \text{if } U_{m,k}(t) \ge \Theta_k(t) \\ 0 & \text{otherwise} \end{cases}$$

The threshold's time course is derived by

$$\Theta_k(t) = \Theta_0 + Y_k(t) * I(V^s, \tau^s, t) \tag{A.7}$$

with a threshold-offset $\Theta_0$.
                                  Fig. 3    Fig. 4A    Fig. 4B
Feeding-Inputs:
  Time-constant τ^a                10.0      15.0       10.0
  Amplification V^a                 0.5       0.5        0.5
Linking-Inputs:
  Time-constant τ^l                 1.0       0.1        1.0
  Amplification V^l                 5.0      30.0        5.0
Threshold:
  Time-constant τ^s                 7.5       5.0        5.0
  Amplification V^s                50.0      70.0       50.0
  Offset Θ_0                        0.5       1.8        0.5

Table 1: Simulation-Parameters.
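The model equations of Appendix A can be illustrated with a short Python sketch. This is not the authors' simulation code. The function names `leaky_step` and `simulate`, the omission of the noise input and of lateral feeding weights, and the choice to drive the linking input with the spikes of the previous time step are all assumptions made here for illustration. The default parameters are taken from the Fig. 3 column of Table 1.

```python
import numpy as np

def leaky_step(x_prev, z, V, tau):
    """Digital leaky integrator: X[n] = X[n-1] * exp(-1/tau) + V * Z[n] (time step 1)."""
    return x_prev * np.exp(-1.0 / tau) + V * z

def simulate(stimulus, w_link, tau_a=10.0, V_a=0.5, tau_l=1.0, V_l=5.0,
             tau_s=7.5, V_s=50.0, theta0=0.5):
    """Simulate K model neurons for T steps; returns a (T, K) array of 0/1 spikes.

    stimulus: (T, K) analog feeding input S_k (noise input omitted, an assumption)
    w_link:   (K, K) hypothetical linking weights, w_link[k, i] from neuron i to k
    """
    stimulus = np.asarray(stimulus, dtype=float)
    T, K = stimulus.shape
    feed = np.zeros(K)      # F_k
    link = np.zeros(K)      # L_k
    thresh = np.zeros(K)    # dynamic part of the threshold (without the offset)
    y_prev = np.zeros(K)    # spikes of the previous step
    spikes = np.zeros((T, K))
    for n in range(T):
        feed = leaky_step(feed, stimulus[n], V_a, tau_a)
        link = leaky_step(link, w_link @ y_prev, V_l, tau_l)
        u = feed * (1.0 + link)                      # membrane voltage, as in (A.1)
        y = (u >= theta0 + thresh).astype(float)     # spike encoder
        thresh = leaky_step(thresh, y, V_s, tau_s)   # each spike raises the threshold
        spikes[n] = y
        y_prev = y
    return spikes

# Hypothetical usage: two mutually linked neurons driven by a constant stimulus.
stim = np.ones((200, 2))
w = np.array([[0.0, 1.0], [1.0, 0.0]])
print(simulate(stim, w).sum(axis=0))
```

Because the linking term enters multiplicatively, a linking input alone cannot make a neuron fire; it only modulates an existing feeding drive, which is the property the text relies on for feature linking.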
References
Basar, E. 1988. EEG-Dynamics and Evoked Potentials in Sensory and Cognitive Processing by the Brain. Springer Series in Brain Dynamics 1, E. Basar (ed.), pp. 30-55. Springer-Verlag, Berlin.
Chagnac-Amitai, Y., and Connors, B. W. 1989. Horizontal spread of synchronized activity in neocortex and its control by GABA-mediated inhibition. J. Neurophysiol. 62, 1149-1162.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 121-129.
Eckhorn, R. 1990. Stimulus-specific synchronizations in the visual cortex: Linking of local features into global figures? In Neuronal Cooperativity, J. Kruger, ed., (in press). Springer-Verlag, Berlin.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybernet. 60, 121-130.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1989a. Feature linking via stimulus-evoked oscillations: Experimental results from cat visual cortex and functional implications from a network model. Proceed. Int. Joint Conf. Neural Networks, Washington. IEEE TAB Neural Network Comm., San Diego, I, 723-730.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1989b. A neural network for feature linking via synchronous activity: Results from cat visual cortex and from simulations. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 255-272. Cambridge Univ. Press, Cambridge.
Freeman, W., and Skarda, C. A. 1985. Spatial EEG patterns, non-linear dynamics and perception: The Neo-Sherringtonian view. Brain Res. Rev. 10, 147-175.
Freeman, W. J., and van Dijk, B. W. 1987. Spatial patterns of visual cortical fast EEG during conditioned reflex in a rhesus monkey. Brain Res. 422, 267-276.
Gray, C. M., and Singer, W. 1987. Stimulus-dependent neuronal oscillations in the cat visual cortex area 17. 2nd IBRO-Congress, Neurosci. Suppl. 1301P.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Grossberg, S. 1983. Neural substrates of binocular form perception: Filtering, matching, diffusion and resonance. In Synergetics of the Brain, E. Basar, H. Flohr, H. Haken, and A. J. Mandell, eds., pp. 274-298. Springer-Verlag, Berlin.
Kammen, D. M., Holmes, P. J., and Koch, C. 1989. Cortical architecture and oscillations in neuronal networks: Feedback versus local coupling. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 273-284. Cambridge Univ. Press, Cambridge.
Kraut, M. A., Arezzo, J. C., and Vaughan, H. G. 1985. Intracortical generators of the flash VEP in monkeys. Electroenceph. Clin. Neurophysiol. 62, 300-312.
Mitzdorf, U. 1987. Properties of the evoked potential generators: Current source-density analysis of visually evoked potentials in the cat cortex. Int. J. Neurosci. 33, 33-59.
Pabst, M., Reitboeck, H. J., and Eckhorn, R. 1989. A model of pre-attentive region definition in visual patterns. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 137-150. Cambridge Univ. Press, Cambridge.
Reitboeck, H. J. 1983. A multi-electrode matrix for studies of temporal signal correlations within neural assemblies. In Synergetics of the Brain, E. Basar, H. Flohr, H. Haken, and A. Mandell, eds., pp. 174-182. Springer-Verlag, Berlin.
Reitboeck, H. J. 1989. Neuronal mechanisms of pattern recognition. In Sensory Processing in the Mammalian Brain, J. S. Lund, ed., pp. 307-330. Oxford Univ. Press, New York.
Salin, P. A., Bullier, J., and Kennedy, H. 1989. Convergence and divergence in the afferent projections to cat area 17. J. Comp. Neurol. 283, 486-512.
Sheer, D. E. 1989. Sensory and cognitive 40-Hz event-related potentials: Behavioral correlates, brain function, and clinical application. In Springer Series in Brain Dynamics 2, E. Basar and T. H. Bullock, eds., pp. 339-374. Springer-Verlag, Berlin.
Sporns, O., Gally, J. A., Reeke, G. N., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Von der Malsburg, C. 1985. Nervous structures with dynamical links. Ber. Bunsengesell. Phys. Chem. 89, 703-710.
Von der Malsburg, C., and Singer, W. 1988. Principles of cortical network organization. In Neurobiology of Neocortex, P. Rakic, and W. Singer, eds., pp. 69-99. Wiley, New York.
Received 11 December 1989; accepted 11 June 1990.
Communicated by Geoffrey Hinton
Towards a Theory of Early Visual Processing Joseph J. Atick School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540, USA
A. Norman Redlich Department of Physics and Center for Neural Science, New York University, New York, NY 10003, USA
We propose a theory of the early processing in the mammalian visual pathway. The theory is formulated in the language of information theory and hypothesizes that the goal of this processing is to recode in order to reduce a "generalized redundancy" subject to a constraint that specifies the amount of average information preserved. In the limit of no noise, this theory becomes equivalent to Barlow's redundancy reduction hypothesis, but it leads to very different computational strategies when noise is present. A tractable approach for finding the optimal encoding is to solve the problem in successive stages where at each stage the optimization is performed within a restricted class of transfer functions. We explicitly find the solution for the class of encodings to which the parvocellular retinal processing belongs, namely linear and nondivergent transformations. The solution shows agreement with the experimentally observed transfer functions at all levels of signal to noise.
Neural Computation 2, 308-320 (1990) © 1990 Massachusetts Institute of Technology
In the mammalian visual pathway, data from the photoreceptors are processed sequentially through successive layers of neurons in the retina and in the visual cortex. The early stages of this processing (the retina and the first few layers of the visual cortex) exhibit a significant degree of universality; they are very similar in many species and do not change as a mature animal learns new visual perceptual skills. This suggests that the early stages of the visual pathway are solving a very general problem in data processing, which is independent of the details of each species' perceptual needs.
In the first part of this paper, we formulate a theory of early visual processing that identifies this general problem. The theory is formulated in the language of information theory (Shannon and Weaver 1949) and was inspired by Barlow's redundancy reduction hypothesis for perception (Barlow 1961, 1989). Barlow's hypothesis is, however, applicable only to noiseless channels, which are unrealistic. The theory that we develop here is formulated for noisy channels. It agrees with Barlow's hypothesis in the limit of no noise but it leads to different computational strategies when noise is present. Our theory hypothesizes that the goal of visual processing is to recode the sensory data in order to reduce a redundancy measure, defined below, subject to a constraint that fixes the amount of average information maintained.
The present work is an outgrowth of an earlier publication in which we addressed some of these issues (Atick and Redlich 1989). However, in that work the role of noise was not rigorously formulated, and although all solutions exhibited there did well in reducing redundancy, they were not proven to be optimal. For a related attempt to understand neural processing from information theory see Linsker (1986, 1989) (see also Uttley 1979).
The problem of finding the optimal redundancy reducing code among all possible codes is most likely impossible to solve. A more tractable strategy is to reduce redundancy in successive stages, where at each stage one finds the optimal code within a restricted class. This appears to be the mechanism used in the visual pathway. For example, in the "parvocellular" portion of the pathway, which is believed to be concerned with detailed form recognition, the first recoding (the output of the retinal ganglion cells) can be characterized as linear and nondivergent (code dimension is unchanged). At the next stage, the recoding of the simple cells is still substantially linear but is divergent (for a review see Orban 1984). In this paper, we solve the problem of redundancy reduction for the class of linear and nondivergent codes and we find that the optimal solution is remarkably similar to the experimentally observed ganglion cell recoding. We leave the solution for the next stage of linear divergent codes, where one expects a simple-cell-like solution, for a future publication.
1 Formulation of the Theory
For concreteness, we shall start by formulating our theory within the specific context of retinal processing. The theory in its more general context will become clear later when we state our redundancy reduction hypothesis. It is helpful to think of the retinal processing in terms of a pair of communication channels, as pictured in Figure 1. In this flow chart, the center box represents the retinal transfer function $A$, with the signal $x$ representing the visual input including noise $\nu$, and $y$ the output of the ganglion cells. Here, we do not concern ourselves with the detailed implementation of this transfer function by the retina, which involves a fairly complicated interaction between the photoreceptors and the layers of cells leading to the ganglion cells. Although the input $x$ is the actual input to the visual system, we have introduced an earlier input communication channel in the flow diagram with $s$ representing an ideal signal. This earlier communication channel is taken to be the source of all forms of noise $\nu$ in the signal $x$, including quantum fluctuations, intrinsic noise introduced by the biological hardware, and semantic noise already present in the images. It must be kept in mind, however, that neither noise nor ideal signal is a universal concept; both depend on what is useful visual information to a particular organism in a particular environment. Here we assume that, minimally, the ideal signal does not include noise $\nu$ in the form of completely random fluctuations. It is reasonable to expect this minimal definition of noise to apply to the early visual processing of all organisms.
Figure 1: Schematic diagram showing the three stages of processing that the ideal signal $s$ undergoes before it is converted to the output $y$: the input channel $x = s + \nu$, the recoding $Ax$, and the output channel $y = Ax + \delta$. $\nu$ ($\delta$) is the noise in the input (output) channel.
In this paper, we take the input $x$ to be the discrete photoreceptor sampling of the two-dimensional luminosity distribution on the retina. For simplicity, we use a Cartesian coordinate system in which the sampled signal at point $n = (n_1, n_2)$ is $x[n]$. However, all of the results below can be rederived, including the sampling, starting with spatially continuous signals, by taking into account the optical modulation transfer function of the eye (see the analysis in Atick and Redlich 1990). The output $y$ is the recoded signal $Ax$ plus the noise $\delta$ in the output channel. This channel may be thought of as the optic nerve. Since the recoding of $x$ into $y$ is linear and nondivergent, its most general form can be written as $y[m] = \sum_n A[m,n]\, x[n] + \delta[m]$, where the transfer function $A[m,n]$ is a square matrix.
In order to formulate our hypothesis, we need to define some quantities from information theory (Shannon and Weaver 1949) that measure
how well the visual system is communicating information about the visual scenes. We first define the mutual information $I(x, s)$ between the ideal signal $s$ and the actual signal $x$:

$$I(x, s) = \int dx\, ds\; P(x, s) \log\left[\frac{P(x, s)}{P(x)\,P(s)}\right] \tag{1.1}$$

with a similar formula for $I(y, s)$. In equation 1.1, $P(s)$ [or $P(x)$] is the probability of the occurrence of a particular visual signal $s[n]$ (or $x[n]$), and $P(x, s)$ is the probability of the joint occurrence of $s[n]$ together with $x[n]$. $I(x, s)$ measures the actual amount of useful information available at the level of the photoreceptors $x$, given that the desired signal is $s$. Likewise, $I(y, s)$ measures the useful information available in the signal $y$. Also, for continuous signals, the mutual information is a well-defined, coordinate invariant quantity. To calculate the mutual information $I(x, s)$ explicitly, it is necessary to know something about the probabilities $P(s)$, $P(x)$, and $P(x, s)$. These probability functions, together with the relationship $y = Ax + \delta$, are also sufficient to calculate the mutual information $I(y, s)$. Although $P(s)$, $P(x)$, and $P(x, s)$ cannot be known completely, we do assume knowledge of the second-order correlators $\langle s[n]s[m]\rangle$, $\langle x[n]x[m]\rangle$, and $\langle x[n]s[m]\rangle$, where $\langle\ \rangle$ denotes the average over the ensemble of all visual scenes. We assume that these correlators are of the form

$$\langle s[n]s[m]\rangle = R_0[n,m], \qquad \langle x[n]x[m]\rangle = R_0[n,m] + N^2\delta_{n,m}, \qquad \langle x[n]s[m]\rangle = R_0[n,m] \tag{1.2}$$

where $R_0[n,m]$ is some yet unspecified correlation matrix, and we have defined $\langle \nu[n]\nu[m]\rangle = N^2\delta_{n,m}$. Using $x = s + \nu$, equations 1.2 imply that there are no correlations between the noise $\nu$ and $s$. Given these correlators, we assume that the probability distributions are those with maximal entropy:

$$P(u) = \left[(2\pi)^d \det(R_{uu})\right]^{-1/2} \exp\left[-\frac{1}{2}\sum_{n,m} \left(u[n]-\bar{u}[n]\right) R_{uu}^{-1}[n,m] \left(u[m]-\bar{u}[m]\right)\right] \tag{1.3}$$

for $u = s, x, y$ and $R_{uu}[n,m] = \langle u[n]u[m]\rangle$ ($d$ is the dimension of $n$). We have included here the mean $\bar{u} = \langle u\rangle$, although in all of our results it drops out. Equation 1.3 can also be used to determine the joint probabilities $P(x, s)$ and $P(y, s)$, since these are equal to $P(z_{xs})$ and $P(z_{ys})$ for the larger sets of stochastic variables $z_{xs} = (x, s)$ and $z_{ys} = (y, s)$ whose correlators $R_{zz}$ are calculated from $R_{ss}$, $R_{xx}$, $R_{xs}$, $R_{yy}$, and $R_{ys}$. It is not difficult to show, using the explicit expressions for the various probability distributions, that
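
$$I(x, s) = \frac{1}{2}\log\left\{\frac{\det(R_0 + N^2)}{\det(N^2)}\right\} \tag{1.4}$$

$$I(y, s) = \frac{1}{2}\log\left\{\frac{\det\left[A(R_0 + N^2)A^T + N_\delta^2\right]}{\det\left(A N^2 A^T + N_\delta^2\right)}\right\} \tag{1.5}$$
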
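As a numerical illustration of equations 1.4 and 1.5 under the gaussian (maximal entropy) assumption, the following Python sketch evaluates the two mutual informations for a small one-dimensional ensemble. The helper names, the exponentially decaying test matrix built by `exp_correlation`, and all parameter values are assumptions chosen here for illustration; they are not taken from the paper's numerical work.

```python
import numpy as np

def logdet(M):
    """Log-determinant of a positive definite matrix."""
    sign, val = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("matrix must be positive definite")
    return val

def info_x_s(R0, N2):
    """I(x, s) = 1/2 log[det(R0 + N^2) / det(N^2)], with N^2 multiplying the identity."""
    eye = np.eye(R0.shape[0])
    return 0.5 * (logdet(R0 + N2 * eye) - logdet(N2 * eye))

def info_y_s(A, R0, N2, Nd2):
    """I(y, s) = 1/2 log{det[A(R0 + N^2)A^T + Nd^2] / det(A N^2 A^T + Nd^2)}."""
    eye = np.eye(R0.shape[0])
    num = A @ (R0 + N2 * eye) @ A.T + Nd2 * eye
    den = A @ (N2 * eye) @ A.T + Nd2 * eye
    return 0.5 * (logdet(num) - logdet(den))

def exp_correlation(n, S=1.0, D=5.0):
    """Illustrative correlation matrix R0[i, j] = S^2 exp(-|i - j| / D)."""
    idx = np.arange(n)
    return S**2 * np.exp(-np.abs(idx[:, None] - idx[None, :]) / D)

R0 = exp_correlation(32)
for N2 in (0.01, 1.0, 100.0):
    print(N2, info_x_s(R0, N2), info_y_s(np.eye(32), R0, N2, 0.025**2))
```

With these assumed values the printed informations shrink as the input noise grows, and passing the signal through the noisy output channel never increases the information, in line with the discussion of equations 1.4 and 1.5.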
(In 1.5, we used $\langle \delta[n]\delta[m]\rangle = N_\delta^2\,\delta_{n,m}$.) The mutual informations depend on both the amount of noise and on the amount of correlations in the signals. Noise has the effect of reducing $I(x, s)$ [or $I(y, s)$] because it causes uncertainty in what is known about $s$ at $x$ (or $y$). In fact, infinite noise reduces $I(x, s)$ and $I(y, s)$ to zero. This becomes clear in equations 1.4 and 1.5 as $N^2$ goes to infinity, since then the ratio of determinants goes to one causing $I$ to vanish. Increasing spatial correlations in equations 1.4 and 1.5 also has the effect of reducing $I$ because correlations reduce the information in the signals. Correlations indicate that some scenes are far more common than others, and an ensemble with this property has lower average information than one in which all messages are equally probable. The effect of increasing correlations is most easily seen, for example, in $I(x, s)$ in the limit of $N^2$ very small, in which case $I(x, s) \sim \log[\det(R_0)]$. If the average signal strengths $\langle s^2[n]\rangle$ are held constant, then $\log[\det(R_0)]$ is maximum when $R_0$ is diagonal (no correlations), while $\det(R_0)$ vanishes when the signal is completely correlated, that is, $R_0[n,m] = \text{constant}$, $\forall\, n, m$. In fact, by Wegner's theorem (Bodewig 1956, p. 71), for positive definite matrices (correlation matrices are always positive definite) $\det(R_0) \le \prod_i (R_0)_{ii}$, with equality only when $R_0$ is completely diagonal. Having introduced a measure $I(y, s)$ of the actual average information available at $y$, we now define the channel capacity $C_{out}(y)$, which measures the maximal amount of information that could flow through the output channel. Here, we define the capacity $C_{out}(y)$ as the maximum of $I(y, w)$ varying freely over the probabilities $P(w)$ of the inputs to the output channel, holding the average signal strengths $\langle y^2[n]\rangle$ fixed:
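
$$C_{out}(y) = \max_{P(w)} \left\{ I(y, w) \right\}\Big|_{(R_{ww})_{ii} = \text{const}} \tag{1.6}$$

where $y = w + \delta$ and $(R_{ww})_{ii}$ are the diagonal elements of the autocorrelator of $w$ ($w$ is a dummy variable, which in Fig. 1 corresponds to $Ax$). A constraint of this sort is necessary to obtain a finite capacity for continuous signals and is equivalent to holding constant the average "power" expenditure or the variance in the number of electrochemical spikes sent along each fiber of the optic nerve.¹ Using Wegner's theorem, the maximum occurs for the probability distribution $P(w)$ for which $R_{yy}$, or equivalently $R_{ww}$, is diagonal:

$$C_{out}(y) = \frac{1}{2}\sum_i \log\left[\frac{(R_{yy})_{ii}}{N_\delta^2}\right] \tag{1.7}$$

which for $y = Ax + \delta$ is explicitly

$$C_{out}(y) = \frac{1}{2}\sum_i \log\left\{\frac{\left[A(R_0 + N^2)A^T + N_\delta^2\right]_{ii}}{N_\delta^2}\right\} \tag{1.8}$$
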
At this point, we are ready to state our generalized redundancy reduction hypothesis. We propose that the purpose of the recoding A of the visual signal in the early visual system is to minimize the "redundancy"

$$\mathcal{R} = 1 - I(y, s)/C_{out}(y) \tag{1.9}$$

subject to the constraint that $I(y, s)$ be equal to the minimum average information $I^*$ that must be retained to meet an organism's needs. $I(y, s)$ is therefore constrained to be a fixed quantity and redundancy is reduced by choosing an $A$ that minimizes $C_{out}(y)$. To avoid confusion, we should emphasize that $C_{out}$ is fixed only at fixed "power," but can be lowered by choosing $A$ to lower the output "power." Although in practice we do not know precisely what the minimal $I^*$ is, we assume here that it is the information available to the retina, $I(x, s)$, lowered slightly by the presence of the additional noise $\delta$ in the output channel. We therefore choose the constraint

$$I(y, s) = I^* = I(x + \delta, s) \tag{1.10}$$

but our results below do not depend qualitatively on this precise form for the constraint. Since $I^*$ does not depend on $A$, it can be determined from physiological data, and then used to predict independent experiments (see Atick and Redlich 1990). The reader should be cautioned that equation 1.9 is not the conventional definition of redundancy for the total channel from $s$ to $y$. The standard redundancy would be $\mathcal{R} = 1 - I(y, s)/C_{tot}(y)$ where $C_{tot}$ is the maximum of $I(y, s)$ varying freely over the input probabilities $P(s)$, keeping $\langle y^2\rangle$ fixed. In contrast to $C_{tot}$, $C_{out}$ is directly related to the "power" in the optic fiber, so reducing equation 1.9 in the manner just described always leads to lower "power" expenditure. Also, since $C_{out} > C_{tot}$, lowering $C_{out}$ necessarily lowers the "power" expenditure at all stages up to $y$, which is why we feel equation 1.9 could be biologically more significant.
¹ Since a ganglion cell has a nonvanishing mean output, "power" here is actually the cell's dynamic range.
Our hypothesis is similar to Barlow's redundancy reduction hypothesis (Barlow 1961), with the two becoming identical when the system is free of noise $\nu$. In this limit, redundancy is reduced by diagonalizing the correlation matrix $R_0$, by choosing the transfer matrix $A$ such that $R_{yy} = A R_0 A^T$ is diagonal. With $R_{yy}$ diagonal, the relationship $\det(R_{yy}) \le \prod_i (R_{yy})_{ii}$ becomes an equality giving $C(y) = I(y, s)$ so the redundancy (1.9) is eliminated. [In reality, the redundancy (1.9) is a lower bound reflecting the fact that we chose probability distributions which take into account only second-order correlators. More complete knowledge of $P(s)$ would lower $I(x, s)$ and $I(y, s)$ and therefore increase $\mathcal{R}$.] Where reducing $\mathcal{R}$ in equation 1.9 differs considerably from Barlow's hypothesis is in the manner of redundancy reduction when noise is significant. Under those circumstances, $\mathcal{R}$ in equation 1.9 is sizable, not because of correlations in the signal, but because much of the channel capacity is wasted carrying noise. Reducing equation 1.9 when the noise is large has the effect of increasing the signal-to-noise ratio. To do this the system actually increases correlations (more precisely increases the amplitude of the correlated signal relative to the noise amplitude), since correlations are what distinguish signal from noise. For large enough noise, more is gained by lowering the noise in this way than is lost by increasing correlations.
For an intermediate regime, where signal and noise are comparable, our principle leads to a compromise solution, which locally accentuates correlations, but on a larger scale reduces them. All these facts can be seen by examining the properties of the explicit solution given below. Before we proceed, it should also be noted that Linsker (1986) has hypothesized that the purpose of the encoding $A$ should be to maximize the mutual information $I(y, s)$, subject to some constraints. This differs from the principle in this paper, which focuses on lowering the output channel capacity while maintaining the minimum information needed by the organism. While both principles may be useful to gain insight into the purposes of neural processing in various portions of the brain, in the early visual processing, we believe that the primary evolutionary pressure has been to reduce output channel capacity. For example, due to much lower resolution in peripheral vision, the amount of information arriving at the retina is far greater than the information kept. It is difficult to believe that this design is a consequence of inherent local biological hardware constraints, since higher resolution hardware is clearly feasible, as seen in the fovea.
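The noiseless limit described above can be checked numerically. The sketch below computes the generalized redundancy of equation 1.9 from the gaussian expressions for $I(y, s)$ and $C_{out}(y)$, and compares an identity recoding with an orthogonal recoding that diagonalizes $R_0$. The function names, the exponential test correlation matrix, and the parameter values are illustrative assumptions, not the authors' computation.

```python
import numpy as np

def logdet(M):
    """Log-determinant of a positive definite matrix."""
    sign, val = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("matrix must be positive definite")
    return val

def info_y_s(A, R0, N2, Nd2):
    """Gaussian mutual information I(y, s) for y = A x + delta, x = s + nu."""
    eye = np.eye(R0.shape[0])
    num = A @ (R0 + N2 * eye) @ A.T + Nd2 * eye
    den = A @ (N2 * eye) @ A.T + Nd2 * eye
    return 0.5 * (logdet(num) - logdet(den))

def capacity_out(A, R0, N2, Nd2):
    """Output capacity: 1/2 sum_i log[(R_yy)_ii / Nd^2], using only the diagonal of R_yy."""
    eye = np.eye(R0.shape[0])
    Ryy = A @ (R0 + N2 * eye) @ A.T + Nd2 * eye
    return 0.5 * np.sum(np.log(np.diag(Ryy) / Nd2))

def redundancy(A, R0, N2, Nd2):
    """Generalized redundancy 1 - I(y, s) / C_out(y)."""
    return 1.0 - info_y_s(A, R0, N2, Nd2) / capacity_out(A, R0, N2, Nd2)

n = 16
idx = np.arange(n)
R0 = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 5.0)  # assumed test correlations
_, U = np.linalg.eigh(R0)       # columns of U are eigenvectors of R0
N2, Nd2 = 0.0, 0.025**2         # no input noise: the Barlow limit

print("identity recoding:      ", redundancy(np.eye(n), R0, N2, Nd2))
print("decorrelating recoding: ", redundancy(U.T, R0, N2, Nd2))
```

Both recodings are orthogonal and therefore preserve $I(y, s)$, so the comparison is made at fixed information; only the decorrelating one removes the redundancy, up to rounding error.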
2 Explicit Solution
To actually minimize $\mathcal{R}$ we use a Lagrange multiplier $\lambda$ to implement the constraint (equation 1.10) and minimize

$$E\{A\} = C(y) - \lambda\left[I(y, s) - I(x + \delta, s)\right] \tag{2.1}$$

with respect to the transfer function $A$, where $C(y)$, $I(x, s)$, and $I(y, s)$ are given in equations 1.8, 1.4, and 1.5, respectively. One important property of $R[n, m]$ (the autocorrelator $\langle x[n]x[m]\rangle = R_0[n,m] + N^2\delta_{n,m}$ in equation 1.2) that we shall assume is translation invariance, $R[n, m] = R[n - m]$, which is a consequence of the homogeneity of the ensemble of all visual scenes. We can take advantage of this symmetry to simplify our formulas by assuming $A[n, m] = A[n - m]$. With this assumption, the diagonal elements $(R_{yy})_{ii}$ in equation 1.7 are all equal and hence minimizing $C(y)$ is equivalent to minimizing the simpler expression $\mathrm{Tr}(A R A^T)$. Using the identity $\log(\det B) = \mathrm{Tr}(\log B)$ for any positive definite matrix $B$, and replacing $C(y)$ by $\mathrm{Tr}(A R A^T)$, equation 2.1 becomes

$$E\{A\} = \frac{1}{N_\delta^2}\int d\omega\, A(\omega) R(\omega) A(-\omega) - \frac{\lambda}{2}\int d\omega\, \log\left[\frac{A(\omega)R(\omega)A(-\omega) + N_\delta^2}{A(\omega)N^2A(-\omega) + N_\delta^2}\cdot\frac{N^2 + N_\delta^2}{R(\omega) + N_\delta^2}\right] \tag{2.2}$$

where all variables are defined in momentum space through the standard discrete two-dimensional Fourier transform, for example,

$$A(\omega) = A(\omega_1, \omega_2) = \sum_m e^{-i m \cdot \omega} A[m]$$

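It is straightforward to see from equation 2.2 that the optimal transfer function $A(\omega)$ satisfies the following quadratic equation:

$$R N^2 F^2 + (R + N^2) F + 1 - \frac{\lambda R_0}{2R} = 0 \tag{2.3}$$

where $R = R_0 + N^2$ and we have defined $F(\omega) = A(\omega) \cdot A(-\omega)/N_\delta^2$. The fact that $A$ appears only through $F$ is a manifestation of the original invariances of $I$ and $C$ under orthogonal transformations on the transfer function $A[m]$, that is, under $A \to UA$ with $U^T U = 1$. Equation 2.3 has only one positive solution for $F$, which is given explicitly by

$$F(\omega) = \frac{1}{N^2}\left\{\frac{1}{2}\,\frac{R_0(\omega)}{R_0(\omega) + N^2}\left[1 + \sqrt{1 + \frac{2\lambda N^2}{R_0(\omega)}}\,\right] - 1\right\} \tag{2.4}$$

where $\lambda$ is determined by solving $I(y, s) = I(x + \delta, s)$. With $F$ taken from equation 2.4, the latter equation becomes

$$\int d\omega\, \log\left[\frac{F(\omega)R(\omega) + 1}{F(\omega)N^2 + 1}\right] = \int d\omega\, \log\left[\frac{R(\omega) + N_\delta^2}{N^2 + N_\delta^2}\right] \tag{2.5}$$
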
In general, equation 2.5 must be solved for $\lambda$ numerically. The fact that the transfer function $A$ appears only through $F$ leads to a multitude of degenerate solutions for $A$, related to each other by orthogonal transformations. What chooses among them has to be some principle of minimum effort in implementing such a transfer function. For example, some of the solutions are nonlocal (by local we mean a neighborhood of a point $n$ on the input grid is mapped to the neighborhood of the corresponding point $n$ on the output grid), so they require more elaborate hardware to implement; hence we examine local solutions. Among these is a unique solution satisfying $A(\omega) = A(-\omega)$, which implies that it is rotationally invariant in coordinate space. We compare it to the observed retinal transfer function (ganglion kernel), known to be rotationally symmetric. Since rotation symmetry is known to be broken at the simple cell level, it is significant that this formalism is also capable of producing solutions that are not rotationally invariant even when the correlation function is. It may be that the new features of the class of transfer functions at that level (for example, divergence factor) will lift the degeneracy in favor of the nonsymmetric solutions. (In fact, in one dimension we find solutions that break parity and look like one-dimensional simple cell kernels.) The rotationally invariant solution is obtained by taking the square root of $F$ in equation 2.4 (we take the positive square root, corresponding to on-center cells). In what follows, we examine some of its most important properties. To be specific, we parameterize the correlation function by a decaying exponential

$$R_0[n - m] = S^2 \exp\left(-|n - m|/D\right)$$

with $D$ the correlation length measured in acuity units and $S$ the signal amplitude. We have done numerical integration of equations 2.4 and 2.5 and determined $A[m]$ for several values of the parameters. In Figure 2, we display one typical solution, which was obtained with $S/N = 2.0$, $D = 50$, and $N_\delta = 0.025$. In that figure, empty disks represent positive (excitatory), while solid disks represent negative (inhibitory) components of $A[m]$. Also, the logarithm of the area of a disk is directly related to the amplitude of the component of $A[m]$ at that location. As one can see, the solution has a strong and rather broad excitatory center with a weaker and more diffuse surround. A very significant feature of the theoretical profiles is their insensitivity to $D$ (and to $N_\delta$), which is necessary to account for the fact that the observed profiles measured in acuity units are similar in different species and at different eccentricities.
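The numerical procedure can be sketched in one dimension as follows. This is a hedged illustration rather than the authors' two-dimensional computation. It assumes the energy is normalized as $E = \int d\omega\, \{F R - (\lambda/2)\log[(FR + 1)/(FN^2 + 1)]\}$ with $R = R_0 + N^2$ and $F = A(\omega)A(-\omega)/N_\delta^2$, so that stationarity gives $(FR + 1)(FN^2 + 1) = \lambda R_0/(2R)$; a different normalization of $\lambda$ would rescale the multiplier but not the resulting kernel. The function names, the one-dimensional lattice spectrum, the clipping of $F$ at zero, and the bisection on $\lambda$ are all choices made here.

```python
import numpy as np

def spectrum_exponential(omega, S, D):
    """Fourier transform of R0[n] = S^2 exp(-|n|/D) on a 1-D integer lattice."""
    rho = np.exp(-1.0 / D)
    return S**2 * (1.0 - rho**2) / (1.0 - 2.0 * rho * np.cos(omega) + rho**2)

def optimal_F(R0, N2, lam):
    """Positive root of (F R + 1)(F N^2 + 1) = lam R0 / (2 R); zero where none exists."""
    root = np.sqrt(R0**2 + 2.0 * lam * N2 * R0)
    F = (root - (R0 + 2.0 * N2)) / (2.0 * N2 * (R0 + N2))
    return np.maximum(F, 0.0)

def info_y(F, R0, N2):
    """I(y, s) summed over frequencies, written in terms of F."""
    R = R0 + N2
    return 0.5 * np.sum(np.log((F * R + 1.0) / (F * N2 + 1.0)))

def info_target(R0, N2, Nd2):
    """I(x + delta, s) summed over frequencies: the information to be preserved."""
    return 0.5 * np.sum(np.log((R0 + N2 + Nd2) / (N2 + Nd2)))

def solve_lambda(R0, N2, Nd2, iters=200):
    """Bisection on lam so that I(y, s) matches I(x + delta, s)."""
    target = info_target(R0, N2, Nd2)
    lo, hi = 0.0, 1.0
    while info_y(optimal_F(R0, N2, hi), R0, N2) < target:
        hi *= 2.0                       # I(y, s) increases monotonically with lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if info_y(optimal_F(R0, N2, mid), R0, N2) < target:
            lo = mid
        else:
            hi = mid
    return hi

n = 64
omega = 2.0 * np.pi * np.arange(n) / n
S, N, D, Nd = 2.0, 1.0, 50.0, 0.025      # S/N = 2, D = 50, N_delta = 0.025
R0 = spectrum_exponential(omega, S, D)
lam = solve_lambda(R0, N**2, Nd**2)
F = optimal_F(R0, N**2, lam)
A_w = Nd * np.sqrt(F)                    # positive square root (on-center choice)
kernel = np.real(np.fft.ifft(A_w))       # spatial profile A[m], centered at m = 0
print(lam, kernel[:5])
```

Taking the positive square root of $F$ at every frequency yields a real, even kernel whose largest component sits at the center; the other degenerate solutions would correspond to different phase choices for $A(\omega)$.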
Figure 2: Optimal transfer function, $A[m]$, for nondivergent linear codes, with $D = 50$, $S/N = 2$, and $N_\delta = 0.025$. Open disks denote positive (excitatory) components of $A[m]$ while solid disks denote negative (inhibitory) components. The area of a disk is directly related to the logarithm of $A[m]$ at that location.
To get more insight into this solution, let us qualitatively examine its behavior as we change $S/N$ (for a detailed quantitative comparison with physiological data see Atick and Redlich 1990). For that, we find it more convenient to integrate out one of the dimensions (note this is not the same as solving the problem in one dimension). The resulting profile, corresponding to Figure 2, is shown in Figure 3b. In Figure 3, we have also plotted the result for two other values of $S/N$, namely for low and high noise regimes (Fig. 3a and c, respectively). These show that an interpolation is happening as $S/N$ changes between the two extremes. Analytically, we can also see this from equation 2.4 for any $R_0$ by taking the limit $N/S \to 0$, where $A(\omega)$ becomes proportional to $1/\sqrt{R_0(\omega)}$.
One recognizes that this is the square root of the solution one gets by carrying out prediction on the inputs, a signal processing technique which we advocated for this regime of noise (see also Srinivasan et al. 1982) as a redundancy reduction technique in our previous paper (Atick and Redlich 1989). The spatial profiles for the square root solution are very similar to the prediction profiles, albeit a bit more spread out in the surround region. This type of profile reduces redundancy by reducing the amount of correlations present in the signal.

Figure 3: (a-c) Optimal solution at three different values of S/N. These profiles have been produced from the two-dimensional solution by summing over one direction and normalizing the resulting profile such that the height of the central point is equal to the center height in the two-dimensional solution. [Surviving plot labels: S/N = 0.1; horizontal axis from -10 to 10.]

In the other regime, where noise is very large compared to the signal, the solution is $A(\omega) \sim (R_0/N^2)^{1/4}$ and has the same qualitative features as the smoothing solution (Atick and Redlich 1989), which in that limit is $A_{\text{smoothing}} = R_0/N^2$. Smoothing increases the signal to noise of the output and, in our earlier work, we argued that it is a good redundancy reducing technique in that noise regime. Moreover, in that work, we argued that to maintain redundancy reduction at all signal-to-noise levels a process that interpolates between prediction and smoothing has to take place. We proposed a convolution of the prediction and the smoothing profiles as a possible interpolation (SPI-coding), which was shown to be better than either prediction or smoothing. In the present analysis, the optimal redundancy reducing transfer function is derived, and, although it is not identical to SPI-coding, it does have many of the same qualitative properties, such as the interpolation just mentioned and the overall center-surround organization.

The profiles in Figures 2 and 3 are very similar to the kernels of ganglion cells measured in experiments on cats and monkeys. We have been able to fit these to the phenomenological difference-of-gaussians kernel for ganglion cells (Enroth-Cugell and Robson 1966). The fits are very good, with parameters that fall within the range that has been recorded. Another significant way in which the theory agrees with experiment is in the behavior of the kernels as S/N is decreased. In the theoretical profiles, one finds that the size of the center increases, the surround spreads out until it disappears, and finally the overall scale of the profile diminishes as the noise becomes very large. In experiment, these changes have been noted as the luminosity of the incoming light (and hence the signal to noise) is decreased and the retina adapts to the lower intensity (see, for example, Enroth-Cugell and Robson 1966). This active process, in the language of the current theory, is an adjustment of the optimal redundancy reducing processing to the S/N level.

In closing, we should mention that many of the techniques used to derive optimal encoding for the spatial properties of visual signals can be directly applied to temporal properties.
In that case, for low noise the theory would lead to a reduction of temporal correlations, which would have the effect of taking the time derivative, while in the high noise case, the theory would lead to integration. Both types of processing play a significant role in visual perception, and it will be interesting to see how well they can be accounted for by the theory. Another issue that should be addressed is the question of how biological organisms evolved over time to have optimal redundancy reducing neural systems. In our previous paper, we discovered an anti-Hebbian unsupervised learning routine which converges to the prediction configuration and a Hebbian routine which converges to the smoothing profiles. We expect that there exist reasonably local learning algorithms that converge to the optimal solutions described here.
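The difference-of-gaussians form used for the fits, and the way the one-dimensional profiles of Figure 3 are obtained (summing over one direction and rescaling so the central value matches the two-dimensional center), can be illustrated with a short sketch. The widths and amplitudes below are arbitrary illustrative values, not fitted parameters from the paper, and `dog_kernel` and `profile_1d` are hypothetical helper names.

```python
import math

def dog_kernel(half_width, sigma_c=1.0, sigma_s=3.0, a_c=1.0, a_s=0.2):
    """Difference-of-gaussians kernel on a square grid.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    coords = range(-half_width, half_width + 1)
    return [[a_c * math.exp(-(x * x + y * y) / (2 * sigma_c ** 2))
             - a_s * math.exp(-(x * x + y * y) / (2 * sigma_s ** 2))
             for x in coords] for y in coords]

def profile_1d(kernel):
    """Sum over one direction, then rescale so the central value
    equals the central value of the two-dimensional kernel."""
    n = len(kernel)
    summed = [sum(kernel[row][col] for row in range(n)) for col in range(n)]
    mid = n // 2
    scale = kernel[mid][mid] / summed[mid]
    return [scale * v for v in summed]

kernel = dog_kernel(10)
profile = profile_1d(kernel)
```

With these assumed parameters the profile has an excitatory center (`profile[10]`) flanked by a weaker inhibitory surround (for example `profile[15]`), which is the qualitative center-surround shape discussed above.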
Acknowledgments

Work supported by the National Science Foundation, Grant PHYS8620266.
References

Atick, J. J., and Redlich, A. N. 1989. Predicting the ganglion and simple cell receptive field organizations from information theory. Preprint no. IASSNS-HEP-89/55 and NYU-NN-89/1.
Atick, J. J., and Redlich, A. N. 1990. Quantitative tests of a theory of early visual processing: I. Spatial contrast sensitivity profiles. Preprint no. IASSNS-HEP-90/51.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. M.I.T. Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Bodewig, E. 1956. Matrix Calculus. North-Holland, Amsterdam.
Enroth-Cugell, C., and Robson, J. G. 1966. The contrast sensitivity of retinal ganglion cells of the cat. J. Physiol. 187, 517-552.
Linsker, R. 1986. Self-organization in a perceptual network. Computer (March), 105-117.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 1, pp. 186-194. Morgan Kaufmann, San Mateo.
Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer-Verlag, Berlin.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana.
Srinivasan, M. V., Laughlin, S. B., and Dubs, A. 1982. Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. London Ser. B 216, 427-459.
Uttley, A. M. 1979. Information Transmission in the Nervous System. Academic Press, London.
Received 9 February 90; accepted 10 June 90.
Communicated by Richard Durbin
Derivation of Linear Hebbian Equations from a Nonlinear Hebbian Model of Synaptic Plasticity

Kenneth D. Miller
Department of Physiology, University of California, San Francisco, CA 94143-0444 USA
A linear Hebbian equation for synaptic plasticity is derived from a more complex, nonlinear model by considering the initial development of the difference between two equivalent excitatory projections. This provides a justification for the use of such a simple equation to model activity-dependent neural development and plasticity, and allows analysis of the biological origins of the terms in the equation. Connections to previously published models are discussed.
Neural Computation 2, 321-333 (1990) © 1990 Massachusetts Institute of Technology

Recently, a number of authors (e.g., Linsker 1986; Miller et al. 1986, 1989) have studied linear equations modeling Hebbian or similar correlation-based mechanisms of synaptic plasticity, subject to nonlinear saturation conditions limiting the strengths of individual synapses to some bounded range. Such studies have intrinsic interest for understanding the dynamics of simple feedforward models. However, the biological rules for both neuronal activation and synaptic modification are likely to depend nonlinearly on neuronal activities and synaptic strengths in many ways. When are such simple equations likely to be useful as models of development and plasticity in biological systems?

One critical nonlinearity for biological modeling is rectification. Biologically, a synaptic strength cannot change its sign, because a given cell's synapses are either all excitatory or all inhibitory. Saturating or similar nonlinearities that bound the range of synaptic strengths may be ignored if one is concerned with the early development of a pattern of synaptic strengths, and if the initial distribution of synaptic strengths is well on the interior of the allowed region in weight space. However, if a model's outcome depends on a synaptic variable taking both positive and negative values, then the bound on synaptic strengths at zero must be considered.

Previous models make two proposals that avoid this rectification nonlinearity. One proposal is to study the difference between the strengths of two separate, initially equivalent excitatory projections innervating a single target structure (Miller et al. 1986, 1989; Miller 1989a). This
difference in strengths is a synaptic variable that may take both positive and negative values. An alternative proposal is to study the sum of the strengths of two input projections, one excitatory, one inhibitory, that are statistically indistinguishable from one another in their connectivities and activities (Linsker 1986).

The proposal to study the difference between the strengths of two equivalent excitatory projections is motivated by study of the visual system of higher mammals. Examples in that system include the projections from the lateral geniculate nucleus to the visual cortex of inputs serving the left and right eyes (Miller et al. 1989; reviewed in Miller and Stryker 1990) or of inputs with on-center and off-center receptive fields (Miller 1989a). Examples exist in many other systems (briefly reviewed in Miller 1990). Assuming that the difference between the two projections is initially small, the early development of the difference can be described by equations linearized about the uniform condition of complete equality of the two projections. This can allow linear equations to be used to study aspects of early development in the presence of more general nonlinearities.

This paper presents the derivation of previously studied simple, linear Hebbian equations, beginning from a nonlinear Hebbian model in the presence of two equivalent excitatory input projections. The outcome of this derivation is contrasted with that resulting from equivalent excitatory and inhibitory projections. Applications to other models are then discussed.

1 Assumptions
The derivation depends on the following assumptions:
A1  There are two modifiable input projections to a single output layer. The two input projections are equivalent in the following sense:
    • There is a topographic mapping that is identical for the two input layers: each of the two input projections represents the same topographic coordinates, and the two project in an overlapping, continuous manner to the output layer.
    • The statistics of neuronal activation are identical within each projection (N.B. the correlations between the two projections may be quite different from those within each projection);

A2  Synaptic modification occurs via a Hebb rule in which the roles of output cell activity and that of input activity are mathematically separable;

A3  The activity of an output cell depends (nonlinearly) only on the summed input to the cell.
In addition, the following assumptions are made for simplicity. For the most part, they can be relaxed in a straightforward manner, at the cost of more complicated equations:
A4  The Hebb rule and the output activation rule are taken to be instantaneous, ignoring time delays. [Instantaneous rules follow from more complicated rules in the limit in which input patterns are sustained for long times compared to dynamic relaxation times. This limit appears likely to be applicable to visual cortex, where geniculate inputs typically fire in bursts sustained over many tens or hundreds of milliseconds (Creutzfeldt and Ito 1968)];

A5  The statistics of neuronal activation are time invariant;

A6  There are lateral interconnections in the output layer that are time invariant;

A7  The input and output layers are two-dimensional and spatially homogeneous;

A8  The topographic mapping from input to output layers is linear and isotropic.
2 Notation
We let Roman letters ($x, y, z, \ldots$) label topographic location in the output layer, and Greek letters ($\alpha, \beta, \gamma, \ldots$) label topographic location in each of the input layers. We use the labels 1 and 2 to refer to the two input projections. We define the following functions:

1. $o(x, t)$: activity (e.g., firing rate, or membrane potential) of output cell at location $x$ at time $t$;

2. $i^1(\alpha, t)$, $i^2(\alpha, t)$: activity of input of type 1 or 2, respectively, from location $\alpha$ at time $t$;

3. $A(x - \alpha)$: synaptic density or "arbor" function, describing connectivity from the input layer to the output layer. This tells the number of synapses from an input with topographic location $\alpha$ onto the output cell with topographic location $x$. This is assumed time independent and independent of projection type;

4. $s^1_k(x, \alpha, t)$, $s^2_k(x, \alpha, t)$: strength of the $k$th synapse of type 1 or 2, respectively, from the input at $\alpha$ to the output cell at $x$ at time $t$. There are $A(x - \alpha)$ such synapses of each type;

5. $S^1(x, \alpha, t)$, $S^2(x, \alpha, t)$: total synaptic strength at time $t$ from the input of type 1 or 2, respectively, at location $\alpha$, to the output cell at $x$. $S^1(x, \alpha, t) = \sum_k s^1_k(x, \alpha, t)$ [and similarly for $S^2(x, \alpha, t)$];

6. $B(x - y)$: intracortical connectivity function, describing total (time-invariant) synaptic strength from the output cell at $y$ to the output cell at $x$. $B$ depends only on $x - y$ by assumption A7 of spatial homogeneity.

3 Derivation of Linear Hebbian Equations from a Nonlinear Hebbian Rule

The Hebbian equation for the development of a single type 1 synapse $s^1_k$ from $\alpha$ to $x$ can, by assumptions A2 and A4, be written

$$\frac{d s^1_k(x, \alpha, t)}{dt} = \lambda\, h_o[o(x, t)]\, h_i[i^1(\alpha, t)] - \epsilon - \gamma\, s^1_k(x, \alpha, t), \qquad \text{subject to } 0 \le s^1_k \le s_{\max} \tag{3.1}$$
We assume that $h_o$ is a differentiable function, but $h_o$ and $h_i$ are otherwise arbitrary functions incorporating nonlinearities in the plasticity rule. $\lambda$, $\epsilon$, and $\gamma$ are constants. Summing over all type 1 synapses from $\alpha$ to $x$ yields

$$\frac{dS^1(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h_o[o(x, t)]\, h_i[i^1(\alpha, t)] - \epsilon A(x - \alpha) - \gamma S^1(x, \alpha, t), \qquad \text{subject to } 0 \le S^1(x, \alpha, t) \le s_{\max} A(x - \alpha) \tag{3.2}$$
(and similarly for $S^2$). There are small differences between equations 3.1 and 3.2 when some but not all synapses $s_k$ have reached saturation. We will be concerned with the early development of a pattern, before synapses saturate, and so ignore these differences. We will omit explicit mention of the saturation limits hereafter.

Define the direct input to a cell as $\theta(x, t) \equiv \sum_\beta \{ S^1(x, \beta, t) f_i[i^1(\beta, t)] + S^2(x, \beta, t) f_i[i^2(\beta, t)] \}$. The nonlinear activation rule is, by assumptions A3 and A4,

$$o(x, t) = g\Big( \theta(x, t) + \sum_y B(x - y) f_o[o(y, t)] \Big) \tag{3.3}$$

$f_o$ and $g$ are assumed to be differentiable functions, but they and $f_i$ are otherwise arbitrary functions incorporating the nonlinearities in the activation rules. We make the following nontrivial assumption:

A9  For each input vector $\theta(t)$, equation 3.3 defines a unique output vector $o(t)$.
Biologically, this is the assumption that the inputs determine the state of the outputs. Mathematically, this can be motivated by studies of the Hartline-Ratliff equation (Hadeler and Kuhn 1987).¹ With this assumption, $o(x, t)$ can be regarded as a function of the variables $\theta(y, t)$ for varying $y$.

We now transform from the variables $S^1$ and $S^2$ to sum and difference variables. Define the following:
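$$\begin{aligned}
S^S &\equiv S^1 + S^2, & S^D &\equiv S^1 - S^2, \\
f^S(\alpha, t) &\equiv f_i[i^1(\alpha, t)] + f_i[i^2(\alpha, t)], & f^D(\alpha, t) &\equiv f_i[i^1(\alpha, t)] - f_i[i^2(\alpha, t)], \\
h^D(\alpha, t) &\equiv h_i[i^1(\alpha, t)] - h_i[i^2(\alpha, t)], \\
\theta^S(x, t) &\equiv \tfrac{1}{2} \sum_\beta S^S(x, \beta, t) f^S(\beta, t), & \theta^D(x, t) &\equiv \tfrac{1}{2} \sum_\beta S^D(x, \beta, t) f^D(\beta, t)
\end{aligned} \tag{3.4}$$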
Note that 8(z, t ) = Bs(z,t ) BD(z,t ) . The Hebb rule for the difference, S D = S' - S2 is, from equation 3.2, dSD(zl
dt
t,
= XA(z -
a)ho[o(z,t ) ]hD(a,t ) - ySD(z,Q, t )
(3.5)
S D is a synaptic variable that can take on both positive and negative values, and whose initial values are near zero. We will develop a linear equation for S D by linearizing equation 3.5 about the uniform condition S D = 0. We will accomplish this by expanding equation 3.5 about O D = 0 to first order in OD. Let os(z, t ) be the solution of
Then, letting a prime signify the derivative of a function, dSD(z,a, t )
dt
=
XA(z - a ) h y ( a ,t ) {ho[o'(x, t ) ]
-
y s D ( z Q, , t ) + o [(eD)*]
(3.7)
¹The Hartline-Ratliff equation is equation 3.3 for $g(x) = \{x,\ x \ge 0;\ 0,\ x < 0\}$ and $f_o(x) = x$. That equation has a unique output for every input, for symmetric $\mathbf{B}$, iff $\mathbf{1} - \mathbf{B}$ is positive definite; a more general condition for $\mathbf{B}$ nonsymmetric can also be derived (Hadeler and Kuhn 1987).
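Letting $g^{\prime S}(x, t) \equiv g'\{ \theta^S(x, t) + \sum_y B(x - y) f_o[o^S(y, t)] \}$, the derivative of $o(x, t)$ is

$$\left. \frac{\partial o(x, t)}{\partial \theta(y, t)} \right|_{\theta^S} = g^{\prime S}(x, t) \left[ (\mathbf{1} - \mathbf{B})^{-1} \right]_{xy} \tag{3.8}$$

where $\mathbf{1}$ is the identity matrix, $\mathbf{B}$ is the matrix with elements $\mathbf{B}_{xy} = B(x - y) f_o'[o^S(y, t)]\, g^{\prime S}(y, t)$, and $[\ldots]_{xy}$ means the $xy$ element of the matrix in brackets. Letting

$$I(x, y, t) \equiv \left[ (\mathbf{1} - \mathbf{B})^{-1} \right]_{xy} = \left[ \mathbf{1} + \mathbf{B} + \mathbf{B}^2 + \ldots \right]_{xy} \tag{3.9}$$

$$M(x, t) \equiv h_o'[o^S(x, t)]\, g^{\prime S}(x, t) \tag{3.10}$$

$$C^D(\alpha, \beta, t) \equiv \tfrac{1}{2} h^D(\alpha, t) f^D(\beta, t) \tag{3.11}$$

we find that equation 3.7 becomes, to first order in $\theta^D$,

$$\frac{dS^D(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h_o[o^S(x, t)]\, h^D(\alpha, t) - \gamma S^D(x, \alpha, t) + \lambda A(x - \alpha) M(x, t) \sum_{y, \beta} I(x, y, t)\, C^D(\alpha, \beta, t)\, S^D(y, \beta, t) \tag{3.12}$$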
This equation can be interpreted intuitively. The first term is the Hebbian term of equation 3.5 in which the output cell's activity has been replaced by the activity it would have if $\theta^D = 0$, that is, if $S^D = 0$. The last term is the Hebbian term with the output cell's activity replaced by the first order change in that activity due to the fact that $\theta^D \neq 0$. In this term, $M(x, t)$ measures the degree to which, near $\theta^D = 0$, the activity of the output cell at $x$ can be significantly modified, for purposes of the Hebb rule, by changes in the total input it receives. $I(x, y, t)$ measures the change in the total input to the cell at $x$ due to changes in the direct input to the cell at $y$. $C^D(\alpha, \beta, t) S^D(y, \beta, t)$ incorporates both the change in the direct input to the cell at $y$ due to the fact that $\theta^D \neq 0$, and the difference in the activities of the inputs from $\alpha$ that are being modified.
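The relation between the activation rule and the series $\mathbf{1} + \mathbf{B} + \mathbf{B}^2 + \ldots$ can be checked numerically on a toy network. The sketch below is only an illustration under stated assumptions: three output cells, weak uniform lateral weights (so that simple fixed-point iteration and the series both converge), $g = \tanh$, and $f_o$ the identity. None of these choices comes from the paper, and all function names are hypothetical. It compares a finite-difference estimate of $\partial o(x)/\partial \theta(y)$ with $g'(x)$ times the $xy$ element of the summed series.

```python
import math

N = 3  # number of output cells (toy size, assumed)
# Assumed weak lateral weights B(x - y); small enough that the iterations converge.
B = [[0.0 if x == y else 0.2 for y in range(N)] for x in range(N)]

def g(u):
    return math.tanh(u)            # assumed differentiable output nonlinearity

def g_prime(u):
    return 1.0 - math.tanh(u) ** 2

def f_o(o):
    return o                       # assumed identity, so f_o' = 1

def total_input(theta, o):
    return [theta[x] + sum(B[x][y] * f_o(o[y]) for y in range(N)) for x in range(N)]

def solve_output(theta, iters=200):
    """Fixed-point iteration for o = g(theta + sum_y B f_o(o))."""
    o = [0.0] * N
    for _ in range(iters):
        o = [g(u) for u in total_input(theta, o)]
    return o

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(N)) for j in range(N)] for i in range(N)]

def neumann_sum(Bt, terms=80):
    """1 + Bt + Bt^2 + ..., truncated after `terms` powers."""
    total = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    power = [row[:] for row in total]
    for _ in range(terms):
        power = matmul(power, Bt)
        total = [[total[i][j] + power[i][j] for j in range(N)] for i in range(N)]
    return total

theta = [0.3, -0.1, 0.5]           # arbitrary direct inputs
o_s = solve_output(theta)
gp = [g_prime(u) for u in total_input(theta, o_s)]
B_tilde = [[B[x][y] * gp[y] for y in range(N)] for x in range(N)]  # f_o' = 1 here
I = neumann_sum(B_tilde)
predicted = [[gp[x] * I[x][y] for y in range(N)] for x in range(N)]

# Finite-difference estimate of d o(x) / d theta(y).
eps = 1e-6
numeric = [[0.0] * N for _ in range(N)]
for y in range(N):
    bumped = list(theta)
    bumped[y] += eps
    o_b = solve_output(bumped)
    for x in range(N):
        numeric[x][y] = (o_b[x] - o_s[x]) / eps

max_err = max(abs(predicted[x][y] - numeric[x][y]) for x in range(N) for y in range(N))
```

In this toy setting `max_err` is small, of the order of the finite-difference step. With stronger lateral weights neither the fixed-point iteration nor the truncated series is guaranteed to converge, so the sketch should not be read as a general solver.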
4 Averaging
Given some statistical distribution of input patterns $i(\alpha, t)$, equation 3.12 is a stochastic differential equation. To transform it to a deterministic equation, we average it over input activity patterns. The result is an equation for the mean value $\langle S^D \rangle$, averaged over input patterns. The right-hand side of the equation consists of an infinite series of terms, corresponding to the various cumulants of the stochastic operators of equation 3.12 (Keller 1977; Miller 1989b). However, when $\lambda$ and $\gamma$ are sufficiently small that $S^D$ can be considered constant over a period in which all input activity patterns are sampled, only the first term is significant. We restrict attention to that term. After averaging, the first term on the right side of equation 3.12 yields zero, by equality of the two input projections. The lowest order term resulting from averaging of the last term is
$$\lambda A(x - \alpha) \sum_{y, \beta} \langle M(x, t)\, I(x, y, t)\, C^D(\alpha, \beta, t) \rangle\, S^D(y, \beta, t)$$

where we retain the notation $S^D$ for $\langle S^D \rangle$. We now assume:

A10  We can approximate $\langle M(x) I(x, y) C^D(\alpha, \beta) \rangle$ by $\langle M(x) I(x, y) \rangle \langle C^D(\alpha, \beta) \rangle$.

Assumption A10 will be true if the sum of the two eyes' inputs is statistically independent of the difference between the two eyes' inputs. By equivalence of the two input projections the sum and difference are independent at the level of two-point interactions: $\langle S^S S^D \rangle = \langle S^1 S^1 \rangle - \langle S^2 S^2 \rangle = 0 = \langle S^S \rangle \langle S^D \rangle$. By assumption A7 of spatial homogeneity, $\langle M(x) I(x, y) \rangle$ can depend only on $x - y$, while $\langle C^D(\alpha, \beta) \rangle$ can depend only on $\alpha - \beta$. With these assumptions, then, the linearized version of this nonlinear model becomes
$$\frac{dS^D(x, \alpha, t)}{dt} = \lambda A(x - \alpha) \sum_{y, \beta} I(x - y)\, C^D(\alpha - \beta)\, S^D(y, \beta, t) - \gamma S^D(x, \alpha, t) \tag{4.1}$$

where

$$I(x - y) = \langle M(x, t)\, I(x, y, t) \rangle \tag{4.2}$$

and

$$C^D(\alpha - \beta) = \langle C^D(\alpha, \beta, t) \rangle \tag{4.3}$$

Note that the nonlinear functions referring to the output cell, $h_o$, $f_o$, and $g$, enter into equation 4.1 only in terms of their derivatives. This reflects the fact that the base level of output activity, $o^S$, makes no contribution to the development of the difference $S^D$ because the first term of equation 3.12 averages to 0. Only the alterations in output activity induced by $\theta^D$ contribute to the development of $S^D$.

We have not yet achieved a linear equation for development. $I(x - y)$ depends on $S^S$ through the derivatives of $h_o$, $f_o$, and $g$. Because the equation for $S^S$ remains nonlinear, equation 4.1 is actually part of a coupled nonlinear system. Intuitively, the sum $S^S$ is primarily responsible for the activation of output cells when $S^D$ is small. $S^S$ therefore serves to "gate" the transmission of influence across the output layer: the cells at $x$ and at $y$ must both be activated within their dynamic range, so that small changes in their inputs cause changes in their responses or in their contribution to Hebbian plasticity, in order for $I(x - y)$ to be nonzero. To render the equation linear, we must assume

A11  The shape of $I(x - y)$ does not vary significantly during the early, linear development of $S^D$.
Changes in the amplitude of $I(x - y)$ will alter only the speed of development, not its outcome, and can be ignored. Assumption A11 can be loosely motivated by noting that $S^S$ is approximately spatially uniform, so that $B(x - y)$ should be the primary source of spatial structure in $I(x - y)$,² and that cortical development may act to keep cortical cells operating within their dynamic range.

5 Comparison to the Sum of an Excitatory and an Inhibitory Projection
An alternative proposal to that developed here is to study the sum of the strengths of two indistinguishable input projections, one excitatory and one inhibitory (Linsker 1986). This case is mathematically distinct from the sum of two equivalent excitatory projections, because the Hebb rule does not change sign for the inhibitory population relative to the excitatory population. That is, in response to correlated activity of the pre- and postsynaptic cells, inhibitory synapses become weaker, not stronger, by a Hebbian rule.

²The correlation structure of the summed inputs can also contribute to $I(x - y)$, since cortical cells with separation $x - y$ must be coactivated for $I(x - y)$ to be nonzero. Arguments can be made that the relevant lengths in $I(x - y)$ appear smaller than an arbor diameter (e.g., see Miller 1990; Miller and Stryker 1990), and thus are on a scale over which cortical cells receive coactivated inputs regardless of input correlation structure.

To understand the significance of this distinction, let $S^2$ now represent an inhibitory projection, so that $S^2 \le 0$. Then the variable that is initially small, and in which we expand in order to linearize, is the synaptic sum $S^S$, rather than the difference $S^D$. Define $o^D$ analogously to the definition of $o^S$ in equation 3.6, with $\theta^D$ in place of $\theta^S$. Let $h^S(\alpha, t) \equiv h_i[i^1(\alpha, t)] + h_i[i^2(\alpha, t)]$. Then one finds in place of equation 3.12

$$\frac{dS^S(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h_o[o^D(x, t)]\, h^S(\alpha, t) - 2 \epsilon A(x - \alpha) + \lambda A(x - \alpha) M^S(x, t) \sum_{y, \beta} I^S(x, y, t)\, C^S(\alpha, \beta, t)\, S^S(y, \beta, t) - \gamma S^S(x, \alpha, t) \tag{5.1}$$
where $C^S = \tfrac{1}{2} h^S f^S$, and $M^S$ and $I^S$ are defined like $M$ and $I$ except that derivatives are taken at $\theta^D$ and $o^D$ rather than at $\theta^S$ and $o^S$. Unlike equation 3.12, the first term of equation 5.1 does not disappear after averaging. This means that the development of $S^S$ depends upon a Hebbian coupling between the summed input activities, and the output cell's activity in response to $S^D$ (the activity the cell would have if $S^S = 0$). Thus, direct Hebbian couplings to both $S^D$ and $S^S$ drive the initial development of $S^S$, rendering it difficult to describe the dynamics by a simple linear equation like equation 4.1.

In Linsker (1986), two assumptions were made that together lead to the disappearance of this first term. First, the output functions $h_o$, $f_o$, and $g$ were taken to be linear. This causes the first term to be proportional to $C^D$. To present the second assumption, we define correlation functions $C^{11}$, $C^{12}$, $C^{21}$, $C^{22}$ among and between the two input projections by $C^{jk}(\alpha - \beta) = \langle h_i[i^j(\alpha, t)]\, f_i[i^k(\beta, t)] \rangle$. By equivalence of the two projections, $C^{11} = C^{22}$ and $C^{12} = C^{21}$. Then $C^D = C^{11} - C^{12}$. The second assumption was that correlations between the two projections are identical to those within the two projections; that is, $C^{12} = C^{11}$. This means that $C^D = 0$, and so the first term disappears. This second assumption more generally ensures that $S^D$ does not change in time, prior to synaptic saturation.

Equation 5.1 also differs from equation 3.12 in implicitly containing two additional parameters that Linsker named $k_1$ and $k_2$. $k_1$ is the decay parameter $\epsilon$. The parameter $k_2$ arises as follows. One can reexpress the "correlation functions" $C^{jk}$ in terms of "covariance functions" $Q^{jk}$: $C^{jk} = Q^{jk} + k_2$, where

$$Q^{jk}(\alpha - \beta) = \Big\langle \big( h_i[i^j(\alpha, t)] - \langle h_i[i^j(\alpha, t)] \rangle \big) \big( f_i[i^k(\beta, t)] - \langle f_i[i^k(\beta, t)] \rangle \big) \Big\rangle$$

and

$$k_2 = \langle h_i[i^j(\alpha, t)] \rangle \langle f_i[i^k(\beta, t)] \rangle$$

$k_2$ is independent of the choice of $j$ and $k$. The $Q$s have the advantage that $\lim_{(\alpha - \beta) \to \infty} Q^{jk}(\alpha - \beta) = 0$; if $f_i$ and $h_i$ are linear, the $Q$s are true covariance functions. The correlation function relevant to the sum of an inhibitory and an excitatory projection is $C^S = C^{11} + C^{12} = Q^{11} + Q^{12} + 2 k_2$.
In contrast, the correlation function relevant to the difference between two excitatory projections is $C^D = C^{11} - C^{12} = Q^{11} - Q^{12}$, which has no $k_2$ dependence. Thus, the parameters $k_1$ and $k_2$ do not arise in considering the difference between two excitatory input projections, because they are identical for each input projection and thus disappear from the equation for the difference; whereas these parameters do arise in considering the sum of an excitatory and an inhibitory projection. In MacKay and Miller (1990), it was shown that these parameters can significantly alter the dynamics, and play crucial roles in many of the results of Linsker (1986).

In summary, the proposal to study equivalent excitatory and inhibitory projections does not robustly yield a linear equation in the presence of nonlinearities in the output functions $h_o$, $f_o$, and $g$. Even in the absence of such additional nonlinearities, it can lead to different dynamic outcomes than the proposal studied here. It also is biologically problematic. It would not apply straightforwardly to such feedforward projections as the retinogeniculate and geniculocortical projections in the mammalian visual system, which are exclusively excitatory. Where both inhibitory and excitatory populations do exist, the two are not likely to be equivalent. For example, inhibitory neurons are often interneurons that, when active, inhibit nearby excitatory neurons, potentially rendering the three correlation structures $C^{11}$, $C^{12}$, and $C^{22}$ quite distinct; connectivity of such interneurons is also distinct from that of nearby excitatory cells (Toyama et al. 1981; Singer 1977). Similarly, while there is extensive evidence that excitatory synapses onto excitatory cells may be modified in a Hebbian manner (Nicoll et al. 1988), current evidence suggests that there may be little modification of inhibitory synapses, or of excitatory synapses onto the aspinous inhibitory interneurons, under the same stimulus paradigms (Abraham et al. 1987; Griffith et al. 1986; Singer 1977).
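The different roles of $k_2$ in the sum and difference cases can be illustrated numerically. The sketch below is a toy example under stated assumptions: $h_i$ and $f_i$ are taken to be the identity (so the $Q$s are true covariances), the activity samples are made-up numbers with zero-mean fluctuations about a common mean rate, and the function name `correlations` is hypothetical. Raising the mean rate changes $k_2$ and hence $C^S$, but leaves $C^D$ unchanged.

```python
def mean(values):
    return sum(values) / len(values)

def correlations(mean_rate):
    """Sample estimates of C^D, C^S, Q^11, Q^12 and k2 for one pair (alpha, beta).

    Assumes h_i and f_i are the identity; the fluctuation values are invented
    for illustration and sum to zero in each list.
    """
    d1_alpha = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]   # projection 1 at alpha
    d1_beta = [0.8, -0.9, 1.5, -1.6, 0.1, 0.1]     # projection 1 at beta
    d2_beta = [-0.2, 0.3, 0.4, -0.5, 0.6, -0.6]    # projection 2 at beta
    i1_a = [mean_rate + d for d in d1_alpha]
    i1_b = [mean_rate + d for d in d1_beta]
    i2_b = [mean_rate + d for d in d2_beta]

    c11 = mean([a * b for a, b in zip(i1_a, i1_b)])
    c12 = mean([a * b for a, b in zip(i1_a, i2_b)])
    q11 = mean([(a - mean(i1_a)) * (b - mean(i1_b)) for a, b in zip(i1_a, i1_b)])
    q12 = mean([(a - mean(i1_a)) * (b - mean(i2_b)) for a, b in zip(i1_a, i2_b)])
    k2 = mean(i1_a) * mean(i1_b)
    return {"C_D": c11 - c12, "C_S": c11 + c12, "Q11": q11, "Q12": q12, "k2": k2}

low = correlations(3.0)
high = correlations(10.0)
```

Here `low["C_D"]` and `high["C_D"]` agree to rounding error, while `C_S` differs between the two runs by twice the change in `k2`.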
6 Connections to Previous Models
Equation 4.1 is that studied in Miller et al. (1986, 1989) and Miller (1989a). It is also formally equivalent to that studied in Linsker (1986) except for the absence of the two parameters $k_1$ and $k_2$.³ The current approach allows the analysis of other previous models. For example, in the model of Willshaw and von der Malsburg (1976), $g$ was taken to be a linear threshold function [$g(x) = x - \delta$ for $x > \delta$, where $\delta$ is a constant threshold; $g(x) = 0$ otherwise]; this can be approximated by a differentiable function. The functions $h_o$ and $f_o$ were taken to be the identity, while $h_i$ and $f_i$ were taken to be step functions: 1 if the input was active, 0 if it was not. A time-dependent activation rule was used, but input activations were always sustained until a steady state was reached so that this rule is equivalent to equation 3.3. These rules were applied only to a single input projection, but the present analysis allows examination of the case of two input projections. From equation 4.2, it can be seen that choosing $g$ to be a linear threshold function has two intuitively obvious effects: (1) on the average, patterns for which $\theta^S$ would fail to bring the output cell at $x$ above threshold do not cause any modification of $S^D$ onto that cell; (2) such patterns also make no average contribution to $I(y - x)$ for all $y$, that is, if the cell at $x$ is not above threshold it cannot influence plasticity on the cell at $y$. More generally, given an ensemble of input patterns and the initial distribution of $S^S$, the functions $I(x - y)$ and $C^D(\alpha - \beta)$ could be calculated explicitly from equations 4.2 and 4.3, respectively. Similarly, Hopfield (1984) proposed a neuronal activation rule in which $f_i$ and $f_o$ are taken to be sigmoidal functions and $g$ is the identity, while many current models (e.g., Rumelhart et al. 1986) take $f_i$ and $f_o$ to be the identity, but take $g$ to be sigmoidal. Again, such activation rules can be analyzed within the current framework.

³Also, lateral interactions in the output layer were not introduced until the final layer in Linsker (1986). They were then introduced perturbatively, so that $I$ was approximated by $1 + B$. $B$ was referred to as $f$ in that paper.

7 Conclusions
It is intuitively appealing to think that activity-dependent neural development may be described in terms of functions A, I , and C that describe, respectively, connectivity from input to output layer (”arbors”), intralaminar connectivity within the output layer, and correlations in activity within the input layer. I have shown that formulation of linear equations in terms of such functions can be sensible for modeling aspects of early neural development in the presence of nonlinearities in the rules governing cortical activation and Hebbian plasticity. The functions I and C can be expressed in terms of the ensemble of input activities and the functions describing cortical activation and plasticity. This gives a more general relevance to results obtained elsewhere characterizing the outcome of development under equation 4.1 in terms of these functions (Miller et al. 1989; Miller 1989a; MacKay and Miller 1990). The current formulation is of course extremely simplified. Notable simplifications include the lack of plasticity in intralaminar connections in the output layer, the instantaneous nature of the equations, the assumption of spatial homogeneity, and, more generally, the lack of any attempt at biophysical realism. The derivation requires several additional assumptions whose validities are difficult to evaluate. The current effort provides a unified framework for analyzing a large class of previous models. It is encouraging that the resulting linear model is sufficient to explain many features of cortical development (Miller et al. 1989; Miller 1989a); it will be of interest, as more complex models are formulated,
to see the degree to which they force changes in the basic framework analyzed here.
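Equation 4.1 is straightforward to explore numerically. The following is a minimal sketch, not the simulation method of the cited papers: it assumes a one-dimensional ring of 8 positions for both layers, a binary arbor function, Gaussian shapes for $I$ and $C^D$, no decay ($\gamma = 0$), and forward-Euler time stepping. All names and parameter values are hypothetical. It illustrates the remark made in connection with assumption A11 that, when decay is neglected, rescaling the amplitude of $I$ changes only the speed of development: doubling $I$ while halving the time step gives the same trajectory.

```python
import math
import random

N = 8  # ring size for both input and output layers (assumed)

def ring_dist(a, b):
    d = abs(a - b) % N
    return min(d, N - d)

def arbor(x, alpha, radius=2):
    """Assumed binary arbor function A(x - alpha)."""
    return 1.0 if ring_dist(x, alpha) <= radius else 0.0

def gauss(d, sigma):
    return math.exp(-d * d / (2 * sigma * sigma))

def simulate(i_scale, dt, steps, lam=1.0, gamma=0.0, seed=0):
    """Forward-Euler integration of equation 4.1 on a ring (toy setting)."""
    rng = random.Random(seed)
    # Small initial difference S^D, confined to the arbor.
    S = [[rng.uniform(-0.01, 0.01) * arbor(x, a) for a in range(N)] for x in range(N)]
    I = [[i_scale * gauss(ring_dist(x, y), 1.0) for y in range(N)] for x in range(N)]
    C = [[gauss(ring_dist(a, b), 1.5) for b in range(N)] for a in range(N)]
    for _ in range(steps):
        new_S = []
        for x in range(N):
            row = []
            for a in range(N):
                drive = sum(I[x][y] * C[a][b] * S[y][b]
                            for y in range(N) for b in range(N))
                row.append(S[x][a] + dt * (lam * arbor(x, a) * drive - gamma * S[x][a]))
            new_S.append(row)
        S = new_S
    return S

slow = simulate(i_scale=1.0, dt=0.01, steps=40)
fast = simulate(i_scale=2.0, dt=0.005, steps=40)
max_diff = max(abs(slow[x][a] - fast[x][a]) for x in range(N) for a in range(N))
```

With a nonzero decay term the two runs would no longer coincide exactly, since $\gamma$ is not rescaled along with $I$.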
Acknowledgments

I thank J. B. Keller for suggesting to me long ago that the ocular dominance problem should be approached by studying the early development of an ocular dominance pattern and linearizing about the nearly uniform initial condition. I thank M. P. Stryker for supporting this work, which was performed in his laboratory. I am supported by an N.E.I. Fellowship and by a Human Frontiers Science Program Grant to M. P. Stryker (T. Tsumoto, Coordinator). I thank M. P. Stryker, D. J. C. MacKay, and especially the action editor for helpful comments.
References Abraham, W. C., Gustafsson, B., and Wigstrom, H. 1987. Long-term potentiation involves enhanced synaptic excitation relative to synaptic inhibition in guinea-pig hippocampus. J. Physiol. (London) 394,367-380. Creutzfeldt, O., and Ito, M. 1968. Functional synaptic organization of primary visual cortex neurones in the cat. Exp. Brain Res. 6, 324-352. Grifith, W. H., Brown, T. H., and Johnston, D. 1986. Voltage-clamp analysis of synaptic inhibition during long-term potentiation in hippocampus. I. Neurophys. 55, 767-775. Hadeler, K. P., and Kuhn, D. 1987. Stationary states of the Hartline-Ratliff model. Biol. Cybern. 56, 411417. Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81,3088-3092. Keller, J. B. 1977. Effective behavior of heterogeneous media. In Statistical Mechanics and Statistical Methods in Theory and Application, U. Landman, ed., pp. 631-644. Plenum Press, New York. Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783. MacKay, D. J. C., and Miller, K. D. Analysis of Linsker’s simulations of Hebbian rules. Neural Comp. 2,169-182. Miller, K. D. 1989a. Orientation-selective cells can emerge from a Hebbian mechanism through interactions between ON- and OFF-center inputs. SOC. Neurosci. Abst. 15, 794. Miller, K. D. 198913. Correlation-based mechanisms in visual cortex: Theoretical and experimental sfudies. Ph.D. Thesis, Stanford University Medical School (University Microfilms, AM Arbor). Miller, K. D. 1990. Correlation-based mechanisms of neural development. In Neuroscience and Connectionist Theory, M.A. Gluck and D.E. Rumelhart, eds., pp. 267-353. Lawrence Erlbaum, Hillsdale, NJ.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1986. Models for the formation of ocular dominance columns solved by linear stability analysis. Soc. Neurosci. Abst. 12, 1373.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Miller, K. D., and Stryker, M. P. 1990. Ocular dominance column formation: Mechanisms and models. In Connectionist Modeling and Brain Function: The Developing Interface, S. J. Hanson and C. R. Olson, eds., pp. 255-350. MIT Press/Bradford Books, Cambridge, MA.
Nicoll, R. A., Kauer, J. A., and Malenka, R. C. 1988. The current excitement in long-term potentiation. Neuron 1, 97-103.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Singer, W. 1977. Effects of monocular deprivation on excitatory and inhibitory pathways in cat striate cortex. Exp. Brain Res. 30, 25-41.
Toyama, K., Kimura, M., and Tanaka, K. 1981. Organization of cat visual cortex as investigated by cross-correlation techniques. J. Neurophysiol. 46, 202-213.
Willshaw, D. J., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London Ser. B 194, 431-445.
Received 23 January 90; accepted 9 June 90.
Communicated by John Pearson
Spontaneous Development of Modularity in Simple Cortical Models
Alex Chernjavsky
Molecular and Cellular Physiology, Beckman Center, Stanford University, Stanford, CA 94305-5426 USA
John Moody*
Yale Computer Science, P.O. Box 2258 Yale Station, New Haven, CT 06520 USA
*Please address correspondence to John Moody.
Neural Computation 2, 334-354 (1990) © 1990 Massachusetts Institute of Technology

The existence of modular structures in the organization of nervous systems (e.g., cortical columns, patches of neostriatum, and olfactory glomeruli) is well known. However, the detailed dynamic mechanisms by which such structures develop remain a mystery. We propose a mechanism for the formation of modular structures that utilizes a combination of intrinsic network dynamics and Hebbian learning. Specifically, we show that under certain conditions, layered networks can support spontaneous localized activity patterns, which we call collective excitations, even in the absence of localized or spatially correlated afferent stimulation. These collective excitations can then induce the formation of modular structures in both the afferent and lateral connections via a Hebbian learning mechanism. The networks we consider are spatially homogeneous before learning, but the spontaneous emergence of localized collective excitations and the consequent development of modules in the connection patterns break translational symmetry. The essential conditions required to support collective excitations include internal units with sufficiently high gains and certain patterns of lateral connectivity. Our proposed mechanism is likely to play a role in understanding more complex (and more biologically realistic) systems.

1 Modularity in Nervous Systems
Modular organization exists throughout the nervous system on many different spatial scales. On the very small scale, synapses may be clustered into functional groups on dendrites. On the very large scale, the brain as a whole is composed of many anatomically and functionally distinct regions. At intermediate scales, those of networks and maps, the cortex exhibits columnar structures (see Mountcastle 1957, 1978).

Many modality-specific variations of intermediate scale modular organization are known. Examples in neocortex include orientation selective columns, ocular dominance columns, color sensitive blobs, somatosensory barrels, and the frontal eye fields of association cortex. Examples in other brain regions include the patches of neostriatum and the olfactory glomeruli. Other modular structures have been hypothesized, such as cell assemblies (see Braitenberg 1978), colonies in motor cortex, minicolumns, and neuronal groups.

These modular structures can be divided into two distinct classes, functional modules and anatomical modules. Functional modules are structures whose presence is inferred strictly on the basis of physiology. Their existence is most likely due to modular organization in the patterns of synaptic efficacies; they do not exhibit corresponding patterns in the distribution of afferent or lateral connections or of cell bodies. (Patterns of synaptic efficacy are not anatomically detectable with present technology.) Functional modules include the orientation selective columns, the frontal eye fields of association cortex, and the color sensitive blobs (convincing anatomical correlates of the cytochrome oxidase activity patterns have not yet been found). Cell assemblies, the colonies of motor cortex, and neuronal groups (should these structures exist) are also candidates for functional modules.

Anatomical modules are structures which, in addition to having a functional role, are also detectable anatomically by a clear organization of the afferent connections, the dendritic arbors, or the distribution of cell bodies and neuropil. Anatomical modules include the ocular dominance columns, somatosensory barrels, patches of neostriatum, and olfactory glomeruli.

A complete biophysical picture of the development of modular structures is still unavailable.
However, two general statements about the developmental process can be made. First, the development of modules typically occurs by the differentiation of an initially homogeneous structure. This fact has been demonstrated convincingly for certain anatomical modular structures such as the somatosensory barrels (see Rice et al. 1985 and the review by Kaas et al. 1983), the ocular dominance columns (see Stryker 1986 for a review) and the patches of neostriatum (Goldman-Rakic 1981). We conjecture that functional modules also develop from an initially homogeneous structure. Second, it is well established that the form of afferent electrical activity is crucial for the development of certain modular structures in afferent connections. Removal or modification of afferent sensory stimulation causes abnormal development in several well-known systems. Examples include the somatosensory barrels (see Kaas et al. 1983) and the ocular dominance columns (see Stryker 1986). These findings, along with others
(see Kalil 1989), support the conjecture that an activity-dependent Hebb-like mechanism is operative in the developmental process. It should be noted that much of the existing evidence for the development and form of modular organization concerns only the afferent sensory projection patterns in sensory areas. The development and form of modular structures in the patterns of lateral intrinsic connections of cortex (both within and between modules) have received less attention from experimenters, but are of great interest.

Previous attempts to model modular development (e.g., Pearson et al. 1987; Miller et al. 1989) have focused on the idea that spatially correlated sensory stimuli drive the formation of modular structures in sensory areas. While spatially correlated afferent stimulation will certainly encourage modular development, we believe that additional factors intrinsic to the network architecture and dynamics must be present to ensure that stable modules of a characteristic size will form.

The importance of intrinsic factors in this development is emphasized when one observes two important facts. First, connections from thalamic (sensory) afferents account for only about 0.1% of all synapses of neocortex (Braitenberg 1978) and on the order of 10% of the synapses in layer IV. Local intrinsic connections and cortico-cortical connections account for the remaining 90% (layer IV) to 99.9% (overall) of neocortical synapses. Thus, the role of afferent sensory activity has most likely been overemphasized in previous models of activity-dependent modular development. Second, columnar structures throughout the neocortex all have a roughly uniform size of a few hundred microns. This uniformity occurs in spite of the fact that the various sensory modalities have naturally different length scales of correlated activity in their afferent stimulation patterns.
Thus, while correlated afferent stimulation probably influences the development of modules, the effects of internal network dynamics must contribute crucially to the observed developmental process. Specifically, the length scale on which modules form must be determined by factors intrinsic to the structure of cortex. These observations provide the motivation for our operating hypothesis below.

2 Operating Hypothesis and Modeling Approach
Our hypothesis in this work is that localized activity patterns in an initially homogeneous layer of cells induce the development of modular structure within the layer via an activity-dependent Hebb-like mechanism. We further hypothesize that the emergence of localized activity patterns on a specific spatial scale within a layer may be due to the properties of the intrinsic network dynamics alone and does not necessarily depend on the system receiving localized patterns of afferent activity.
Finally, we hypothesize that a single mechanism can drive the formation of modules in both afferent and lateral connections. Our investigation therefore has two parts. First, we show that localized patterns of activity on a preferred spatial scale, which we call collective excitations, spontaneously emerge in homogeneous networks with appropriate lateral connectivity and cellular response properties when driven with arbitrary stimulus (see Section 3 and Moody 1990). Second, we show that these collective excitations induce the formation of modular structures in both the afferent and lateral connectivity patterns when coupled to a Hebbian learning mechanism (see Section 4). [In Sections 5 and 6, we provide a discussion of our results and a comparison to related work.] The emergence of collective excitations at a preferred spatial scale in a homogeneous network breaks translational symmetry and is an example of spontaneous symmetry breaking. The Hebbian learning freezes the modular structure into the connection patterns. The time scale of collective excitations is short, while the Hebbian learning process occurs over a longer time scale. The spontaneous symmetry breaking mechanism is similar to that which drives pattern formation in reaction-diffusion systems (Turing 1952; Meinhardt 1982). Reaction-diffusion models have been applied to pattern formation in both biological and physical systems. One of the best known applications is to the development of zebra stripes and leopard spots (see Murray 1988). In the context of network dynamics, a model exhibiting spontaneous symmetry breaking has been proposed by Cowan (1982) to explain geometric visual hallucination patterns. Previous work by Pearson et al. (1987) demonstrated empirically that modularity emerged in simulations of an idealized but rather complex model of somatosensory cortex. The Pearson work was purely empirical and did not attempt to analyze theoretically why the modules developed. 
It provided an impetus, however, for our developing the theoretical results that we present here and in Moody (1990). As mentioned above, a major difference between our work and Pearson’s is that we do not assume spatially correlated afferent stimulation. Our work is thus intended to provide a possible theoretical foundation for the development of modularity as a direct result of network dynamics. Our proposed mechanism can model the formation of both functional and anatomical modules, although the interpretation of the simulations is different in the two cases (see section 4). We have limited our attention to simple models that we can analyze mathematically to identify the essential requirements for the formation of modules. To convince ourselves that both collective excitations and the consequent development of modules are somewhat universal, we have considered several different network models. All models exhibit collective excitations. We believe that more biologically realistic (and therefore more complicated) models will very likely exhibit similar behaviors.
The presentation here is an expanded version of that given in Chernjavsky and Moody (1990).

3 Network Dynamics: Collective Excitations
The analysis of network dynamics presented in this section is adapted from Moody (1990). Due to space limitations, we present here a detailed analysis of only the simplest models that exhibit collective excitations.

All network models that we consider possess a single layer of receptor cells that provides input to a single internal layer of laterally connected cells. Two general classes of models are considered (see Fig. 1): additive models and shunting inhibition models. The additive models contain a single population of internal cells that makes both lateral excitatory and inhibitory connections. Both connection types are additive. The shunting inhibition models have two populations of cells in the internal layer: excitatory cells that make additive synaptic axonal contact with other cells and inhibitory cells that shunt the activities of excitatory cells.

The additive models are further subdivided into models with linear internal units and models with nonlinear (particularly sigmoidal) internal units. The shunting inhibition models have linear excitatory units and sigmoidal inhibitory units. We have considered two variants of the shunting models, those with and without lateral excitatory connections.

For simplicity and tractability, we have limited the use of nonlinear response functions to at most one cell population in all models. More elaborate network models could make greater use of nonlinearity, a greater variety of cell types (e.g., disinhibitory cells), or use more ornate connectivity patterns. However, such additional structure can only add richness to the network behavior and is not likely to remove the collective excitation phenomenon.

3.1 Dynamics for the Linear Additive Model. To elucidate the fundamental requirements for the spontaneous emergence of collective excitations, we now focus on the minimal model that exhibits the phenomenon, the linear additive model. This model is exactly solvable.
As we will see, collective excitations will emerge provided that the appropriate lateral connectivity patterns are present and that the gains of the internal units are sufficiently high. These basic requirements will carry over to the nonlinear additive and shunting models. One kind of lateral connectivity pattern which supports the emergence of collective excitations is local excitation coupled with longer range inhibition. This kind of pattern is analogous to the local autocatalysis and longer range suppression found in reaction-diffusion models (Turing 1952; Meinhardt 1982).
[Figure 1 panel labels: (A) Receptor Units, Internal Units; (B) Receptor Units, Excitatory Units, Inhibitory Units.]
Figure 1: Network models. (A) Additive model. (B) Shunting inhibition model. Only local segments of an extended network are shown. Both afferent and lateral connections in the extended networks are localized. Artwork after Pearson et al. (1987).
The network relaxation equations for the linear additive model are

$$\tau_d \frac{d}{dt} V_i = -V_i + \sum_j W^{\rm aff}_{ij} R_j + \sum_j W^{\rm lat}_{ij} E_j \tag{3.1}$$

where $R_j$ and $E_j$ are the activities (firing rates) of the $j$th receptor and internal cells, respectively, $V_i$ is the somatic potential of the $i$th internal cell, $W^{\rm aff}_{ij}$ and $W^{\rm lat}_{ij}$ are the afferent and lateral connections, respectively, and $\tau_d$ is the dynamic relaxation time. The somatic potentials and firing rates of the internal units are linearly related by $E_i = (V_i - \theta)/\epsilon$, where $\theta$ is an offset or threshold and $\epsilon^{-1}$ is the gain.
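As a minimal sketch of how equation 3.1 can be relaxed numerically, the following Python function applies forward Euler steps to a small network. The function name, the nested-list weight layout, and the step size `rate` (standing for dt divided by the relaxation time) are assumptions made for illustration; the simulations reported in the paper may have been implemented differently.

```python
def relax_linear_additive(R, W_aff, W_lat, eps=1.0, theta=0.0, rate=0.1, steps=200):
    """Euler-integrate the linear additive model toward its fixed point.

    Dynamics: tau_d dV_i/dt = -V_i + sum_j W_aff[i][j] R[j] + sum_j W_lat[i][j] E[j]
    with E_i = (V_i - theta) / eps. `rate` is the assumed step size dt / tau_d.
    """
    n = len(W_aff)            # number of internal units
    V = [0.0] * n             # somatic potentials start at rest (an assumption)
    for _ in range(steps):
        E = [(v - theta) / eps for v in V]   # linear firing rates
        dV = [
            -V[i]
            + sum(W_aff[i][j] * R[j] for j in range(len(R)))
            + sum(W_lat[i][j] * E[j] for j in range(n))
            for i in range(n)
        ]
        V = [V[i] + rate * dV[i] for i in range(n)]
    return V
```

With identity afferents and no lateral connections, the potentials settle to the receptor activities. With mutual lateral excitation of weight 0.25 and unit gain, two units driven by unit input settle near 4/3, since each fixed-point potential satisfies V = 1 + 0.25 V.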
Figure 2: (A) Excitatory (upper dashed), inhibitory (lower dashed), and difference of gaussian (solid) lateral connection patterns. (B) Magnification functions for the linear additive model for the excitatory (upper dashed), inhibitory (lower dashed), and difference of gaussian (solid) lateral connection patterns.

The steady-state solutions of the network equations can be solved exactly by reformulating the problem in the continuum limit ($i \mapsto x$):

$$\tau_d \frac{d}{dt} V(x) = -V(x) + A(x) + \int dy\, W^{\rm lat}(x - y) E(y) \tag{3.2}$$

$$A(x) = \int dy\, W^{\rm aff}(x - y) R(y) \tag{3.3}$$
The functions $R(y)$ and $E(y)$ are activation densities in the receptor and internal layers, respectively. $A(x)$ is the integrated input activation density to the internal layer. The functions $W^{\rm aff}(x - y)$ and $W^{\rm lat}(x - y)$ are interpreted as connection densities. Note that the network is spatially homogeneous since the connection densities depend only on the relative separation of postsynaptic and presynaptic cells $(x - y)$. Examples of lateral connectivity patterns $W^{\rm lat}(x - y)$ are shown in Figure 2A. These include local gaussian excitation, intermediate range gaussian inhibition, and a scaled difference of gaussians (DOG).

The exact stationary solution $(d/dt)V(x) = 0$ of the continuum dynamics of equation 3.2 can be computed by fourier transforming the equations to the spatial frequency domain. The solution thereby obtained (for $\theta = 0$) is $E(k) = M(k)A(k)$, where the variable $k$ is the spatial frequency and $M(k)$ is the network magnification function:

$$M(k) \equiv \frac{1}{\epsilon - W^{\rm lat}(k)} \tag{3.4}$$
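For readers who want the intermediate steps, the stationary solution can be derived as sketched below. The derivation uses only equation 3.2 with the time derivative set to zero, the convolution theorem, and the linear response $E = V/\epsilon$ at $\theta = 0$; the hat notation for fourier transforms is introduced here for clarity and is not the paper's notation.

```latex
\begin{align*}
V(x) &= A(x) + \int dy\, W^{\rm lat}(x - y)\, E(y)
  && \text{(equation 3.2 with } dV/dt = 0\text{)} \\
\hat V(k) &= \hat A(k) + \hat W^{\rm lat}(k)\, \hat E(k)
  && \text{(convolution theorem)} \\
\epsilon\, \hat E(k) &= \hat A(k) + \hat W^{\rm lat}(k)\, \hat E(k)
  && \text{(using } V = \epsilon E \text{ for } \theta = 0\text{)} \\
\hat E(k) &= \frac{\hat A(k)}{\epsilon - \hat W^{\rm lat}(k)} = M(k)\, \hat A(k)
\end{align*}
```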
Positive magnification factors correspond to stable modes. When the magnification function is large and positive, the network magnifies afferent activity structure on specific spatial scales. This occurs when the inverse gain $\epsilon$ is sufficiently small and/or the fourier transform of the pattern of lateral connectivity $W^{\rm lat}(k)$ has a peak at a nonzero frequency. Figure 2B shows magnification functions (plotted as a function of spatial scale $2\pi/k$) corresponding to the lateral connectivity patterns shown in Figure 2A for a network with $\epsilon = 1$. Note that the gaussian excitatory and gaussian inhibitory connection patterns (which have total integrated weight $\pm 0.25$) magnify structure at large spatial scales by factors of 1.33 and 0.80, respectively. The scaled DOG connectivity pattern (which has total weight 0) gives rise to no large scale or small scale magnification, but rather magnifies structure on an intermediate spatial scale of 17 cells.

We illustrate the response of linear networks with unit gain $\epsilon = 1$ and different lateral connectivity patterns in Figure 3. The networks correspond to connectivities and magnification functions shown in Figure 2. Part A shows the response $E(x)$ of neutral, gaussian excitatory, and gaussian inhibitory networks to net afferent input $A(x)$ generated from a random $1/f^2$ noise distribution. The neutral network (no lateral connections) yields the identity response to random input; the networks with the excitatory and inhibitory lateral connection patterns exhibit boosted and reduced response, respectively. Part B shows the emergence of collective excitations (solid) for the scaled DOG lateral connectivity. The resulting collective excitations have a typical period of about 17 cells, corresponding to the peak in the magnification function shown in Figure 2. Note that the positions of peaks and troughs of the collective excitations correspond approximately to local extrema in the random input (dashed).
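The large-scale factors quoted above follow directly from equation 3.4 at k = 0. The sketch below evaluates the magnification function for discrete gaussian and DOG kernels. The kernel widths, the truncation at 20 cells, and the DOG scale factor of 3 are illustrative assumptions, not values taken from the paper; only the total weights of 0.25, -0.25, and 0 and the unit inverse gain come from the text.

```python
import math

def gaussian_kernel(sigma, total, half_width):
    """Discrete gaussian on offsets -half_width..half_width, scaled to sum to `total`."""
    raw = {d: math.exp(-d * d / (2.0 * sigma * sigma))
           for d in range(-half_width, half_width + 1)}
    norm = sum(raw.values())
    return {d: total * w / norm for d, w in raw.items()}

def fourier(kernel, k):
    """Fourier transform of a symmetric kernel at spatial frequency k (radians per cell)."""
    return sum(w * math.cos(k * d) for d, w in kernel.items())

def magnification(kernel, k, eps=1.0):
    """Equation 3.4: M(k) = 1 / (eps - W_lat(k))."""
    return 1.0 / (eps - fourier(kernel, k))

# Widths and the DOG scale are assumptions chosen only for illustration.
exc = gaussian_kernel(sigma=2.0, total=0.25, half_width=20)    # local excitation
inh = gaussian_kernel(sigma=6.0, total=-0.25, half_width=20)   # longer range inhibition
dog = {d: 3.0 * (exc[d] + inh[d]) for d in exc}                # scaled DOG, total weight 0

m_exc = magnification(exc, 0.0)   # 1 / (1 - 0.25), about 1.33
m_inh = magnification(inh, 0.0)   # 1 / (1 + 0.25), 0.80
m_dog = magnification(dog, 0.0)   # 1 / (1 - 0), no large scale magnification
```

Scanning `magnification(dog, k)` over k shows a single peak at an intermediate spatial frequency, which is the discrete analogue of the band-pass behavior described for the DOG pattern. The location of that peak depends on the assumed widths.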
[Figure 3 panel titles: (A) Lateral Excitation or Inhibition; (B) Collective Excitations.]
Figure 3: Response of a linear network to random input. (A) Response $V(x)$ of neutral (no lateral connections, dashed line), lateral excitatory (upper solid), and lateral inhibitory (lower solid) networks. For the neutral network (which has unit gain and zero threshold), the response $V(x)$ for unit $x$ equals its total afferent stimulation $A(x)$. (B) Collective excitations $V(x)$ (solid) as response to random total afferent stimulation $A(x)$ (dashed) in network with DOG lateral connectivity.

It is interesting to note that even though the individual components of the networks are all linear, the overall response of the interacting system (equation 3.4) is nonlinear in the recurrent lateral connection values $W^{\rm lat}$. This collective nonlinearity of the system enables the large amplification of activity at a particular spatial scale, the collective excitations. Although the connectivity giving rise to the response in Figure 3B is a scaled sum of the connectivities of the excitatory and inhibitory networks of Figure 3A, the responses themselves do not add. Thus, while a superposition principle holds for the patterns of afferent stimulation and connectivity, superposition does not hold for the patterns of lateral connectivity.

3.2 Dynamics for the Nonlinear Additive Model. The nonlinear models, including the sigmoidal additive model and the shunting models, exhibit the collective excitation phenomenon as well. These models cannot be solved analytically in closed form, so solutions to the dynamic equations must be obtained by iterative numerical procedures. A method of analyzing the fixed points and stability of these models is to solve them exactly for homogeneous inputs $A_0$ and then consider small perturbations to the input, $A_0 \mapsto A_0 + \delta A(x)$. As for the linear model, the perturbative analysis proceeds by going to the continuum limit and computing the network magnification function in the spatial frequency domain.

The nonlinear additive model differs from the linear additive model only in that the internal units have response

$$E = g(V) \equiv \sigma_+\!\left(\frac{V - \theta}{\epsilon}\right) \tag{3.5}$$
where $\sigma_+$ is an increasing sigmoidal function, for example the standard logistic function. The dynamics for perturbations $\delta E(x) = g'(V_0)\,\delta V(x)$ about the homogeneous network fixed point $E_0 = g(V_0)$ are (in the spatial frequency domain)

$$\tau_d \frac{d}{dt} \delta V(k) = -J(k)\,\delta V(k) + \delta A(k) \tag{3.6}$$

where the function $J(k)$ (which determines the stability of the perturbed modes) is

$$J(k) = [1 - W^{\rm lat}(k)\, g'(V_0)] \tag{3.7}$$

The magnitude of the perturbed activations $\delta E(k) = M(k)\,\delta A(k)$ is given by the magnification function, which follows from the stationary solution of equation 3.6:

$$M(k) = \frac{g'(V_0)}{J(k)} = \frac{1}{[g'(V_0)]^{-1} - W^{\rm lat}(k)} \tag{3.8}$$
Collective excitations correspond to stable modes with large positive magnification functions or perturbatively unstable modes with negative magnification functions. As in the linear model, these occur for large values of the fourier transform of the lateral connectivity patterns $W^{\rm lat}(k)$ or when the gain $\epsilon^{-1}$ becomes sufficiently large. Unlike in the linear model, the perturbatively unstable modes [$M(k) < 0$] will not grow without bound, because the sigmoidal response function $g(V)$ limits internal unit activities to a finite range. A more complete analysis along with simulations of the network dynamics are presented in Moody (1990). The results are qualitatively similar to those presented for the linear additive model earlier.

3.3 Formulation of the Nonlinear Shunting Model. The nonlinear shunting model has two populations of internal units: linear excitatory units and sigmoidal inhibitory units. The relaxation dynamics of the excitatory units are

[equation lost in extraction] (3.9)

while the inhibitory units obey the equations

[equation lost in extraction] (3.10)

Here, the responses of the excitatory units are $E_i = V_i$, while the inhibitory units have sigmoidal response $I_i = g(Q_i) = \sigma_+[(Q_i - \theta)/\epsilon]$. $W^{\rm exc}$ and $W^{\rm inh}$ are the lateral excitatory and inhibitory connection patterns, respectively. A perturbative analysis similar to that summarized above demonstrates that the magnification function again depends on the gain $\epsilon^{-1}$ of the internal units and on the fourier transforms of the lateral connectivity patterns. Of particular interest is that collective excitations can form even in the absence of lateral excitatory connections; a symmetric, bimodal pattern of inhibitory connections is sufficient. See Moody (1990) for a detailed presentation.
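The perturbative criterion of Section 3.2 can be checked numerically for a logistic response. The sketch below computes the local gain g'(V0) and the stability function J(k) of equation 3.7. Setting the time derivative in equation 3.6 to zero gives a perturbed potential equal to the perturbed input divided by J(k), so the perturbed activity is g'(V0) times that ratio. The function names and the choice to evaluate the gain at the threshold (where the logistic is steepest) are assumptions for illustration.

```python
import math

def logistic(u):
    """Standard logistic function, one example of an increasing sigmoid."""
    return 1.0 / (1.0 + math.exp(-u))

def gain_at(V0, eps, theta):
    """Local gain g'(V0) for g(V) = logistic((V - theta) / eps)."""
    s = logistic((V0 - theta) / eps)
    return s * (1.0 - s) / eps

def stability(w_lat_k, V0, eps, theta):
    """Equation 3.7: J(k) = 1 - W_lat(k) g'(V0); J(k) > 0 means the mode is stable."""
    return 1.0 - w_lat_k * gain_at(V0, eps, theta)

def perturbative_magnification(w_lat_k, V0, eps, theta):
    """Stationary response ratio dE(k) / dA(k) = g'(V0) / J(k)."""
    return gain_at(V0, eps, theta) / stability(w_lat_k, V0, eps, theta)
```

For example, with an inverse gain of 0.22 and a threshold of 0.5, the local gain at the threshold is 0.25 / 0.22, so a mode with a lateral fourier amplitude of 0.5 remains stable while one with amplitude 1.0 is perturbatively unstable.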
4 Hebbian Learning: Development of Modularity
The presence of collective excitations in the network dynamics enables the development of modular structures via Hebbian learning. Normally, Hebbian learning refers to the modification of individual synaptic efficacies via a local, activity-dependent process. However, simple Hebb rules may also be used as a proxy to model activity-dependent developmental processes on scales larger than that of a single synapse or even a single neuron. On larger scales, simple Hebb rules can be thought of as summarizing, emulating, or capturing the general, collective, or qualitative behavior of more detailed and complicated developmental processes involved in the formation of afferent fan-out patterns, dendritic arbors, and the establishment of synaptic contacts on the scales of columns and maps. A detailed model of such processes is beyond the range of current experimental knowledge, so simplifying assumptions must be made.

Our simulations, which make use of simple Hebb rules, can thus be thought of as modeling both functional and anatomical modules (see our definitions in section 1). Narrowly interpreted, our simulations model the formation of functional modules, modular patterns in the synaptic efficacies in an otherwise homogeneous connectivity structure. More broadly interpreted, our simulations model the development of a spatially inhomogeneous modular connectivity architecture that is readily detectable using standard anatomical techniques. In the narrow interpretation, we model the variation of individual synaptic efficacies. In the broad interpretation, we model the spatiotemporal variation of connection density. From a mathematical standpoint, a discrete approximation of connection density appropriate for numerical simulation is equivalent to a distribution of individual synapses.

We have succeeded in simulating the development of both afferent and lateral modules in our various models.
Our linear and nonlinear additive models incorporate only plastic afferent excitatory connections. We kept the lateral connections in the additive models fixed to preserve the dynamics that yield collective excitations. The shunting model, on the other hand, has in addition plastic lateral excitatory connections. This is because the collective excitations in the shunting model can occur by virtue of the fixed inhibitory lateral connections alone. In all network architectures which support collective excitations, we have observed either the development of modular structure in the afferent connections (afferent modules), the development of modular structure in the lateral connections (lateral modules), or both. In models that do not exhibit strong collective excitations, we were unable to induce the development of modular structures. We believe that our linear model is the minimal model that supports the development of afferent modules and that our shunting model is
probably the minimal model that will support consistent development of lateral modules.

4.1 Mathematical Formulation of Hebb Rules. In our networks, the plastic excitatory connection values are restricted to the range $W \in [0,1]$. The homogeneous initial conditions for all connection values are $W = 0.5$. We have considered several variants of Hebbian learning. These include the simple Hebb/decay rule

[equation lost in extraction] (4.1)

and the Hebb/anti-Hebb rule

[equation lost in extraction] (4.2)

where $M_i$ and $N_j$ are the post- and presynaptic activities, respectively, and $\tau_{\rm Hebb}$ is the timescale for learning. In our simulations, the decay constants $\beta$ and $\gamma$ are chosen to be approximately equal to the expected values $\overline{MN}$ and [expression lost in extraction], respectively, averaged over the whole network. The choice for $\beta$ makes the Hebb/decay rule similar to the covariance type rule of Sejnowski (1977). Thus, positive covariance between pre- and postsynaptic activities tends to strengthen a weight while negative covariance tends to weaken it. The difference between our Hebb/decay rule and the covariance rule is that our $\beta$ is a global constant rather than being locally determined.

We also tested the effect of including an additional spontaneous symmetry breaking (SSB) term that favors saturation of the weights to their extremal values (0,1):

$$\tau_{\rm Hebb} \frac{d}{dt} W_{ij} = \text{any Hebb rule} + \kappa\,(W_{ij} - 0.5) \tag{4.3}$$

with $\kappa$ a small positive constant. Similar SSB terms have been employed by Kammen and Yuille (1988).
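The following is a minimal sketch of a Hebb/decay update with the optional SSB term of equation 4.3. It assumes the Hebb/decay rule has the form of a product of post- and presynaptic activities minus a global constant decay, which matches the verbal description above but may differ from the paper's exact rule. The function name, the clipping used to enforce the range [0,1], and the parameter `rate` (dt divided by the Hebbian timescale) are likewise assumptions.

```python
def hebb_decay_step(W, M, N, beta, rate=0.01, kappa=0.0):
    """One Euler step of an assumed Hebb/decay rule with an optional SSB term.

    Assumed form: tau_Hebb dW_ij/dt = M_i N_j - beta + kappa (W_ij - 0.5),
    where M are postsynaptic and N presynaptic activities.
    Weights are clipped to the allowed range [0, 1] after each step.
    """
    return [
        [
            min(1.0, max(0.0,
                W[i][j] + rate * (M[i] * N[j] - beta + kappa * (W[i][j] - 0.5))))
            for j in range(len(N))
        ]
        for i in range(len(M))
    ]
```

Starting from the homogeneous value 0.5, a weight whose pre- and postsynaptic units are both fully active grows, while a weight with a silent presynaptic unit decays at the constant rate set by `beta`. With a positive `kappa`, weights above 0.5 are pushed further up and weights below 0.5 further down, which is the saturating effect described for the SSB term.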
4.2 Simulation Results. Comparing the various Hebb rules, we found that the simple Hebb/decay rule yielded the best and most consistent results. The Hebb/anti-Hebb rule was found to give rise to the development of only lateral modules in the shunting model. Afferent modules did not develop with this rule, because it makes the weights follow the recent average presynaptic activities, and the receptor input to our networks is spatially uncorrelated. However, the Hebb/decay rule enabled the development of afferent modules in all models as well as lateral modules in the shunting model. Furthermore, we found that the additional SSB term of equation 4.3 did not qualitatively change the results of our simulations for either the Hebb/decay or Hebb/anti-Hebb
rules. This extra term was therefore not required to ensure the stability of the modular structures that developed. We thus present simulation results only for the Hebb/decay rule without the extra SSB term.

The simulation results illustrated in Figure 4 are of one-dimensional networks. In these simulations, the units and connections illustrated are intended to represent a continuum. The connection densities for afferent and lateral excitatory connections were chosen to be smoothly varying functions. Mathematically, the connection density is implemented by making the following substitution in the network equations presented previously: $W(x - y) \mapsto G(x - y)\,W(x, y)$. Here, $G(x - y)$ is the translationally invariant connection density and $W(x, y)$ are the plastic (and generally not translationally invariant) weight values between presynaptic neurons at location $y$ and postsynaptic neurons at location $x$.

In the actual simulations, continuous coordinates such as $x$ are discretized on a one-dimensional lattice of 64 units per layer. The lattices have periodic boundary conditions. The input activations were uniform random values in the range [0,1]. The input activations were spatially and temporally uncorrelated. Each input pattern was presented for only one dynamic relaxation time of the network (10 time steps). The following adaptation rate parameters were used: dynamic relaxation rate $\tau_d^{-1} = 0.1$, learning rate $\tau_{\rm Hebb}^{-1} = 0.01$, weight decay constant $\beta = 0.125$ for linear and sigmoidal networks, and $\beta = 0.1$ for the shunting network.
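The connection-density substitution on a periodic lattice can be sketched as follows. The helper names and the dictionary representation of the density G (keyed by signed lattice offset, with missing offsets treated as zero beyond the fan-out) are assumptions for illustration only.

```python
def periodic_offset(x, y, n):
    """Signed lattice separation x - y, wrapped into the interval [-n/2, n/2)."""
    return (x - y + n // 2) % n - n // 2

def effective_weights(density, W):
    """Effective connections G(x - y) * W[x][y] on a periodic lattice.

    `density` maps signed offsets to the translationally invariant density G;
    `W` is the square matrix of plastic weights (postsynaptic x, presynaptic y).
    """
    n = len(W)
    return [
        [density.get(periodic_offset(x, y, n), 0.0) * W[x][y] for y in range(n)]
        for x in range(n)
    ]
```

On a 64-unit lattice, units 0 and 63 are nearest neighbours under this wrapping, so `periodic_offset(0, 63, 64)` is 1 and `periodic_offset(63, 0, 64)` is -1. Because G depends only on the offset, all spatial inhomogeneity in the effective connections comes from the plastic weights.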
Figure 4: Development of modularity via Hebbian learning. (A) Linear additive model. Time development of afferent modules. Network states are displayed for snapshots at times 50 (top), 350, 550, and 1500 (bottom) iterations. The upper row of units in each snapshot indicates the receptor cell activities, the lower row indicates the internal unit activities. The afferent connection values are indicated by lines connecting the receptor units to the internal units. Both the activities and the connection values are color coded in quintiles: blue (minimal), green, yellow, orange, red (maximal). The activities are coded on a normalized scale, while the connection weights are coded on an absolute scale over the range [0,1]. (B) Sigmoidal additive model. Time development of afferent modules. Network states are displayed for snapshots at times 50 (top), 350, 550, and 2100 (bottom) iterations. (C) Shunting inhibition model. Final state (10,000 iterations) after development of both afferent and lateral modules. The rows of units represent activities of receptors (upper), internal excitatory cells (second and third), internal inhibitory cells (last). The excitatory cells are displayed twice for representational convenience. The upper layer of connections is the plastic afferents, while the lower layer is the plastic laterals between internal excitatory units. Connections between internal excitatory and inhibitory cells are not shown. Note that the modules in the shunting model are significantly more regular than the modules in either of the additive models.
The linear model (Figure 4A) and the sigmoidal model (Figure 4B) used a gaussian afferent connection density ($\sigma = 1.4$ lattice units) and a laplacian of gaussian lateral connection density ($\sigma = 2.0$ lattice units). The integrated magnitudes of both the afferent and lateral connection densities were 1.0. The maximum fan-out of afferent and lateral connections was F = 9 lattice units. The linear model used $\epsilon = 1$ and $\theta = 0$. The sigmoidal model used $\epsilon = 0.22$ and $\theta = 0.5$. The same random seed was used in both the linear and sigmoidal simulations. Note that the afferent modular structures which developed are quite similar. Also note in Figure 4A and B that at times even before afferent modules have developed, collective excitations are observable in the internal unit activities even though the receptor activities are spatially uncorrelated. For the shunting model (Figure 4C), the connection densities for afferent and lateral excitatory connections were chosen to be gaussian with a maximum fan-out of 9 lattice units. The inhibitory connection density had a maximum fan-in of 19 lattice units and had a symmetric bimodal shape. The sigmas of the excitatory and inhibitory fan-ins were, respectively, 1.4 and 2.1 (short-range excitation and longer range inhibition). The linear excitatory units had $\epsilon = 1$ and $\theta = 0$, while the sigmoidal inhibitory units had $\epsilon = 0.175$ and $\theta = 0.5$. Note that the activities of the inhibitory units are anticorrelated with the activities of the excitatory units in Figure 4C as expected. Also note that the modules are much more regular in the shunting model than in either the linear or sigmoidal models. As is apparent from Figure 4, modular structures in the plastic afferent connection patterns developed for all three network models considered and lateral connectivity modules developed for the shunting model.
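The connection densities described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' simulation code: the function names are hypothetical, the fan-out of 9 is read as a kernel 9 lattice units wide, the laplacian of gaussian is given an excitatory centre and inhibitory surround, and "integrated magnitude 1.0" is read as the absolute values summing to 1. All three readings are assumptions.

```python
import math

N_UNITS = 64   # lattice units per layer, as stated in the text
FAN_OUT = 9    # maximum fan-out, read here as a kernel 9 units wide (assumption)


def gaussian_density(sigma, fan_out=FAN_OUT):
    """Truncated gaussian connection density whose values sum to 1."""
    half = fan_out // 2
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma))
           for d in range(-half, half + 1)]
    total = sum(raw)
    return [v / total for v in raw]


def log_density(sigma, fan_out=FAN_OUT):
    """Truncated laplacian-of-gaussian density with an excitatory centre and
    inhibitory surround (sign convention assumed); absolute values sum to 1."""
    half = fan_out // 2
    raw = [(1.0 - (d * d) / (sigma * sigma)) * math.exp(-(d * d) / (2.0 * sigma * sigma))
           for d in range(-half, half + 1)]
    total = sum(abs(v) for v in raw)
    return [v / total for v in raw]


def apply_periodic(density, activity):
    """Density-weighted sum of neighbouring activities on a periodic lattice."""
    n = len(activity)
    half = len(density) // 2
    return [sum(g * activity[(x + k - half) % n] for k, g in enumerate(density))
            for x in range(n)]


afferent = gaussian_density(sigma=1.4)  # afferent density of the additive models
lateral = log_density(sigma=2.0)        # lateral density of the additive models
```

Because the gaussian density sums to 1, a spatially uniform input passes through `apply_periodic` unchanged, which is a convenient sanity check on the periodic indexing.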
5 Discussion of Results
Three observations should be made regarding our work:
• We have focused on identifying, analyzing, and simulating basic dynamic mechanisms that can give rise to the activity-dependent development of both afferent and lateral modular structures and have not attempted a biologically detailed simulation.
• For computational simplicity, our simulations are of one-dimensional cortical models. However, our analysis carries over directly with no substantial modification to two- and three-dimensional networks. In two and three dimensions, the collective excitation phenomenon will likely not only persist, but may take on a richer repertoire of patterns, for example stripes and/or spots.
• Our choices of simple Hebb rules are somewhat arbitrary and should be viewed as only crude proxies for the true synaptic plasticity rules (in the case of functional modules) and for complicated, activity-dependent developmental processes (in the case of anatomical modules).
Several general conclusions emerge from this work:
• We have proposed and tested dynamic mechanisms by which modular structures at specific spatial scales might develop.
• The existence of collective excitations is a generic spontaneous symmetry breaking phenomenon for the kinds of network models we have considered.
• Collective excitations on a specific spatial scale emerge spontaneously given appropriate lateral connectivity and sufficiently high gains.
• Local lateral excitation and longer range lateral inhibition give rise to collective excitations in the additive models.
• A symmetric bimodal lateral inhibition pattern gives rise to collective excitations in the shunting model.
• When collective excitations are not present, modularity was not observed to develop.
• The models we have considered can support the development of afferent modules, lateral modules, or both.
Three specific conclusions regarding the choice of Hebb rule also emerge:
• When collective excitations are present, the Hebb/decay adaptation rule resulted in the development of both afferent and lateral modules in the various models.
• When collective excitations are present, the Hebb/anti-Hebb adaptation rule resulted in the development of only lateral modules in the shunting model.
• Spontaneous symmetry breaking terms in the Hebb rules (suggested by Kammen and Yuille 1988) were not required to ensure the stability of the resulting modular structures.
Four important questions remain to be answered in the future:
• How do modular patterns in the lateral connections of cortex develop?
• Do real cortical networks exhibit the collective excitation phenomenon? If so, under what conditions?
• Do simple Hebb rules capture or effectively emulate either the modification of individual synapses or complex activity-dependent developmental processes? What simple Hebb rules (if any) are the most appropriate to model synaptic plasticity or development at a cortical level? (Note that the appropriate mathematical description at the cortical level is likely to differ from that which is appropriate to model plasticity of individual synapses.)
• What is required to construct biologically detailed and convincing models of activity-dependent development in real cortical structures?
The answers to all four questions will come only from detailed biological investigations coupled closely with more complete mathematical models and computer simulations.
6 Comparison to Related Work
We first compare our work to the simulations of Pearson et al. (1987) and then to other related models.
6.1 Comparison to Pearson et al. The work of Pearson et al. (1987) modeled the development and plasticity of neuronal groups in somatosensory cortex and the role of groups in the plasticity of somatosensory maps. Although Edelman's theory of neuronal group selection is viewed by many as rather speculative and vague (see for example Crick 1989), Pearson's simulations stand on their own and can be interpreted as being relevant to the development and plasticity of cortical columns, colonies, or barrels. Our theory is both supported by and elucidates the simulations of Pearson et al. However, a few important differences should be noted. In comparison to our models, the network dynamics and synaptic plasticity rules of Pearson et al. were highly nonlinear and somewhat complex. Their network dynamics contained 3 independent sources of nonlinearity and 9 parameters. Their synaptic plasticity rule involved 5 coupled differential equations and 14 adjustable parameters. We chose to consider much simpler models to obtain analytical tractability and thereby more easily identify the basic dynamic mechanisms responsible for the development of modules. Each of our network models contains at most one intrinsic nonlinearity and at most three adjustable parameters ($\tau$, $\epsilon$, $\theta$). Our bilinear Hebb rules utilized only two or three parameters ($\tau_{\mathrm{Hebb}}$, $\beta$ or $\gamma$, sometimes $\kappa$). We believe that our models are the minimal models that exhibit the phenomena we have studied. Furthermore, we have specifically distinguished between afferent modularity and lateral modularity. Pearson et al. implicitly understood "groups" to mean modular patterns in the lateral connections. They did not address the issue of afferent modularity in their paper, although they did observe afferent groups developing in simulations (Pearson 1990). As we have found, however, the development of stable afferent modules depends on the choice of Hebb rule. The distinction between afferent and lateral modularity may be relevant biologically from the standpoints of both development and information processing. It is important to reiterate one key distinction between our simulations and those of Pearson et al. We used spatially random, uncorrelated afferent stimulation, while the Pearson simulations used localized patterns of afferent stimulus. Random stimulus makes the formation of modular structures on the basis of chance much less likely, so our simulations give support to our hypothesis that spontaneous instabilities (collective excitations) in the lateral interaction dynamics are responsible for inducing the development of modules. The use of localized stimulation patterns would only make modules form more quickly in our models.
6.2 Comparison to Other Related Work. Besides the Pearson work, the most closely related models to ours are those of ocular dominance column formation by Swindale (1980) and Miller et al. (1989).¹ The Swindale model did not explicitly contain a network, but rather assumed abstract nonlocal interactions between synapses in a region of cortex. The Miller model attempted to identify generic mechanisms, but did not focus specifically on network dynamics.² It used the Hebb/decay adaptation rule, and reduces to the Swindale model in a certain limit. A fundamental difference between the ocular dominance models and our more generic modularity models is that the former utilize two distinct, competing sets of afferents (from left and right eyes) and assume spatially correlated input from each eye. The resulting structures appear qualitatively similar to our results, however.
In fact, by adjusting parameters of the Hebb/decay rule, the internal unit gains $\epsilon^{-1}$, and the patterns of lateral connectivity, we have obtained simulation results in which the spaces between modules, "dead regions," are equal in size to the modules themselves. This corresponds mathematically to the observed fact that ocular dominance columns for left and right eyes are equal in width.
¹ It should be noted for completeness that the Swindale and Miller models were preceded by the work of von der Malsburg (1979) who simulated the formation of ocular dominance columns using only chemical markers and did not include activity dependent modification.
² After we completed our analysis and simulations, it was pointed out to us that footnote (15) of Miller et al. (1989) discusses a network interpretation of their model that is mathematically equivalent to our linear model. Miller et al., however, did not analyze the dynamics of the linear model or discover the phenomenon of collective excitations.
Other related works utilizing Hebbian learning rules are the models of the development of receptive field properties of individual cells, specifically spatial opponent cells (Linsker 1986a) and orientation selective cells (Linsker 1986b; Kammen and Yuille 1988). These models used multilayer feedforward networks with no lateral interactions and depended on spatially correlated input (after the first layer) to obtain the observed results. The Kammen and Yuille work differed from the Linsker models in that it incorporated a spontaneous symmetry breaking mechanism in the Hebb dynamics. We found such a mechanism to be unnecessary for the development of modules.
Two models of orientation selective column formation due to Swindale (1982) and Linsker (1986c) deserve mention. These models did not make direct reference to network architectures, but rather assumed abstract or effective lateral interactions in a field of orientation vectors. The Swindale model assumed local excitation and longer range inhibition, while the Linsker model assumed only weak local excitation.
Finally, Whitelaw and Cowan (1981) presented a fairly convincing model for the development of retinotectal maps. This model incorporated not only activity-dependent synaptic plasticity, but also the effects of chemospecific adhesion markers in the development of synapses. The tectal network model featured local lateral excitatory connections.
The key point in our comparison is that these previous models, while addressing a variety of important developmental phenomena, have largely ignored the possibility that interesting nonlinear network dynamics due to lateral interactions in a layer of cells can be instrumental in guiding the development process. On the other hand, we view our work as only a first step in studying the effects of network dynamics on cortical development.
We believe that future, more detailed work in this direction may uncover mechanisms operative in the development of a rich variety of cortical structures.
Note Added in Proof
After this paper was completed and accepted for publication, we became aware of work by Takeuchi and Amari (1979) that models the formation of columnar structures. Their network model is a nonlinear additive model with sharp thresholds, and they use the Hebb/decay learning rule as we do. However, they emphasize the role of localized afferent stimulation patterns in driving the formation of columnar structures and do not address the possible existence or importance of collective excitations in the network dynamics and their consequent role in the development of modules.
Acknowledgments
The authors wish to thank Valentino Braitenberg, George Carman, Martha Constantine-Paton, Kamil Grajski, Daniel Kammen, Ken Miller, John Pearson, Terry Sejnowski, and Gordon Shepherd for helpful comments. A. C. thanks Stephen J. Smith for the freedom to pursue projects outside the laboratory. J. M. was supported by ONR Grant N00014-89-J-1228 and AFOSR Grant 89-0478. A. C. was supported by the Howard Hughes Medical Institute and by the Yale Neuroscience Program.
References
Braitenberg, V. 1978. Cell assemblies in the cerebral cortex. In Lecture Notes in Biomathematics, Vol. 21, R. Heim and G. Palm, eds., p. 171. Springer-Verlag, New York.
Chernjavsky, A., and Moody, J. 1990. Note on development of modularity in simple cortical models. In Advances in Neural Information Processing Systems, Vol. 2, D. Touretzky, ed. Morgan Kaufmann, Palo Alto.
Cowan, J. D. 1982. Spontaneous symmetry breaking in large scale nervous activity. Int. J. Quant. Chem. 22, 1059.
Crick, F. 1989. Neural Edelmanism. Trends Neurosci. 12(7), 340.
Goldman-Rakic, P. 1981. Prenatal formation of cortical input and development of cytoarchitectonic compartments in the neostriatum of the rhesus monkey. J. Neurosci. 1, 721.
Kaas, J. H., Merzenich, M. M., and Killackey, H. P. 1983. The reorganization of somatosensory cortex following peripheral nerve damage in adult and developing animals. Ann. Rev. Neurosci. 6, 325.
Kalil, R. E. 1989. Synapse formation in the developing brain. Sci. Am., December, 76-85.
Kammen, D., and Yuille, A. 1988. Spontaneous symmetry-breaking energy functions and the emergence of orientation selective cortical cells. Biol. Cybernet. 59, 23.
Linsker, R. 1986a,b,c. From basic network principles to neural architectures: Emergence of (a) spatial-opponent cells, (b) orientation-selective cells, and (c) orientation columns. Proc. Natl. Acad. Sci. U.S.A. 83, 7508, 8390, 8779.
Meinhardt, H. 1982. Models of Biological Pattern Formation. Academic Press, New York.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605.
Moody, J. 1990. Dynamics of lateral interaction networks. Tech. Rep., Yale University. (In preparation.)
Mountcastle, V. B. 1957. Modality and topographic properties of single neurons of cat's somatic sensory cortex. J. Neurophysiol. 20, 408.
Mountcastle, V. B. 1978. An organizing principle for cerebral function: The unit module and the distributed system. In The Mindful Brain, G. Edelman and V. B. Mountcastle, eds., pp. 7-50. MIT Press, Cambridge, MA.
Murray, J. D. 1988. How the leopard gets its spots. Sci. Am. 258, 80.
Pearson, J. C., Finkel, L. H., and Edelman, G. M. 1987. Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. Neurosci. 7, 4209.
Pearson, J. C. 1990. Personal communication.
Rice, F. L., Gomez, C., Barstow, C., Burnet, A., and Sands, P. 1985. A comparative analysis of the development of the primary somatosensory cortex: Interspecies similarities during barrel and laminar development. J. Comp. Neurol. 236, 477.
Sejnowski, T. 1977. Strong covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303.
Stryker, M. P. 1986. The role of neural activity in rearranging connections in the central visual system. In The Biology of Change in Otolaryngology, R. J. Ruben et al., eds., pp. 211-224. Elsevier Science, Amsterdam.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. R. Soc. London B208, 243.
Swindale, N. V. 1982. A model for the formation of orientation columns. Proc. R. Soc. London B215, 211.
Takeuchi, A., and Amari, S. 1979. Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybernet. 35, 63.
Turing, A. 1952. The chemical basis of morphogenesis. Phil. Transact. R. Soc. London B237, 37.
von der Malsburg, Ch. 1979. Development of ocularity domains and growth behaviour in axon terminals. Biol. Cybernet. 32, 49.
Whitelaw, V. A., and Cowan, J. D. 1981. Specificity and plasticity of retinotectal connections: A computational model. J. Neurosci. 1, 1369.
Received 18 January 90; accepted 22 June 90.
Communicated by Andrew Barto
The Bootstrap Widrow-Hoff Rule as a Cluster-Formation Algorithm
Geoffrey E. Hinton
Steven J. Nowlan
Department of Computer Science, University of Toronto, 10 King's College Road, Toronto M5S 1A4, Canada
An algorithm that is widely used for adaptive equalization in current modems is the "bootstrap" or "decision-directed" version of the Widrow-Hoff rule. We show that this algorithm can be viewed as an unsupervised clustering algorithm in which the data points are transformed so that they form two clusters that are as tight as possible. The standard algorithm performs gradient ascent in a crude model of the log likelihood of generating the transformed data points from two gaussian distributions with fixed centers. Better convergence is achieved by using the exact gradient of the log likelihood.
1 Introduction
Modems are used to transmit bits along analog lines. Since the bandwidth is limited, it is impossible to transmit square pulses, so the contribution that each individual bit makes to the transmitted signal is necessarily extended in time as shown in Figure 1. The received signal is strobed at the points where the individual bits make their maximum contributions, but the "intersymbol interference" from nearby bits can cause the strobe values to have the wrong sign. So a simple threshold decision rule based on the strobe value alone is imperfect. Better decisions can be made by taking into account the local temporal context. A simple way to use local context is to have delay taps that represent the value of the received signal at nearby strobe times.¹ The delay taps can be used as the input lines to a linear unit that forms a weighted combination of a strobe value with nearby values before the thresholding operation is applied. If the desired outputs of the thresholding operation are known, the weights can be trained by using a supervised learning procedure such as Widrow-Hoff (Widrow and Hoff 1960). When the output should be above (below) threshold, we assume that the desired output value is +1 (-1). Unfortunately, this requires that a known bitstring be transmitted.
¹ Fractional delay times that sample more frequently than the strobe period are also used.
Neural Computation 2, 355-362 (1990) © 1990 Massachusetts Institute of Technology
Figure 1: Contribution of an individual bit to the transmitted signal.
If we wish to avoid sending a known signal, or if we wish to continually adapt to the properties of the line, then we need some other way of deciding what the correct output of the decision process should have been. In the "bootstrapping" or "decision-directed" algorithm (Lucky 1966), the actual output of the linear unit is thresholded at zero and it is assumed that the decision is correct. So whenever the actual output is above zero, we assume that the desired output is +1 and we adjust the weights accordingly. Figure 2 uses a very simple example to show why this procedure works. If the initial weights cause only a few cases to be misclassified, the learning that takes place for the correctly classified cases will eventually correct the errors, even though the learning on the wrongly classified cases moves the weights in the wrong direction. The bootstrapping algorithm works well in practice (Widrow and Stearns 1985; Qureshi 1985) and it is one of the most important current applications of neural networks. However, there seems to be little theoretical justification for using the thresholded actual output to determine the desired output. The aim of this paper is to provide a justification by showing that the bootstrapping algorithm can be viewed as an unsupervised cluster-formation algorithm that has a close relationship to competitive learning. Like competitive learning, the bootstrapping algorithm makes a "winner-take-all" decision in deciding which cluster should be given responsibility for a given data point. Its convergence can be improved by using a statistically more reasonable "soft" decision in which the responsibility of a cluster for a data point depends on the relative probability of generating the data point from the cluster.
Figure 2: The $x_1$ and $x_2$ axes represent the values on two delay taps that are used as input to a linear unit. The two ellipses represent the distributions of two clusters of input vectors. The thick line labeled w(t0) represents the initial weight vector. When the two clusters are projected onto this weight vector, they yield the two overlapping gaussians shown beneath. With learning, the weight vector rotates so that the projection of the clusters forms two well-separated gaussians.
2 The Objective Function for the Bootstrapping Procedure
One simple form of competitive learning works by minimizing the squared distance between each data point and the nearest cluster center by moving the cluster center toward the data point by an amount proportional to their separation (Hinton 1989). If we view the output of the linear unit as a data point, the bootstrap algorithm minimizes the same objective function by moving the data point toward the center of the nearest cluster. Figure 2 shows that the bootstrapping procedure adjusts the weight vector so that the actual output of the linear unit forms two sharp clusters, one centered around +1 and the other centered around -1. This suggests the following interpretation of what the bootstrapping is really achieving.
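The decision-directed procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a modem implementation: the function names, the learning rate `lr`, and the treatment of an output of exactly zero as a -1 decision are all hypothetical choices.

```python
def linear_output(weights, taps):
    """Output of the linear unit: weighted sum of the delay-tap values."""
    return sum(w * a for w, a in zip(weights, taps))


def widrow_hoff_step(weights, taps, desired, lr):
    """Supervised Widrow-Hoff update toward a known desired output."""
    error = desired - linear_output(weights, taps)
    return [w + lr * error * a for w, a in zip(weights, taps)]


def bootstrap_step(weights, taps, lr):
    """Decision-directed update: threshold the actual output at zero and
    treat the resulting decision (+1 or -1) as the desired output."""
    x = linear_output(weights, taps)
    desired = 1.0 if x > 0 else -1.0  # an output of exactly 0 is treated as -1 here
    return widrow_hoff_step(weights, taps, desired, lr)


# One update on a single (hypothetical) pair of tap values.
new_weights = bootstrap_step([0.5, 0.0], [1.0, 0.2], lr=0.1)
```

Each update moves the output for the presented case toward whichever of +1 or -1 it is already closer to, which is the sense in which the data point is moved toward the nearest cluster center.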
We start with the prior belief that the output of the linear unit can be modeled by two gaussian distributions, one centered at -1 (for the 0 bits) and the other centered at +1 (for the 1 bits). We want to make the outputs of the linear unit fit this mixture of gaussians model as well as possible. This can be accomplished by modifying the weights of the linear unit and the parameters of the gaussians to maximize the log likelihood of generating the observed output values from the mixture of gaussians model:
$$\log L = \sum_t \log\left[\frac{p_1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(x_t-\mu_1)^2/2\sigma_1^2} + \frac{p_2}{\sqrt{2\pi}\,\sigma_2}\, e^{-(x_t-\mu_2)^2/2\sigma_2^2}\right] \qquad (2.1)$$
where $x_t$ is the output value of the linear unit for the $t$th training case, $p_1$ and $p_2$ are the proportions of the two gaussians in the mixture, $\sigma_1^2$ and $\sigma_2^2$ are their variances, and $\mu_1$ and $\mu_2$ are their means. Later we will consider how to adapt the parameters of the two gaussians, but initially we will assume that these parameters are fixed and only the weights of the linear unit can be adapted. To perform the online version of gradient ascent in $\log L$, we need to change each weight, $w_i$, by
$$\Delta w_i = \epsilon\, \frac{\partial \log L}{\partial x_t}\, \frac{\partial x_t}{\partial w_i} = \epsilon\, a_i^t\, \frac{\partial \log L}{\partial x_t} \qquad (2.2)$$
where $\epsilon$ is the learning rate, $a_i^t$ is the activity level of the $i$th delay tap, and $x_t$ is the output of the linear unit in case $t$. A crude way to deal with the sum of two exponential terms in equation 2.1 is to ignore the smaller one. If we assume that the two gaussians have equal variances and equal mixing proportions, this amounts to ignoring the possibility that an output value $x_t$ could have been generated from the gaussian that is farther away. With this crude simplification and the assumption that $\mu_1 < \mu_2$ we get a very simple expression for the derivative of the log likelihood
$$\frac{\partial \log L}{\partial x_t} = \begin{cases} (\mu_1 - x_t)/\sigma^2 & \text{if } x_t < (\mu_1 + \mu_2)/2 \\ (\mu_2 - x_t)/\sigma^2 & \text{otherwise} \end{cases} \qquad (2.3)$$
The $1/\sigma^2$ term simply modifies the learning rate, so if we set $\mu_1 = -1$ and $\mu_2 = +1$ and substitute equation 2.3 into equation 2.2 we get exactly the bootstrap Widrow-Hoff procedure. The simplification used is identical to the simplification used in "hard" competitive learning in which the weights that represent the "center" of a competitive unit are regressed toward the current input vector if and only if that competitive unit wins the competition. This is equivalent to treating the competitive units as gaussians of fixed variance and only adapting the center of the gaussian most likely to have generated the data.
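As a rough numerical companion to this section, the sketch below evaluates the two-gaussian log likelihood and the crude "nearest gaussian" derivative. The function names and default parameter values are hypothetical; equal mixing proportions and a shared standard deviation `sigma` are assumed for the derivative, matching the simplification in the text.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def log_likelihood(outputs, p1=0.5, p2=0.5, mu1=-1.0, mu2=1.0, sigma1=1.0, sigma2=1.0):
    """Log likelihood of the outputs under the two-gaussian mixture model."""
    return sum(math.log(p1 * gaussian_pdf(x, mu1, sigma1) + p2 * gaussian_pdf(x, mu2, sigma2))
               for x in outputs)


def hard_gradient(x, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Crude derivative of the log likelihood with respect to the output x:
    only the nearer gaussian is considered (assumes mu1 < mu2)."""
    nearest = mu1 if x < 0.5 * (mu1 + mu2) else mu2
    return (nearest - x) / sigma ** 2


# With means at -1 and +1 and sigma = 1, the hard gradient is the
# Widrow-Hoff error term (thresholded output minus actual output).
g = hard_gradient(0.4)
```

One point worth noticing when experimenting with this sketch: the mixture only has two distinct peaks when `sigma` is small relative to the separation of the means, so tight clusters score a higher log likelihood than loose ones only in that regime.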
3 A Correct Maximum Likelihood Learning Procedure
We continue to assume that the gaussians have equal variances and equal mixing proportions, but we now take into account the fact that any given output value could be generated by either gaussian. Equation 2.1 then yields
$$\frac{\partial \log L}{\partial x_t} = \frac{\lambda(\mu_1 - x_t) + (1 - \lambda)(\mu_2 - x_t)}{\sigma^2} \qquad (3.1)$$
where
$$\lambda = \frac{1}{1 + \exp\{[(x_t - \mu_1)^2 - (x_t - \mu_2)^2]/2\sigma^2\}} \qquad (3.2)$$
Comparing equation 3.1 with equation 2.3, we see that the hard decision between the two gaussians has been replaced by a soft, sigmoid function which varies smoothly from 0 to 1 with a value of 0.5 at the midpoint of the two gaussians. If $\mu_1 = -\mu_2 = -1$ and $\sigma = 1$, the exponent in the expression for $\lambda$ is simply $2x_t$. The weakness of the hard decision used by the bootstrapping procedure is apparent when we realize that when the output values are very near 0 they are most likely to have the wrong sign because these are the cases when the hard decision is most likely to be incorrect (assuming that the means of the two gaussians are symmetric about 0). Yet it is in these cases that the weights are changed the most. In the "soft" algorithm, with its sigmoid decision function, the two terms approximately cancel out for output values near 0, so it makes only small weight changes in these highly ambiguous cases. Following this line of reasoning, we might expect that if there is enough noise and distortion to force the outputs to be frequently in the region near 0, the "soft" algorithm will outperform the "hard" algorithm. Simulation results support this conclusion. Figure 3 shows a set of typical simulation results. The curves in this figure show the mean squared error (in dB) versus the number of updates for the "soft" model with several different values for $\sigma$, which sets the variance of the two gaussians. Larger values of $\sigma$ correspond to a shallower slope for the sigmoidal decision function. $\sigma = 0$ corresponds to the "hard" decision rule. The mean squared error is decreased most rapidly initially by using relatively large values of $\sigma$, showing the superiority of the soft algorithm when the output decisions are prone to error. The hard algorithm eventually catches up to the soft algorithm as the equalizer adapts and the output values become close to +1 ($\mu_2$) or -1 ($\mu_1$) most of the time.
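The contrast between the hard and soft rules can be checked numerically with a short sketch. The function names are hypothetical, equal variances and mixing proportions are assumed as in the text, and the overflow guard is an implementation detail added here, not something the paper discusses.

```python
import math


def responsibility(x, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Soft decision: probability that output x was generated by gaussian 1,
    assuming equal variances and equal mixing proportions."""
    exponent = ((x - mu1) ** 2 - (x - mu2) ** 2) / (2.0 * sigma ** 2)
    if exponent > 700.0:  # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


def soft_gradient(x, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Derivative of the log likelihood with both gaussians taken into account."""
    lam = responsibility(x, mu1, mu2, sigma)
    return (lam * (mu1 - x) + (1.0 - lam) * (mu2 - x)) / sigma ** 2


def hard_gradient(x, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Derivative when only the nearer gaussian is considered (mu1 < mu2)."""
    nearest = mu1 if x < 0.5 * (mu1 + mu2) else mu2
    return (nearest - x) / sigma ** 2


# Compare the two rules for an ambiguous output and a confident one.
for x in (0.1, 0.9):
    print(x, hard_gradient(x, sigma=0.5), soft_gradient(x, sigma=0.5))
```

For an output near 0 the hard rule produces a large change while the soft rule produces a much smaller one; for an output close to one of the means the two rules nearly coincide, which is consistent with the hard algorithm catching up once the equalizer has adapted.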
The most rapidly converging algorithm, labeled var in the figure, continuously reestimated the variance while using the soft decision rule. This amounts to adaptively adjusting the learning rate (see below). In modem equalization it is crucial to be able to adapt rapidly to sudden fluctuations in the transmission line. The use of a "soft" decision rule in the bootstrap learning procedure allows more rapid adaptation to sudden fluctuations by discounting the error terms that are most likely to be incorrect.
Figure 3: Mean squared error versus the number of updates for a signal with moderate distortion.
4 Adapting the Parameters of the Gaussians
If the means of the gaussians were allowed to adapt, both means and all the output values would converge on the same value. However, the mixing proportions and the variances can be adapted to maximize log likelihood. The most tractable step in this direction is to assume the mixing proportions are fixed and equal and to allow the variances to adapt subject to the constraint that they remain equal. One method of adapting the variances would be to change $\sigma$ by an amount proportional to the derivative of the log likelihood. A faster method, which we used in the simulation labeled var in Figure 3, is an incremental version of the EM algorithm (Dempster et al. 1976). The batch version of EM simply sets $\sigma$ to a value guaranteed to yield higher likelihood (given the current output values) after a batch of training cases. The incremental version uses an exponential decay factor to weight previous cases, and then applies EM after every training case. The actual update rule then becomes
$$\sigma^2(t+1) = \kappa\,\sigma^2(t) + (1 - \kappa)\left[\lambda (x_t - \mu_1)^2 + (1 - \lambda)(x_t - \mu_2)^2\right] \qquad (4.1)$$
where $\kappa$ is a decay rate slightly less than 1 for discounting past data, and $\lambda$ is defined in equation 3.2. This incremental method does not require a learning rate for the variance adaptation. The decay rate for the exponential averaging of past data is based on the degree of stationarity of the data, not on properties of the learning algorithm.
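The incremental variance update can be sketched as an exponentially weighted average of the responsibility-weighted squared deviations. This is an illustrative reading of the update rule, not the authors' code: the function names and the default decay rate `kappa = 0.99` are hypothetical, and the responsibility is computed from the current shared variance with equal mixing proportions.

```python
import math


def responsibility(x, variance, mu1=-1.0, mu2=1.0):
    """Probability that output x came from gaussian 1, for a shared variance."""
    exponent = ((x - mu1) ** 2 - (x - mu2) ** 2) / (2.0 * variance)
    if exponent > 700.0:  # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


def update_variance(variance, x, kappa=0.99, mu1=-1.0, mu2=1.0):
    """One incremental EM step: decay the old variance estimate by kappa and
    blend in the responsibility-weighted squared deviation of the new output."""
    lam = responsibility(x, variance, mu1, mu2)
    new_term = lam * (x - mu1) ** 2 + (1.0 - lam) * (x - mu2) ** 2
    return kappa * variance + (1.0 - kappa) * new_term


# Outputs sitting exactly on the two means shrink the variance estimate.
variance = 1.0
history = [variance]
for x in (1.0, -1.0, 1.0, -1.0):
    variance = update_variance(variance, x, kappa=0.9)
    history.append(variance)
```

No separate learning rate appears: the only free quantity is `kappa`, which sets how quickly old cases are forgotten.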
5 The Extension to Multilayer Networks
Our analysis of the bootstrapping procedure used for adaptive equalization can be extended to more complex networks and gives rise to a new class of unsupervised learning procedures in which the objective of forming tight clusters is used to generate an error signal that can be backpropagated through layers of nonlinear units. To prevent all the input vectors from being mapped to the same output vector, we can fix the cluster centers in advance, or we can use a scale-invariant objective function that takes the distance between cluster centers into account when evaluating the tightness and thus eliminates the ability to improve the objective function by simply moving the cluster centers closer together. To prevent arbitrary mappings of input vectors into clusters, we can insist on using relatively simple functions. For example, we can assume that the log likelihood of a function is proportional to the sum of the squares of the weights that it uses and we can then trade-off cluster tightness against the prior log likelihood of the function. A more sophisticated version of this idea would be to estimate the complexity of the function by fitting a mixture of gaussians model to the set of weight values and using the combined description length of the mixture model and the weights given that model as an upper bound on the complexity. In this case, we would be trading off the tightness of the clusters of output values against the tightness of the clusters of weight values. Bridle has independently developed a similar approach to unsupervised cluster-formation in multilayer networks (John Bridle, unpublished research note SP4-RN66, October 1988).
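The simplest form of the trade-off described above, cluster tightness against a sum-of-squared-weights prior, might be sketched as follows. Everything here is an assumption made for illustration: the function names, the weighting constant `alpha`, the fixed cluster centers at -1 and +1, and in particular the sign convention, under which larger weights lower the objective.

```python
import math


def mixture_log_likelihood(outputs, sigma=0.5, mu1=-1.0, mu2=1.0):
    """Cluster tightness: log likelihood of the outputs under two equally
    weighted gaussians with fixed centers and a shared standard deviation."""
    norm = 0.5 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for x in outputs:
        density = norm * (math.exp(-(x - mu1) ** 2 / (2.0 * sigma ** 2))
                          + math.exp(-(x - mu2) ** 2 / (2.0 * sigma ** 2)))
        total += math.log(max(density, 1e-300))  # floor avoids log(0) on underflow
    return total


def penalized_objective(outputs, weights, alpha=0.01, sigma=0.5):
    """Tightness of the output clusters minus a penalty on the squared weights
    (the penalty sign and the constant alpha are assumptions of this sketch)."""
    return mixture_log_likelihood(outputs, sigma) - alpha * sum(w * w for w in weights)


score = penalized_objective([0.9, -1.0], [1.0, 1.0])
```

In a multilayer network the derivative of such an objective with respect to each output would play the role of the error signal that is backpropagated.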
6 Discussion
Learning procedures for neural networks are often designed by an intuitive leap (Durbin and Willshaw 1987; Crick and Mitchison 1983). This is particularly true of the unsupervised learning procedures. Those that actually work in practice are usually found to be an approximation to steepest ascent in some sensible statistical measure such as log likelihood or mutual information (Durbin et al. 1989; Hinton and Sejnowski 1986). The bootstrapping algorithm is no exception. This adds further support to the idea that future unsupervised algorithms should be designed by explicitly differentiating a sensible objective function, and then making approximations to achieve easy implementation.
Acknowledgments
We thank Yann Le Cun and John Bridle for helpful discussions. This research was funded by a grant from the Ontario Information Technology Research Center.
References
Crick, F., and Mitchison, G. 1983. The function of dream sleep. Nature (London) 304, 111-114.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1976. Maximum likelihood from incomplete data via the EM algorithm. Proc. R. Stat. Soc. 1-38.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Durbin, R., and Willshaw, D. 1987. The elastic net method: An analogue approach to the travelling salesman problem. Nature (London) 326, 689-691.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP group, eds. MIT Press, Cambridge, MA.
Lucky, R. W. 1966. Techniques for adaptive equalization of digital communications systems. Bell Syst. Tech. J. 45, 255-286.
Qureshi, S. U. H. 1985. Adaptive equalization. Proc. IEEE 73, 1349-1387.
Widrow, B., and Hoff, M. E., Jr. 1960. Adaptive switching circuits. IRE WESCON Convention Rec. 96-104.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Received 15 March 90; accepted 3 May 90.
Communicated by Richard Durbin
The Effects of Precision Constraints in a Backpropagation Learning Network Paul W. Hollis John S. Harper John J. Paulos Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695-7911 USA
This paper presents a study of precision constraints imposed by a hybrid chip architecture with analog neurons and digital backpropagation calculations. Conversions between the analog and digital domains and weight storage restrictions impose precision limits on both analog and digital calculations. It is shown through simulations that a learning system of this nature can be implemented in spite of limited resolution in the analog circuits and using fixed point arithmetic to implement the backpropagation algorithm. 1 Introduction Learning networks employing error backpropagation have been widely used in simulation environments where CPU time is the major limiting factor. Although a silicon implementation that can learn and function in real time is desirable for many applications, it imposes constraints that are not present in simulation environments. Among these constraints are the fidelity of the artificial neuron to the intended neuron response, the resolution of mathematical calculations, and the methods for modification and storage of weights. The major thrust of this work is to study the impact of bounded weights with limited resolution and limited calculation precision on the quality and speed of learning. A basic architecture utilizing analog neurons and digital backpropagation calculations is assumed. Simulations probe the effects of constraints imposed by transportation of data between analog and digital domains. Encouraging results were obtained using a function learning problem as a benchmark. 2 Benchmark Problem
The benchmark chosen is a simple, well-known problem described in Lippman and Beckman (1989), in which the network learns a continuous nonlinear mapping from a single input data point to a single output point. The network has two hidden layers in a feedforward topology with 1 input node, 20 first hidden layer nodes, 5 second hidden layer nodes, 1 output node, and the appropriate bias nodes. The output node is linear while all hidden nodes have a sigmoidal nonlinearity. The input to the network is a single randomly generated point in the interval [-1.0, +1.0] and the target is the mapping from a relatively steep sigmoidal curve. For ease of data handling, an epoch is defined as a sequence of 50 such points. Incremental learning is employed; that is, there is one training cycle (one feedforward and one error backpropagation phase) for each data point.
Neural Computation 2, 363-373 (1990) © 1990 Massachusetts Institute of Technology
3 Target Architecture
The simulator was developed with a specific network architecture (Paulos and Hollis 1988a) and analog neuron model (Paulos and Hollis 1988b; Hollis and Paulos 1990) in mind. The application for this architecture is speech enhancement using real time signal processing on a single, time-varying input to produce a single analog output. The network will learn to compensate for specific hearing deficiencies through training with an individualized perceptual model (White et al. 1990). Figure 1 shows a block diagram of this architecture. The continuous time input $x(t)$ will be sampled and shifted through an analog delay-line. The analog values in the delay-line are the inputs to the analog feedforward network. Feedforward calculations are performed by analog neurons whose weights are digitally controlled by values ($w_{ij}$) stored in large serially connected shift registers. During the learning phase, weights are shifted into a digital backpropagation processor for modification by specialized but relatively simple weight processing units (WPUs) using the generalized delta rule (Rumelhart et al. 1986). There is one WPU for each hidden layer neuron and the output neuron. The updated weights $w'_{ij}$ are then shifted back to the feedforward network for the next feedforward phase. Each WPU also requires a digital representation of its associated neuron output as well as the outputs $y_j$ from the previous layer neurons. The neuron outputs will be analog-to-digital (A/D) converted and loaded into the WPUs in synchrony with the weights. The analog output $y[n]$ changes each sample period and is compared with a target (produced by the perceptual model) to generate the error signal $\delta$ that must also be converted to a digital value.
Figure 1: Functional block diagram of the target network architecture. (Block labels: Input x(t), Analog Delay-line, Analog Feed-forward Network, Digital Backpropagation Processor, Target, Output, Learning Chip.)
4 Network Constraints
Finite precision errors can arise in three places with this architecture. The most obvious of these is the precision of the WPU calculations. Although there is a tradeoff in the WPU between physical size and precision, the real limit to the effective precision of the backpropagation calculations is the length in bits of each weight register. Any additional precision in a WPU result (assuming the WPU internal registers are at least as large as a weight register) is thrown away when stored in a lower precision weight register. Therefore the precision of the calculations in the WPU is synonymous with, and will be referred to as, weight precision. Another source of precision errors is the resolution of the analog weight values used in feedforward calculations. Since weights are stored as binary values in registers, the precision of the digital-to-analog (D/A) weight conversion can further constrain the actual feedforward weight values. The specific implementation under study utilizes a MOSFET analog multiplier neuron with a 6-bit plus sign weight (Paulos and Hollis 1988b; Hollis and Paulos 1990), which is approximately half of the weight register length. Since only the seven most significant bits (MSBs) will determine the weight value used in the feedforward calculation, this constraint will be referred to as weight truncation. Neuron output values are required by the backpropagation algorithm (Rumelhart et al. 1986) for weight change calculations. The analog neuron outputs must be converted to digital values on the chip and stored in registers for use in the WPUs. Although several methods of A/D conversion are feasible, all become much more complex and consume considerable area if high resolution is required. The number of bits representing a neuron output after A/D conversion will be referred to as the output quantization of a neuron. In backpropagation learning simulators weights are generally allowed to grow without bound. On a chip, weights stored in integer format
in registers become bounded. Simulation studies have shown that by scaling the neuron gains and learning rates, premature clipping (reaching one of the bounds) of the weights can be avoided without significantly affecting the solution quality. No momentum term is present in weight calculations because of the cost it would physically and temporally incur. Two complete sets of weights would be required (the current and previous weight values) and there would be increased overhead from additional calculations and control. 5 Simulator Implementation
The Boltzmann activation function was used for simulations rather than the model for the neuron that will be used in the silicon implementation. This is more consistent with standard network simulators and makes the results less dependent on the specific neuron circuit. However, some modifications were made to more closely model the network architecture. The neuron model for the target architecture has a differential output which can swing symmetrically between a negative and positive limit, and can be represented by continuous values over the range [-1.0, 1.0]. The Boltzmann activation function for output $y_i$, modified to accommodate this range, is shown below:
(5.1)
This modification simply scales the output range and centers it about zero. $A_i$ represents the gain of the neuron and $u_i$ is a linear sum of the weight-input product terms contributing to neuron $i$. In the simulator, weights are represented as floating point numbers bounded by limits of [-1.0, 1.0]. They can therefore saturate or clip just as neuron outputs can saturate. A truncation function $T$ limits the values a weight can assume in feedforward calculations to an integer number (power of 2) of values uniformly distributed over the weight range. This implements weight truncation as discussed earlier and represents the resolution of an analog feedforward weight. The equation for the summation $u_i$ becomes

$$u_i = \sum_j T(w_{ij})\, y_j \tag{5.2}$$

where $y_j$ is the output from neuron $j$ in the layer feeding forward to neuron $i$, $w_{ij}$ is the weight between the two neurons, and $T(w_{ij})$ represents a truncation of the stored weight values. For backpropagation $y_i(1 - y_i)A_i$, the derivative of the standard activation function, must be modified to $(1 - y_i)(1 + y_i)A_i/4$ to generate the error equations. Notice that neither of these derivative functions introduces a sign change for their respective output ranges. The error equations now become
(5.3)
or
(5.4)
In equation 5.3 $\delta_i$ is the error for output layer neuron $i$, while $y_i$ is its output value and $t_i$ is the associated target value. $Q$ is an output quantization function that represents the precision of the A/D conversion of $y_i$. In equation 5.4 $\delta_j$ is the error for a hidden layer neuron $j$ where $y_j$ is its output, $\delta_i$ is the error propagating backward from output neuron $i$, and $w_{ij}$ is the weight feeding that error to neuron $j$. $P(w_{ij})$ is a weight precision function that represents the number of bits in a weight register. Both the output quantization and weight precision functions operate in a manner similar to the truncation function above. A weight is updated by changing the original value by some small amount. This change is made using the relationship
(5.5)
where $w'_{ij}$ is the new value for a weight between neurons $i$ and $j$. $\eta$ is the learning rate associated with weights between the two layers containing neurons $i$ and $j$. Although $\eta$ and $A_i$ are usually global quantities, the simulator allows different $\eta$ and $A_i$ values to be associated with weights of different layers.
6 Results
All of the parameter scaling and resolution experiments were performed using the modified Boltzmann activation neuron. During each training cycle the root mean square (RMS) error was calculated for the output and target values associated with the input data point. These errors were then averaged over 5 epochs (250 points) and this average will be referred to as the group error. Training in the simulator continued for 1600 epochs or until the group error became less than 0.005. In addition several group error milestones (error criteria of 0.005, 0.01, 0.02, and 0.05) were established and the number of epochs at which each level was reached was logged. Figure 2 shows the sigmoid function being learned along with typical network outputs when trained to each error criterion. Weights were initialized with small random values not exceeding 5% of the full range. Weight precision, weight truncation, and output quantization were each tested individually with decreasing resolution (with the remaining two fixed at maximum resolution) to determine the minimum usable values for each.
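The three precision functions tested here (weight truncation T, weight precision P, and output quantization Q) are all described as restricting a value in [-1.0, 1.0] to a power-of-2 number of uniformly spaced levels. The following is a minimal sketch of such a quantizer and of the group error, not the authors' simulator code. The function names are hypothetical, and rounding toward the lower level (rather than to the nearest level) is an assumption chosen to mimic dropping least significant bits.

```python
import numpy as np

def quantize(value, bits, lo=-1.0, hi=1.0):
    """Map `value` onto 2**bits uniformly spaced levels in [lo, hi).

    Assumption: values are rounded down to the level below them,
    which mimics discarding least significant bits.
    """
    levels = 2 ** bits
    step = (hi - lo) / levels
    index = np.floor((np.asarray(value, dtype=float) - lo) / step)
    index = np.clip(index, 0, levels - 1)   # saturate at the range limits
    return lo + index * step

def group_error(outputs, targets):
    """Average of per-point RMS errors over one group of training cycles.

    With a single output node the RMS error of one point is |output - target|.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.sqrt((outputs - targets) ** 2)))
```

With these definitions, `quantize(0.01, 13)` retains a small nonzero weight while `quantize(0.01, 7)` returns 0.0, which illustrates why a stored weight can carry more resolution than the weight seen by the feedforward path.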
Figure 2: Output mappings for network trained to RMS errors of 0.05 and 0.01 compared to target function. (Horizontal axis: INPUT, data points, from -1 to 1.)
Initial simulations were performed to determine values for the $\eta$s and gains at each layer that would provide good solutions over a range of lower precision values and allow utilization of the full weight range without premature clipping. Good $\eta$ values were 1.33 and 1.6 for the first and second hidden layers and 0.5 for the output neuron, while the associated gain values were 3.0, 2.5, and 1.0. Lower $\eta$s and/or gains improve results at higher precisions, but degrade results at lower precisions. Higher $\eta$s and/or gains cause instability (weights rapidly clip and no learning occurs) regardless of the precision. Figure 3 shows the simulation results for differing weight precisions. Above each weight precision value is a group of four lines that represents the spread (endpoints of each line) and the average (bar toward middle of each line) of solution times (in epochs) for each of the four error criteria mentioned earlier. The leftmost line in each group represents the lowest or most difficult criterion to meet (0.005 RMS) while the rightmost line represents the easiest criterion to meet (0.05 RMS). The number above each group gives the percentage of trials that converged at each precision for all of the members of that group not individually labeled. For instance, reading from left to right at 12 bits of precision, 85% of the trials converged to error criteria of
0.005, 0.01, and 0.02, while 90% of the trials converged to an error of 0.05. For precisions of 14-17 bits, nonconvergence was due to network instability. For precisions of 11-12 bits, nonconvergence was caused by learning that stalled before reaching the pertinent error criterion. Overall, the best network performance occurred with a precision of 13 bits and was comparable to the performance obtained at 17 bits with the optimal $\eta$s (factor of 2 smaller for the hidden layers) for that precision.
Figure 3: Minimum, average, and maximum convergence times in epochs for 20 trials at each weight precision. From left to right each line in a precision group is for convergence to criteria of 0.005, 0.01, 0.02, and 0.05. Percentage of trials that converged, excepting lines individually labeled, is shown above each group. (Horizontal axis: WEIGHT PRECISION, in number of bits.)
Figures 4 and 5 show simulation results for weight truncation and output quantization with graphs of the same type as Figure 3 and using the same $\eta$ and gain values. Nonconvergence was due to the same problems as above. Reasonably consistent performance was obtained down to 8 bits of weight truncation. At 7 bits only 10% of the trials could reach the lowest error criterion. At 5 bits the network happened to find a good minimum and converged rapidly to the lowest criterion during one trial. Interestingly, performance with output quantization (Fig. 5) does not seem to suffer from network instability. Convergence was again
consistent down to 8 bits. At 7 bits no convergence occurs at the lowest criterion.
Figure 4: Minimum, average, and maximum convergence times in epochs for 20 trials at each weight truncation value. (Horizontal axis: WEIGHT TRUNCATION, in number of bits.)
These results demonstrate that the precision for feedforward calculations can be much lower than that used for the weight change calculations. Weight values can be stored in registers that are $k$ bits long with only the most significant $n < k$ bits used to perform the feedforward calculations. The weight value used in the feedforward calculation will not be affected until the accumulated weight changes become large enough to flip the least significant bit of the truncated weight value. Next, trials were performed combining lower precisions for all of the constraints. Figure 6 shows typical learning curves for three different situations: (1) all constraints at maximum precision, (2) weight precision at 13 bits, weight truncation at 7 bits, and output quantization at 7 bits, and (3) weight precision at 13 bits, weight truncation at 6 bits, and output quantization at 6 bits. With the lower precision constraint combinations it was again necessary to change $\eta$s and gains to elicit good performance. The lowest precision case used $\eta$s up to an order of magnitude larger than those used in the maximum precision case. Judging from the output curves of Figure 2, all three cases exhibit reasonable performance. It
seems that lower precision learning can be characterized as finding a "decent" minimum relatively quickly and staying there forever, although occasionally a very good minimum is found.
Figure 5: Minimum, average, and maximum convergence times in epochs for 20 trials at each output quantization value. (Horizontal axis: OUTPUT QUANTIZATION, in number of bits.)
Another series of simulations with the same conditions but incorporating a different network and a different problem (recovering a pulse train in noise) yielded results that correlate very closely with these. Another author (Gilbert 1988) conducted a similar study based upon problems where correct classification, not the actual target/output error, was the important metric. In this case, lower precisions are possible since higher error can be tolerated and because the inputs and targets can assume only binary values. Finally the network size and how it affects low precision performance was also examined. Both hidden layers were independently increased and decreased in size and trials varying the weight precision were performed on each unique network. When either hidden layer is doubled in size no significant differences in the convergence characteristics are noticed. The case where the second hidden layer was decreased from 5 to 3 nodes particularly stands out. Down to 17 bits, performances of the two networks were almost identical implying that the smaller network is
sufficient for the problem. However from 16 bits on down the performance of the smaller network was worse. This seems to be evidence that an oversized network can help overcome weight precision constraints.
Figure 6: Typical learning curves for three combinations of the precision constraints. Legend shows precisions in the form of weight precision/weight truncation/output quantization: max/max/max, 13/7/7, and 13/6/6. (Horizontal axis: EPOCHS, 50 data points per epoch.)
7 Conclusions
This study demonstrates that learning is possible in a hybrid analog/digital backpropagation network with low precision analog circuits. A specific architecture and a case study problem are described that allow precision constraints imposed by the architecture to be examined. Results show that it is not necessary to have high precision floating point arithmetic and operands to digitally implement the backpropagation algorithm. By individually adjusting learning rates and neuron gains at each layer, analog neurons with 8-bit digital weights, 13-bit weight update arithmetic, and 8-bit quantization of neuron outputs were used to consistently train a function-learning network to demanding error criteria.
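The separation between update precision and feedforward precision can be illustrated with plain integer arithmetic. The sketch below is not taken from the paper: it assumes a signed two's-complement register of k = 13 bits whose n = 7 most significant bits (the 6-bit plus sign weight of Section 4) are read by the feedforward path, and the function names are hypothetical.

```python
def feedforward_weight(register, k=13, n=7):
    """Return the n most significant bits of a signed k-bit register value.

    Python's >> on integers is an arithmetic shift, so the sign is preserved.
    """
    return register >> (k - n)

def updates_until_visible(register, delta, k=13, n=7):
    """Count equal integer updates needed before the truncated weight changes."""
    if delta == 0:
        raise ValueError("delta must be nonzero")
    lo, hi = -(1 << (k - 1)), (1 << (k - 1)) - 1   # bounds of the k-bit register
    start = feedforward_weight(register, k, n)
    count = 0
    while feedforward_weight(register, k, n) == start:
        register = max(lo, min(hi, register + delta))  # weights clip at the bounds
        count += 1
        if register in (lo, hi) and feedforward_weight(register, k, n) == start:
            raise ValueError("weight clipped before the truncated value changed")
    return count
```

Starting from a register value of 0, unit-sized updates must accumulate 64 times before the 7-bit feedforward weight moves by one step, so small weight changes are retained by the register even though the analog path does not see them immediately.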
The final determination of the precision required for a given problem essentially depends on the amount of output error that can be tolerated. Acknowledgments This work is supported by funding from the Office of Naval Research under Grant N00014-89-J-1461.
References
Gilbert, S. L. 1988. Implementing artificial neural networks in integrated circuitry: A design proposal for back-propagation. Tech. Rep. 810, MIT Lincoln Laboratory.
Hollis, P. W., and Paulos, J. J. 1990. Artificial neural networks using MOS analog multipliers. IEEE J. Solid-State Circuits 25, 849-855.
Lippman, R. P., and Beckman, P. 1989. Adaptive neural net preprocessing for signal detection in non-gaussian noise. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 124-132. Morgan Kaufmann, San Mateo, CA.
Paulos, J. J., and Hollis, P. W. 1988a. A VLSI architecture for feedforward networks with integral back-propagation. Presented at the First Annual Meeting of the International Neural Network Society, Boston, Massachusetts.
Paulos, J. J., and Hollis, P. W. 1988b. Neural networks using analog multipliers. Proc. IEEE Int. Symp. Circuits and Syst. 499-502.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Vol. I, Ch. 8, pp. 323-330. MIT Press, Cambridge.
White, M. W., Holdaway, R. M., Guo, Y., and Paulos, J. J. 1990. New strategies for improving speech enhancement. Intl. J. Biomed. Comput. 25, 101-124.
Received 28 July 1989; accepted 22 June 90.
Communicated by Gerard Dreyfus
Exhaustive Learning D. B. Schwartz V. K. Samalam GTE Laboratories, Waltham, MA 02254 USA
Sara A. Solla J. S. Denker AT&T Bell Laboratories, Holmdel, NJ 07733 USA
Exhaustive exploration of an ensemble of networks is used to model learning and generalization in layered neural networks. A simple Boolean learning problem involving networks with binary weights is numerically solved to obtain the entropy $S_m$ and the average generalization ability $G_m$ as a function of the size $m$ of the training set. Learning curves $G_m$ vs $m$ are shown to depend solely on the distribution of generalization abilities over the ensemble of networks. Such distribution is determined prior to learning, and provides a novel theoretical tool for the prediction of network performance on a specific task. 1 Introduction Layered networks are useful in their ability to implement input-output maps $y = f(x)$. The problem that arises is that of designing networks to implement a desired map $\bar{f}$. Supervised learning searches for networks that satisfy the map $\bar{f}$ on a restricted set of points, the training examples. An outstanding theoretical question is that of predicting the generalization ability of the resulting networks, defined as the ability to correctly extend the domain of the function beyond the training set. Theoretical and predictive analyses of the performance of networks that are trained from examples are few (Denker et al. 1987; Carnevali and Patarnello 1987; Baum and Haussler 1989), in contrast to the large effort devoted to the experimental application and optimization of various learning algorithms. Such experimental results offer useful solutions to specific problems but shed little light on general theoretical issues, since the solutions are heavily influenced by the intrinsic dynamics of the chosen algorithm. A theoretical analysis based on the global and statistical properties of an ensemble of networks (Denker et al. 1987; Carnevali and Patarnello 1987) requires reliable information about such ensemble, unbiased by the peculiarities of the specific strategy adopted to search for appropriate networks within the ensemble.
Neural Computation 2, 374-385 (1990) @ 1990 Massachusetts Institute of Technology
Progress in the theoretical understanding of complex systems is often triggered by intuition obtained through carefully designed numerical experiments. We have therefore chosen a Boolean classification task involving low resolution weights, which enables us to explore the network ensemble exhaustively. Such unbiased search, although hardly useful as a practical tool, is free from the constraints intrinsic to current learning algorithms. It reveals the true properties of the network ensemble as determined by the choice of architecture, and is used here to monitor, without introducing any additional bias, the evolution of the ensemble through training with examples of the desired task. The insight gained from the numerical experiments led to a theoretical analysis of supervised learning and the emergence of generalization ability, presented in Section 2 of this paper. The numerical experiments that motivated the theoretical framework are described in Section 3. An analysis of the numerical results according to the theory, as well as some applications of the theory to more realistic problems, are provided in Section 4. 2 Theoretical Framework
Consider an ensemble of layered networks with fixed architecture and varying couplings. Such ensemble is described by its configuration space $\{W\}$: every point $W$ is a list of values for all couplings needed to select a network design within the chosen architecture. The resulting network realizes a specific input-output function, $y = f_W(x)$. For simplicity, consider Boolean functions $y \in \{0, 1\}$ on a Boolean $x \in \{0, 1\}^N$ or real $x \in R^N$ domain. A prior density $\rho_0(W)$ constrains the effective volume of configuration space to

$$Z_0 = \int dW \, \rho_0(W). \tag{2.1}$$

Regions corresponding to the implementation of the function $f$ are identified by the masking function

$$\Theta_f(W) = \begin{cases} 1 & \text{if } f_W = f \\ 0 & \text{if } f_W \neq f \end{cases}$$

and occupy a volume

$$Z_0(f) = \int dW \, \rho_0(W) \, \Theta_f(W). \tag{2.2}$$

The specification of an architecture and its corresponding configuration space thus defines a probability on the space of functions:

$$P_0(f) = \frac{Z_0(f)}{Z_0}, \tag{2.3}$$

which results from a full exploration of configuration space. $P_0(f)$ is the probability that a randomly chosen network in configuration space will realize the function $f$. The class of functions implementable by a given architecture is

$$\mathcal{F} = \{ f \mid P_0(f) \neq 0 \}. \tag{2.4}$$

The entropy of the prior distribution

$$S_0 = -\sum_{f \in \mathcal{F}} P_0(f) \ln P_0(f) \tag{2.5}$$

is a measure of the functional diversity of the chosen architecture. The maximum value of $S_0 = \ln(n_F)$, where $n_F$ is the number of functions in class $\mathcal{F}$, is attained when all realizable functions are equally likely, and corresponds to the uniform distribution, $P_0(f) = 1/n_F$ for all $f \in \mathcal{F}$. Supervised learning results in a monotonic reduction of the effective volume of configuration space. An example $\xi^\alpha = (x^\alpha, y^\alpha)$ of the desired function $\bar{f}$ is learned by removing from $\mathcal{F}$ every function that contradicts it. A sequence of $m$ input-output pairs $\xi^\alpha = (x^\alpha, y^\alpha)$, $1 \le \alpha \le m$, which are examples of $\bar{f}$ thus defines a sequence of classes of functions,

$$\mathcal{F}_m \subseteq \mathcal{F}_{m-1} \subseteq \cdots \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_0 = \mathcal{F},$$

where every function $f \in \mathcal{F}_m$ correctly classifies all of the training examples $\xi^\alpha$, $1 \le \alpha \le m$. The effective volume of configuration space is reduced to

$$Z_m = \sum_{f \in \mathcal{F}_m} Z_0(f) \tag{2.6}$$

by learning a training set of size $m$. The probability on the space of functions is modified by learning and becomes

$$P_m(f) = \begin{cases} Z_0(f)/Z_m & \text{if } f \in \mathcal{F}_m \\ 0 & \text{otherwise.} \end{cases} \tag{2.7}$$

$P_m(f)$ is the probability that $f$ has not been eliminated by one of the $m$ examples and is thus a member of $\mathcal{F}_m$. The total volume of configuration space occupied by functions $f \in \mathcal{F}_m$ is $Z_m$. The entropy of the posterior distribution,

$$S_m = -\sum_{f \in \mathcal{F}_m} P_m(f) \ln P_m(f), \tag{2.8}$$

reflects the narrowing of the probability distribution: $S_m < S_0$. The entropy decrease $\eta_m = S_{m-1} - S_m$ defines the efficiency of learning the $m$th example. Learning corresponds to a monotonic contraction of the effective volume of configuration space: $Z_m \le Z_{m-1}$. Exhaustive learning, as defined
here, leads to the complete exclusion of networks incompatible with each training example. Such error-free learning excludes the possibility of data so noisy as to contain intrinsic incompatibilities in the training set. A recent extension of the theory (Tishby et al. 1989) provides the tools to analyze the case of learning with error. The entropy decrease $(S_0 - S_m)$ is the information gain, that is, the information extracted from the examples in the training set. The residual entropy $S_m$ measures the functional diversity of the ensemble of trained networks. The optimal case of $S_m = 0$ corresponds to the elimination of all ambiguity about the function to be implemented. In general $S_m \neq 0$, and its value measures the lack of generalization ability of the trained networks. A more detailed description of the generalization ability achieved by supervised learning is based on the generalization ability $g(f)$ of the individual functions $f \in \mathcal{F}$, defined as the probability that $f$ will correctly classify a randomly chosen example of the desired function $\bar{f}$. As an illustration of the intrinsic ability of $f$ to reproduce $\bar{f}$, consider the simple case of a Boolean function from $N$ inputs onto 1 output. The function $f$ is specified by $2^N$ bits, indicating the output for every possible input. In this case

$$g(f) = 1 - \frac{d_H(f, \bar{f})}{2^N}, \tag{2.9}$$

where $d_H(f, \bar{f})$ is the Hamming distance between $f$ and $\bar{f}$, that is, the number of bits by which their truth tables differ. The survival probability $P_m(f)$ can be expressed recursively by noting that the probability of surviving a single additional example is on average just $g(f)$. Thus

$$P_m(f) = \frac{g(f) \, P_{m-1}(f)}{\sum_{f' \in \mathcal{F}} g(f') \, P_{m-1}(f')}, \tag{2.10}$$

where the denominator is required to maintain proper normalization. The recursion relation equation 2.10 is based on the assumption that $g(f)$ is independent of $m$, and thus it is valid provided $m$ remains small compared to the total number of possible inputs $\{x\}$. Such limitation is not severe: learning experiments are of interest when the network can indeed be trained with a set of examples that is a small subset of the total space. The generalization ability of trained networks is an ensemble property described by the probability density

$$\rho_m(g) = \sum_{f \in \mathcal{F}} P_m(f) \, \delta(g - g(f)). \tag{2.11}$$

The product $\rho_m(g) \, dg$ is the probability of generating networks with generalization ability in the range $[g, g + dg]$ by training with $m$ examples. The average generalization ability

$$G_m = \int_0^1 g \, \rho_m(g) \, dg, \tag{2.12}$$

given by

$$G_m = \sum_{f \in \mathcal{F}} g(f) \, P_m(f), \tag{2.13}$$

is the probability that a randomly chosen surviving network will correctly classify an arbitrary test example, distinct from the $m$ training examples. The recursion relation equation 2.10 for $P_m(f)$ can be rewritten as

$$P_m(f) = \frac{g(f) \, P_{m-1}(f)}{G_{m-1}} \tag{2.14}$$

and substituted into equation 2.11 to yield

$$\rho_m(g) = \frac{1}{G_{m-1}} \sum_{f \in \mathcal{F}} g(f) \, P_{m-1}(f) \, \delta(g - g(f)) \tag{2.15}$$

or

$$\rho_m(g) = \frac{g \, \rho_{m-1}(g)}{G_{m-1}}. \tag{2.16}$$

The recursion relation equation 2.16 is a crucial result of this theoretical analysis, since it provides a fundamental tool to both analyze and predict the outcome of supervised learning. Iterative applications of equation 2.16 lead to the relation

$$\rho_m(g) = \frac{g^m \rho_0(g)}{\int_0^1 g'^{\,m} \rho_0(g') \, dg'}. \tag{2.17}$$

The probability density $\rho_m(g)$ is thus fully determined by the initial distribution $\rho_0(g)$. Its average value $G_m$ (equation 2.12), given by

$$G_m = \frac{\int_0^1 g^{m+1} \rho_0(g) \, dg}{\int_0^1 g^m \rho_0(g) \, dg}, \tag{2.18}$$

is simply the ratio between the $(m+1)$th and the $m$th moments of $\rho_0(g)$, and can be computed if $\rho_0(g)$ is given or estimated. The entropy $S_m$ (equation 2.8) and average generalization ability $G_m$ (equation 2.12) are the fundamental tools to monitor the learning process. The picture that emerges is that of learning as a monotonic decrease of
the effective volume of configuration space, measured by a monotonic entropy decrease with increasing $m$. The contraction is not arbitrary: it emphasizes regions of configuration space with intrinsically high generalization ability. The iterated convolution with $g$ to obtain $\rho_m(g)$ from $\rho_0(g)$ (equation 2.17) results in an increasing bias toward $g = 1$, and a monotonic increase of the average generalization ability with increasing $m$.
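The dependence of the learning curve on the prior distribution of generalization abilities alone can be checked numerically. The following is a minimal sketch, assuming the prior is supplied as probabilities on a discrete grid of $g$ values (with at least one nonzero $g$ carrying weight); the function name is hypothetical. Each step multiplies the current distribution by $g$ and renormalizes, so the returned values are ratios of successive moments of the prior.

```python
import numpy as np

def learning_curve(g_values, prior_probs, m_max):
    """Average generalization ability G_m for m = 0..m_max.

    g_values    : grid of generalization abilities in [0, 1]
    prior_probs : prior weight on each grid point (normalized internally)
    """
    g = np.asarray(g_values, dtype=float)
    weights = np.asarray(prior_probs, dtype=float)
    weights = weights / weights.sum()              # discrete stand-in for the prior
    curve = []
    for _ in range(m_max + 1):
        curve.append(float(np.sum(g * weights)))   # G_m: mean of g under the current distribution
        weights = g * weights                      # learn one more example: multiply by g
        weights = weights / weights.sum()          # renormalize
    return curve
```

For a prior that is uniform on [0, 1] the moments are $1/(m+1)$, so the curve should follow $G_m = (m+1)/(m+2)$; for a two-point prior with equal weight on $g = 0.5$ and $g = 1$ the first three values are 0.75, 5/6, and 0.9.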
3 Numerical Experiments

Consider a layered network with L levels of processing. The network architecture is specified by the number $\{N_\ell\}$, $0 \le \ell \le L$, of units per layer, and its configuration by the weights $\{W^{(\ell)}_{ij}\}$ and biases $\{W^{(\ell)}_{i}\}$ for $1 \le \ell \le L$, $1 \le i \le N_\ell$, $1 \le j \le N_{\ell-1}$. The configuration space $\{W\}$, of dimensionality $D_W = \sum_{\ell=1}^{L} N_\ell (1 + N_{\ell-1})$, describes a canonical ensemble of networks with fixed architecture and varying couplings. Full explorations of configuration space are in general impractical due to the vast number of possible networks in $\{W\}$ and the correspondingly large number $n_F$ of realizable functions. Statistical sampling techniques are thus needed to extract reliable information on the prior distributions $P_0(f)$ and $p_0(g)$. Simplified problems with restricted architectures and binary weights $W^{(\ell)}_{ij} = \pm 1$ result in ensembles amenable to exhaustive exploration. Ensembles containing about a million networks have allowed here for the accurate computation of various ensemble averaged quantities, and led to the theoretical insight described in the preceding section.

Consider the contiguity problem (Denker et al. 1987; Solla 1989), a classification of binary input patterns $x = (x_1, \ldots, x_N)$, $x_i = 0, 1$ for all $1 \le i \le N$. Periodic boundary conditions are imposed on the N-bit input vectors, so that the last bit is adjacent to the first. The patterns are classified according to the number k of blocks of 1's in the pattern. For example, for N = 10, x = (1110001111) corresponds to k = 1, x = (0110110111) to k = 3, and x = (0010011111) to k = 2. The task is simplified into a dichotomy: the two categories correspond to $k \le k_0$ and $k > k_0$. This problem can be solved by an L = 2 layered network (Denker et al. 1987) with $N_0 = N_1 = N$ and $N_2 = 1$, and receptive fields of size 2. In the numerical results reported here all processing units are thresholding units: their output is 1 or 0 according to whether their input is positive or negative.

The bias $W^{(2)}_{1}$ of the output unit, the biases $W^{(1)}_{i}$ of the hidden units, and the weights $W^{(2)}_{1i}$ between hidden units and output unit are fixed at the values determined by the solution to the contiguity problem for $k_0 = 2$: $W^{(2)}_{1} = -2.5$, $W^{(1)}_{i} = -0.5$, and $W^{(2)}_{1i} = 1$ for all $1 \le i \le N$. The only degrees of freedom are thus the couplings between input units and hidden units. A receptive field of size 2 corresponds to the only nonzero couplings being $W^{(1)}_{i,i}$ and $W^{(1)}_{i,i-1}$, providing input to each hidden unit $1 \le i \le N$ from two input units: the one immediately below, and the adjacent one to the left. The configuration space corresponds to $W^{(1)}_{ij} = \pm 1$ for $j = i, i-1$ and $1 \le i \le N$. Even for such a simple example the configuration space is large: it consists of $2^{2N}$ distinct points. Two of them correspond to equivalent solutions to the contiguity problem: $W^{(1)}_{i,i} = +1$, $W^{(1)}_{i,i-1} = -1$, based on left-edge detection; and $W^{(1)}_{i,i} = -1$, $W^{(1)}_{i,i-1} = +1$, based on right-edge detection. The degeneracy of the remaining $(2^{2N} - 2)$ networks, that is, the extent to which they implement distinct functions, has not been investigated in depth.

The learning experiments are performed as follows: an explicit representation of the ensemble is constructed by listing all $2^{2N}$ possible networks. To generate a training set, randomly distributed examples within the $2^N$ points in input space are obtained by blocking in groups of N bits the output of a high quality random number generator. A training set is prepared by labeling subsequent examples $(x^\alpha, y^\alpha)$. The $\alpha$th example is learned by eliminating from the listing all the networks that misclassify it. The entropy $S_m$ is estimated by the logarithm of the number of surviving networks. The number of surviving networks is an upper bound to the number of surviving functions, and the two quantities are monotonically related. The average generalization ability $G_m$ is computed by testing each surviving network on a representative set of examples not included in the training set. The size of the testing set is chosen so as to guarantee a precision of at least 1% in the determination of $G_m$. Results reported here correspond to N = 9, 10, and 11. Smaller values of N yield poor results due to limits in the available number of examples: there are only 256 possible inputs for N = 8.
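The procedure described above can be illustrated with a small sketch. It is not the authors' code: the function names, the choice N = 6, the number of examples, and the random seed are all assumptions made for illustration, and blocks are counted as 0-to-1 edges under periodic boundary conditions (so the all-ones pattern, which has no edge, counts as zero blocks here).

```python
from itertools import product
import math
import random


def count_blocks(x):
    """Count blocks of 1s in a cyclic bit pattern as the number of 0 -> 1 edges.

    x[i - 1] with i = 0 refers to the last bit, which implements the
    periodic boundary condition. Assumption: the all-ones pattern has no
    edge and therefore counts as 0 blocks.
    """
    return sum(1 for i in range(len(x)) if x[i] == 1 and x[i - 1] == 0)


def step(u):
    """Thresholding unit: output 1 for positive input, 0 for negative input."""
    return 1 if u > 0 else 0


def network_output(x, w_self, w_left):
    """Two-layer network with receptive fields of size 2.

    Hidden biases (-0.5), hidden-to-output weights (+1) and output bias
    (-2.5) are fixed at the values quoted in the text for k0 = 2.
    """
    hidden = [step(w_self[i] * x[i] + w_left[i] * x[i - 1] - 0.5)
              for i in range(len(x))]
    return step(sum(hidden) - 2.5)


def exhaustive_learning(n, m, seed=0):
    """List all 2^(2n) networks, then eliminate those misclassifying each example."""
    rng = random.Random(seed)
    networks = [(w[:n], w[n:]) for w in product((-1, 1), repeat=2 * n)]
    entropy = [math.log(len(networks))]
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        y = 1 if count_blocks(x) > 2 else 0          # dichotomy k <= 2 vs k > 2
        networks = [net for net in networks if network_output(x, *net) == y]
        entropy.append(math.log(len(networks)))      # S_m = log(number of survivors)
    return networks, entropy


survivors, entropy = exhaustive_learning(n=6, m=20)
```

The left-edge and right-edge solutions classify every example correctly, so they are never eliminated and the list of survivors is never empty.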
Values of N larger than 11 exceed reasonable requirements in computer time and memory, even on a 64-Mbyte machine capable of $5 \times 10^7$ connections/sec.

Curves for the entropy $S_m$ and the prediction error $E_m = 1 - G_m$ as a function of the size m of the training set are shown in Figure 1 (for N = 9) and Figure 2 (for N = 9 and 11), respectively. The curves are averages over 1000 separate runs, the runs being distinguished by different sequences of training examples. The prior distribution of generalization abilities $p_0(g)$ is computed by testing all networks in the initial list on a randomly chosen set of 300 examples, large enough to obtain the intrinsic generalization ability of each network with a precision of at least 6%. The accumulated histograms are shown in Figure 3 (for N = 9 and 11). The dependence of the average generalization ability $G_m$ on the number m of training examples can be predicted from $p_0(g)$ according to equation 2.18. The predicted curve for N = 11 is shown in Figure 4, and compared to the curve computed through direct measurement of the average generalization ability. Discrepancies are to be expected, since uncertainties in the estimation of $p_0(g)$ affect the prediction of $G_m$. Lack of accuracy in the determination of g for the individual networks results in a systematic broadening of $p_0(g)$ and overestimation of the prediction error $E_m = 1 - G_m$. A more detailed analysis of such effects will be reported in a subsequent paper (Samalam and Schwartz 1989).

Figure 1: Numerical estimate of the ensemble entropy $S_m$ as a function of the size m of the training set for the contiguity problem, N = 9. The entropy is computed as the logarithm of the number of surviving networks. (Plot not reproduced; horizontal axis: number of examples.)

Figure 2: Numerical evaluation of the prediction error $E_m$ as a function of the size m of the training set for the contiguity problem, N = 9 and 11. (Plot not reproduced; horizontal axis: number of examples.)

Figure 3: Initial distribution $p_0(g)$ for the generalization ability of the chosen network architecture to solve the contiguity problem, N = 9 and 11. (Plot not reproduced; horizontal axis: g, from 0.0 to 1.0.)

4 Discussion of Results
Results for the ensemble entropy $S_m$ and the generalization ability $G_m$ shown in Figures 1 and 2 as a function of the size m of the training set confirm that supervised learning results in a monotonic decrease of the ensemble entropy and the prediction error. The rate of entropy decrease $\eta_m = S_{m-1} - S_m$ measures the information content of the mth training example. The continuous decrease in the slope of the entropy in Figure 1 indicates that the effective information content of each new example is a decreasing function of m. The early stages of learning rapidly eliminate networks implementing functions with a very low intrinsic generalization ability $g(f)$. Such functions can be eliminated with a small number of examples, and learning is very efficient. As learning proceeds, the surviving functions are characterized by $g(f)$ close to one. Such functions require a large number of examples, of order $(1 - g)^{-1}$, to be eliminated. The decrease in learning efficiency is intimately tied to the decrease in prediction error: an additional example carries new information and results in further entropy reduction to the extent to which it is unpredictable (Tishby et al. 1989).

Figure 4: Prediction error $E_m = 1 - G_m$ as a function of the size m of the training set for the contiguity problem, N = 11. The numerical (measured) result of Figure 2 is compared to the prediction resulting from applying the recursion relation equation 2.18 to the initial histogram of Figure 3. (Plot not reproduced; horizontal axis: number of examples.)

The monotonic decrease of the prediction error $E_m$ with m shown in Figure 2 is characterized by an exponential tail for sufficiently large m. Such an exponential tail has also been observed in learning experiments on one-layer (L = 1) networks using gradient descent (Ahmad and Tesauro 1989). The theoretical formalism presented here predicts such exponential decay for the learning of any Boolean function. Consider the case of Boolean functions from N inputs onto 1 output. There are $2^N$ possible inputs, and the intrinsic generalization ability can only be of the form $g_r = r/2^N$, with r an integer in the range $0 \le r \le 2^N$. Then
$$p_0(g) = \sum_{r=0}^{2^N} p_r \, \delta(g - g_r) \qquad (4.1)$$

where $p_r$ is the probability of $g = g_r$. The average generalization ability of equation 2.18 is easily computed for a density of the form equation 4.1:

$$G_m = \frac{\sum_r p_r \, g_r^{m+1}}{\sum_r p_r \, g_r^{m}} \qquad (4.2)$$

and it is dominated at large m by the two largest values of r for which $p_r \ne 0$. If g = 1 is attainable with probability p and the next highest value $g = 1 - \epsilon$ is attainable with probability q, then for large m

$$E_m = 1 - G_m \approx \frac{q}{p} \, \epsilon \, (1 - \epsilon)^m ,$$

indicating an exponential decay of the form

$$E_m \approx \frac{q \, \epsilon}{p} \, e^{-m/m_0} \qquad (4.3)$$

with $m_0^{-1} = -\ln(1 - \epsilon)$. The parameter $m_0$ controlling the rate of exponential decay is inversely proportional to the gap $\epsilon$ between g = 1 and $g = 1 - \epsilon$. If $\epsilon \to 0$ the exponential decay is replaced by a power law in m (equation 4.4). Such asymptotic form follows from the moment ratio equation 2.18 for $G_m$ whenever $p_0(g)$ behaves as a power of $(1 - g)$ as $g \to 1$ (Tishby et al. 1989).

As a simple example of the continuous case, consider learning to separate points in $R^N$ with a plane through the origin using an L = 1 network. Restricting the weights to the unit sphere results in an initial distribution of the form

$$p_0(g) \sim \sin^{N-2}(\pi g) \qquad (4.5)$$

as follows from the Jacobian of a spherical coordinate system in N dimensions. The average generalization ability of equation 2.18 then follows a power law (equation 4.6), with $m_0$ controlled by the dimension N of the input.

It is intuitively obvious that the outcome of supervised learning is hard to predict, in that the dependence of the generalization ability of a trained network on the number of training examples is determined by both the problem and the architecture. The fundamental result of this paper is to demonstrate that knowledge of the initial distribution $p_0(g)$ suffices to predict network performance (equation 2.18). The specific details of the chosen architecture and the desired map $y = f(x)$ matter only to the extent that they influence and determine $p_0(g)$. The asymptotic form of learning curves $E_m$ vs. m is controlled by the properties of $p_0(g)$ close to g = 1: the existence of a gap results in exponential decay, while the continuous case leads to power-law decay.

The approach is based on analyzing the statistical properties of an ensemble of networks (Gardner 1988) at fixed architecture. In contrast to more general analyses based on the VC dimension of the network (Baum and Haussler 1989; Devroye 1988), which produce bounds on the prediction error, the performance of the ensemble is evaluated here in reference to a specific task. The question being asked is not how difficult it is to train a given network architecture in general, but how difficult it is to train it for the specific task of interest. It is in this ability to yield specific predictions that resides the potential power of the method.
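The exponential tail discussed above can be checked numerically. The sketch below assumes that equation 2.18 is the moment ratio $G_m = \sum_r p_r g_r^{m+1} / \sum_r p_r g_r^m$ for a discrete prior; the three-valued prior used here is purely hypothetical and is not taken from the experiments.

```python
import math


def average_generalization(p0, m):
    """Moment ratio G_m for a discrete prior given as {g_r: p_r}."""
    num = sum(p * g ** (m + 1) for g, p in p0.items())
    den = sum(p * g ** m for g, p in p0.items())
    return num / den


def prediction_error(p0, m):
    """E_m = 1 - G_m, written as a single ratio to avoid cancellation near G_m = 1."""
    num = sum(p * g ** m * (1.0 - g) for g, p in p0.items())
    den = sum(p * g ** m for g, p in p0.items())
    return num / den


# Hypothetical prior: g = 1 with probability p, a gap eps below it with probability q.
p, q, eps = 0.01, 0.19, 0.1
p0 = {1.0: p, 1.0 - eps: q, 0.5: 0.80}

m0 = -1.0 / math.log(1.0 - eps)              # decay constant of the exponential tail


def asymptotic_error(m):
    """Large-m form (q * eps / p) * exp(-m / m0)."""
    return (q * eps / p) * math.exp(-m / m0)


for m in (10, 50, 100):
    print(m, prediction_error(p0, m), asymptotic_error(m))
```

For small m the two columns differ because all values of g contribute; at large m only the two largest values of g matter and the curves approach each other.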
Acknowledgments

One of the authors (D. B. S.) would like to acknowledge helpful discussions with R. L. Rivest, S. Judd, and O. Selfridge, as well as the support of AT&T Bell Laboratories where this work was begun.
References

Ahmad, S., and Tesauro, G. 1989. Scaling and generalization in neural networks: A case study. In Advances in Neural Network Information Processing Systems I, D. S. Touretzky, ed., pp. 160-168. Morgan Kaufmann, San Mateo.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Carnevali, P., and Patarnello, S. 1987. Exhaustive thermodynamic analysis of Boolean learning networks. Europhys. Lett. 4, 1199-1204.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Automatic learning, rule extraction, and generalization. Complex Syst. 1, 877-922.
Devroye, L. 1988. Automatic pattern recognition, a study of the probability of error. IEEE Trans. PAMI 10, 530-543.
Gardner, E. 1988. The space of interactions of neural network models. J. Phys. A 21, 257-270.
Samalam, V. K., and Schwartz, D. B. 1989. A study of learning and generalization by exhaustive analysis. GTE Laboratories Tech. Rep. TM-0224-12-89401.
Solla, S. A. 1989. Learning and generalization in layered neural networks: The contiguity problem. In Neural Networks: From Models to Applications, L. Personnaz and G. Dreyfus, eds., pp. 168-177. I.D.S.E.T., Paris.
Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, Vol. II, 403-409. IEEE, New York.

Received 7 September 1989; accepted 2 May 1990.
Communicated by Richard Lippmann
A Method for Designing Neural Networks Using Nonlinear Multivariate Analysis: Application to Speaker-Independent Vowel Recognition

Toshio Irino
Hideki Kawahara
NTT Basic Research Laboratories, 3-9-11 Midori-cho, Musashino-shi, Tokyo 180, Japan
A nonlinear multiple logistic model and multiple regression analysis are described as a method for determining the weights for two-layer networks and are compared to error backpropagation. We also provide a method for constructing a three-layer network whose semilinear middle units are primarily provided to discriminate two categories. Experimental results on speaker-independent vowel recognition show that both multivariate methods provide stable weights with fewer iterations than backpropagation training started with random initial weights, but with slightly inferior performance. Backpropagation training with initial weights determined by a multiple logistic model after introduction of data distribution information gives a recognition rate of 98.2%, which is significantly better than average backpropagation with random initial weights.
Neural Computation 2, 386-397 (1990) © 1990 Massachusetts Institute of Technology

1 Introduction

Error backpropagation (BP) (Rumelhart and McClelland 1986) is widely used as a training algorithm for neural networks. The BP algorithm is a steepest descent algorithm, where the weights change according to the local gradient of the weight space. As a result, BP with random initial weights usually requires considerable computation time for convergence and the resulting solution may be only a local minimum. It may therefore be necessary to set good initial conditions to prevent poor solutions due to local minima. These initial conditions can be based on the input data distribution.

The linear perceptron has been shown to correspond to either linear multiple regression analysis or discriminant analysis (Gallinari et al. 1988; Rumelhart and McClelland 1986). This implies that BP networks with nonlinear units are related to nonlinear multivariate analysis (Asoh and Otsu 1989), and that one way to introduce information about the input data distribution into the network is to use categorical discrimination. It has been reported that initial weights between the input and middle layers of BP networks can be determined by linear multiple regression analysis (MRA) (Rao 1973; Yanai and Takagi 1986) if the middle and output units are in one-to-one correspondence (Ri and Itakura 1988). However, in that study, the weights between the middle and output layers were determined ad hoc without using MRA. Network performance was improved by the BP algorithm, but did not exceed the results for BP networks with random initial weights, BP(random). This may indicate that linear MRA cannot sufficiently approximate nonlinear BP networks.

This paper presents a method for designing artificial neural networks using a multiple logistic model (MLM) (Yanai and Takagi 1986), or logistic regression (Anderson 1982), which is a kind of nonlinear multivariate analysis (Irino and Kawahara 1989). The nonlinearity used in this MLM is the sigmoid or logistic function used in conventional BP network units. This method can be used in a systematic procedure for introducing distribution information on input data into neural networks.

2 Multivariate Analysis
2.1 Multiple Logistic Model. The multiple logistic model (MLM) is used as a discriminant method to separate data into two categories. It is obtained by logit transformation (Ashton 1972) as described below and is easy to treat mathematically.
2.1.1 Multiple Logistic Function. If the probability $P_r(E)$ of an event E varies with the explanatory variable vector $x = (1, x_1, x_2, \ldots, x_p)^T$ and the logarithmic value of the ratio of the conditional probability $P_r(E \mid x)$ and $1 - P_r(E \mid x)$ can be explained as a linear combination of x and parameter w, then the relationship is

$$\log \frac{P_r(E \mid x)}{1 - P_r(E \mid x)} = w^T x \qquad (2.1)$$

Solving this equation for $P_r(E \mid x)$, a multiple logistic function is obtained:

$$P_r(E \mid x) = \frac{1}{1 + \exp(-w^T x)} \qquad (2.2)$$
This equation is exactly the same as the output function of a unit in conventional BP networks. Figure 1a shows the correspondence between a multiple logistic model and a unit used in a typical BP network. Parameters w of a multiple logistic model (MLM) can be obtained by the maximum likelihood method. Therefore, we can use this method to determine the weights of networks, which we expect will approximate BP networks.
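The correspondence between the logit transformation and the unit output function can be made concrete with a short sketch. The function names and the numerical values of w and x are illustrative assumptions; the first component of x is the constant 1 that carries the bias.

```python
import math


def logit(p):
    """Logit transformation: log of the ratio p / (1 - p)."""
    return math.log(p / (1.0 - p))


def multiple_logistic(w, x):
    """Multiple logistic function: 1 / (1 + exp(-w.x)), the output of a sigmoid unit."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-u))


# Hypothetical parameters and input; x[0] = 1 is the constant term.
w = [0.5, -1.0, 2.0]
x = [1.0, 0.3, 0.8]

p = multiple_logistic(w, x)
# Applying the logit to the unit output recovers the linear combination w.x.
linear_part = logit(p)
```

Because the logit is the inverse of the logistic function, a target of 0.9 corresponds to a net input of log(0.9 / 0.1), roughly 2.2, and a target of 0.1 to roughly -2.2.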
Figure 1: (a) Correspondence between a multiple logistic model and a neural unit. (b) Network without middle layer units. Weights are determined by multiple logistic model (MLM), multiple regression analysis (MRA), and backpropagation (BP). (c) Three-layer neural network. Middle layer targets are set to separate two proper sets of output categories only when using MLM or MRA. (Diagrams not reproduced; the bottom of each network is labeled "Input information.")
2.1.2 Likelihood Function. Let the kth input vector be $x_k$, and the parameter vector be w. Assume the kth output value of a unit, $p_k$, is the conditional probability $P_r(E \mid x_k)$. Let the value of $y_k$ be 1 when event E occurs, and 0 when event E does not occur. Then the logarithmic likelihood $\log L(w)$ is defined as

$$\log L(w) = \log \prod_{k=1}^{M} p_k^{y_k} (1 - p_k)^{1 - y_k} = \sum_{k=1}^{M} \{ y_k \log p_k + (1 - y_k) \log(1 - p_k) \} \qquad (2.3)$$

where M is the number of input samples. The maximum likelihood estimator of w that maximizes equation 2.3 was solved by the Newton-Raphson method to accelerate convergence over the steepest descent. In this method, w at iteration t + 1 is determined from w at the previous iteration t by

$$w(t+1) = w(t) - \Delta w(t), \qquad \Delta w(t) = \left[ \frac{\partial^2 \log L(w)}{\partial w^2} \right]^{-1} \frac{\partial \log L(w)}{\partial w} \Bigg|_{w = w(t)} \qquad (2.4)$$

Here, the first and second derivatives of $\log L(w)$ are

$$\frac{\partial \log L(w)}{\partial w} = \sum_{k=1}^{M} (y_k - p_k) \, x_k \qquad (2.5)$$

and

$$\frac{\partial^2 \log L(w)}{\partial w^2} = -\sum_{k=1}^{M} p_k (1 - p_k) \, x_k x_k^T \qquad (2.6)$$

The criterion for convergence is a very small ratio of weight update, $|\Delta w(t) / w(t)|$. The value of $y_k$ can be interpreted as the target value of a unit, so it can have an arbitrary value between 0 and 1 despite the above definition. In the following experiments, $y_k$ is 0.9 for an activated target and 0.1 for a deactivated target, as in the usual BP procedure. The logarithmic likelihood function corresponds to information entropy or mutual information (Baum and Wilczek 1987; Bourlard and Wellekens 1988; Solla et al. 1988) and can also be used as an error measure for the BP algorithm.

2.2 Multiple Regression Analysis. Multiple regression analysis (MRA) can also be used to determine the weights between input and output layers in a network with no hidden units. To make this linear multivariate analysis applicable to nonlinear units, the kth target value for a unit ($y_k$) is converted into an "input" target value using an inverse sigmoid function, $f^{-1}(y_k)$. The weight vector w is given as a solution of the normal equation (Rao 1973)

$$(X^T X) \, w = X^T z \qquad (2.7)$$

where $X = (x_1, x_2, \ldots, x_M)^T$ and $z = [f^{-1}(y_1), f^{-1}(y_2), \ldots, f^{-1}(y_M)]^T$. If $X^T X$ is regular, then $w = (X^T X)^{-1} X^T z$; otherwise $w = (X^T X)^{-} X^T z$, using a generalized inverse matrix $(X^T X)^{-}$.

3 Experiment
3.1 Input Data. The isolated speech sounds of the five Japanese vowels spoken by 40 male and 40 female speakers (from 14 to 61 years old) were divided into two sets of equal size, one for training and one for the open test. To calculate the inputs for the neural networks, the speech data were analyzed by a fluid dynamics cochlear model (Irino and Kawahara 1988) which calculates the basilar membrane displacements of 70 sections along the membrane. The log-scaled amplitudes of the displacements averaged over 30 msec were calculated. Then, a cosine expansion along the membrane was performed to obtain the lowest 20 spatial frequency components of the basilar membrane displacement pattern. Thus, the coefficients correspond to the "mel-cepstrum," or the cepstrum coefficients of the mel-scaled frequency spectrum. The coefficients were calculated at five positions randomly extracted from the speech of a single speaker. Therefore, both training and open sets have 1000 samples [= 5 vowels × (20 male + 20 female) × 5 positions].

3.2 Two-Layer Networks. The weights of two-layer networks were determined by MRA, MLM, and BP with two different convergence criteria. These methods are described below. Figure 1b shows a network with 20 input and 5 output units, corresponding to the 20 input coefficients and the five vowel categories. The number of iterations, recognition rates, and average squared error between output activation values and the target values for these methods are listed in Table 1.
3.2.1 Multiple Regression Analysis (MRA). The weight vector is determined uniquely by MRA using equation 2.7 with no iteration. The recognition rates and the average squared errors were the worst of all. However, computational time was the shortest in this experiment.

| Method | Iterations | Recognition rate (%), closed | Recognition rate (%), open | Squared error, closed | Squared error, open |
|---|---|---|---|---|---|
| MRA | 5* | 95.6 | 93.9 | 0.0210 | 0.0236 |
| MLM | 29* | 95.9 | 94.3 | 0.0174 | 0.0209 |
| BP_c (η = 0.25), maximum | 2000 | 99.1 | 95.3 | 0.0121 | 0.0186 |
| BP_c (η = 0.25), minimum | 936 | 98.4 | 95.0 | 0.0115 | 0.0181 |
| BP_o (η = 0.25), maximum | 611† | 97.9 | 95.2 | 0.0129 | 0.0181 |
| BP_o (η = 0.25), minimum | 487† | 97.4 | 94.9 | 0.0126 | 0.0179 |
| BP_c (η = 5.0), maximum | 1618 | 99.1 | 94.2 | 0.0111 | 0.0265 |
| BP_c (η = 5.0), minimum | 921 | 99.1 | 94.1 | 0.0110 | 0.0250 |
| BP_o (η = 5.0), maximum | 261† | 98.3 | 94.4 | 0.0160 | 0.0200 |
| BP_o (η = 5.0), minimum | 13† | 97.9 | 94.1 | 0.0152 | 0.0194 |

Table 1: Experimental Results for a Five-Vowel Recognition Task for Various Methods of Determining Weights in a Two-Layer Network. The maximum and minimum values are for 10 BP trials that started from 10 different sets of initial weights. Asterisks (*) show the number of matrix calculations needed to determine the weights. These numbers differ in meaning from the BP iteration counts. The average time of the matrix calculation per count was almost double that of one BP iteration in these experiments. Daggers (†) show the number of BP iterations at the minimum squared error point for the open test. Several or some tens of additional iterations are required to find it.
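As a minimal sketch of how MRA yields weights without iteration, the following solves the normal equation $(X^T X) w = X^T z$ for a single unit with one input plus a constant term, after converting the targets with the inverse sigmoid. The one-dimensional setting, the function names, and the data are assumptions made for illustration; the singular case, which the text handles with a generalized inverse, is not covered here.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def inverse_sigmoid(y):
    """Convert a unit target y in (0, 1) into an "input" target f^{-1}(y)."""
    return math.log(y / (1.0 - y))


def mra_weights(xs, ys):
    """Solve the 2x2 normal equation for input rows x_k = (1, x)."""
    zs = [inverse_sigmoid(y) for y in ys]
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sz = sum(zs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    det = n * sxx - sx * sx            # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; a generalized inverse would be needed")
    w0 = (sxx * sz - sx * sxz) / det
    w1 = (n * sxz - sx * sz) / det
    return w0, w1


# Hypothetical data generated by a unit with bias 0.3 and weight 1.5.
xs = [-1.0, 0.0, 0.5, 2.0]
ys = [sigmoid(0.3 + 1.5 * x) for x in xs]
w0, w1 = mra_weights(xs, ys)
```

When the targets are exactly realizable by a single sigmoid unit, as in this toy data, the regression recovers the generating weights; with targets of 0.9 and 0.1 it instead returns the least-squares fit in the transformed target space.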
3.2.2 Multiple Logistic Model (MLM). This method applies the Newton-Raphson method, using a likelihood estimator. It always gave the same weights after five or six iterations per unit, starting from different small initial random weights. The negative of the second derivative of the log likelihood in equation 2.6, $-\partial^2 \log L(w) / \partial w^2$, is always a positive-semidefinite matrix regardless of the input vectors x. Therefore, the Newton-Raphson method with likelihood estimator is efficient in the determination of weights between two layers. The results were slightly better than MRA, but worse than BP. The computational time was much less than BP using the same convergence criterion.

3.2.3 Newton-Raphson Method with Squared Error Estimator. This method produced no useful results because the weight values always diverged. The second derivative of the squared error function $\mathrm{Err}(w)$ $[= \frac{1}{2} \sum_k (y_k - p_k)^2]$ is

$$\frac{\partial^2 \mathrm{Err}(w)}{\partial w^2} = -\sum_{k=1}^{M} \{ 3 p_k^2 - 2(1 + y_k) p_k + y_k \} \, p_k (1 - p_k) \, x_k x_k^T \qquad (3.1)$$

which is more complex than equation 2.6 and not always semidefinite. Therefore, the least mean square estimation is less stable than the maximum likelihood estimation for determining weights between two layers by the Newton-Raphson method.
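The Newton-Raphson iteration with the likelihood estimator can be sketched for a single unit with one input and a constant term. This is an illustrative reduction, not the authors' implementation: the gradient and second derivative used are the standard ones for the log likelihood of a logistic unit, the 2x2 system is solved in closed form, and the data, the tolerance, and the use of a vector norm for the weight-update ratio are assumptions.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def newton_raphson_mlm(xs, ys, tol=1e-8, max_iter=50):
    """Maximize sum_k [y log p + (1 - y) log(1 - p)] with p = sigmoid(w0 + w1 * x)."""
    w0, w1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0                  # first derivative: sum (y - p) x
        h00 = h01 = h11 = 0.0          # second derivative: -sum p (1 - p) x x^T
        for x, y in zip(xs, ys):
            p = sigmoid(w0 + w1 * x)
            g0 += y - p
            g1 += (y - p) * x
            s = p * (1.0 - p)
            h00 -= s
            h01 -= s * x
            h11 -= s * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: delta = H^{-1} g, then w <- w - delta
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        w0 -= d0
        w1 -= d1
        # Convergence: very small ratio of weight update to weight size
        if math.hypot(d0, d1) <= tol * max(math.hypot(w0, w1), 1e-12):
            return (w0, w1), iteration
    return (w0, w1), max_iter


# Hypothetical two-category data with targets 0.9 (activated) and 0.1 (deactivated).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
(w0, w1), n_iter = newton_raphson_mlm(xs, ys)
```

Because the targets lie strictly between 0 and 1, the log likelihood has a finite maximum even though the two categories are linearly separable, and the negative second derivative stays positive definite for this data, so each step is well defined.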
3.2.4 Backpropagation Algorithm. Networks without hidden units were trained by error backpropagation (BP), or generalized delta rule, from random initial weights. Two convergence criteria can be applied in this method. One is to watch the ratio of weight updates as used in MLM. This is equivalent to checking the change of squared error for the training set (BP_c in Table 1). The other criterion is to find the point of the minimum squared error for the open test at every iteration (BP_o). Table 1 shows the results for two different values of the proportional coefficient of weight update η (Rumelhart and McClelland 1986). The iterations were limited to 2000. The results changed gradually between the two η values and did not improve for η > 5.0. When η is small, the recognition rates and squared errors for the open test of both BP_c and BP_o are better than those of MRA and MLM, but require much more computation. When η is large, the rates of both BP_c and BP_o are almost the same as for MLM. The numbers of iterations of BP_o and MLM are similar if several or some tens of additional iterations to find the minimum point are taken into account. Therefore, error backpropagation gives better results than MLM only when η is small; however, it takes more iteration steps.
3.3 Designing a Three-Layer Network
3.3.1 Combinatorial Internal Representation. Let us consider construction of three-layer networks that include one middle layer without backpropagation. Three-layer networks can discriminate categories with complex boundaries by combining the semilinear discrimination functions performed by single units. When we do not have a priori knowledge, the easiest way to put as many hyperplanes as possible into the space is to make the middle layer discriminate all possible combinations that separate all the elements of the output categories into two proper subsets, that is, to make combinatorial internal representations. The integration of these semilinear discrimination functions is done between the middle and output layers. Weights between two layers can be determined by any of the methods shown in Table 1. However, BP does not give the same weights each time and takes many more iterations to achieve better results than MLM. Therefore, MRA and MLM, which are stable multivariate analyses, were used in the following procedure to reduce computation, even if at the cost of performance.

Figure 1c shows a three-layer neural network with 15 middle units designed by MRA or MLM. More generally, the number of combinations, or middle units, is $2^{N-1} - 1$, where N is the number of output categories. Middle layer units 1 to 5 are activated when parameters for a vowel (/a/, /i/, /u/, /e/, or /o/) are fed into the input layer. This means that the kth target value for middle unit 1 (/a/) is "activated" ($y_k = 0.9$) when the kth input vowel is /a/ and "deactivated" ($y_k = 0.1$) otherwise. Middle layer units 6 to 15 correspond to combinations of vowels (/a,i/, /a,u/, ..., /u,o/, /e,o/). For example, unit 6 (/a,i/) is activated when vowel /a/ or /i/ is presented to the input layer and deactivated when /u/, /e/, or /o/ is presented. The weights between the input and middle layers are determined using MRA or MLM to activate each middle layer unit to satisfy these conditions. After the middle layer activation pattern for each set of input data is derived, the weights between the middle and output layers are determined by MRA or MLM.

3.3.2 Results. Table 2 shows results for networks with 15 middle units. MRA and MLM show networks with combinatorial internal representation designed using MRA and MLM as described above. BP shows networks with 15 hidden units trained by error backpropagation from 20 different sets of random initial weights. We will call them BP(random) networks for simplicity. All the BP training in this and the next section was performed under η = 0.25 and a squared error minimization convergence criterion for the open test. MRA+BP and MLM+BP show the results for error backpropagation applied after the weights were determined by MRA and MLM.
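The combinatorial internal representation of Section 3.3.1 can be enumerated mechanically. The sketch below lists every way of separating the output categories into two nonempty subsets and assigns middle-unit targets of 0.9 and 0.1; the function names are hypothetical, and which side of a split is treated as "activated" is a convention chosen here (the side containing the first category), which differs from the unit numbering used in the text but describes the same set of separations.

```python
from itertools import combinations


def two_way_splits(categories):
    """All unordered splits of the categories into two nonempty subsets."""
    items = list(categories)
    first, rest = items[0], items[1:]
    splits = []
    # Keeping the first category on one side avoids counting each split twice;
    # sizes stop short of len(rest) so the other side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group = (first,) + extra
            other = tuple(c for c in items if c not in group)
            splits.append((group, other))
    return splits


def middle_targets(splits, category, on=0.9, off=0.1):
    """Target value of each middle unit when an input of the given category is presented."""
    return [on if category in group else off for group, _ in splits]


vowels = ["a", "i", "u", "e", "o"]
splits = two_way_splits(vowels)
targets_for_a = middle_targets(splits, "a")
```

For five vowels this gives 15 middle units, and for ten categories (five male and five female vowels) it gives 511, matching the counts quoted in the text.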
| Method | Iterations | Recognition rate (%), closed | Recognition rate (%), open | Squared error, closed | Squared error, open |
|---|---|---|---|---|---|
| MRA | 20* | 97.2 | 95.2 | 0.0095 | 0.0140 |
| MLM | 115* | 98.6 | 96.1 | 0.0068 | 0.0122 |
| MRA+BP | 301 | 100.0 | 97.1 | 0.0027 | 0.0090 |
| MLM+BP | 268 | 100.0 | 97.5 | 0.0027 | 0.0092 |
| BP, average | 1420 | 99.95 | 97.34 | 0.0024 | 0.0093 |
| BP, SD | 570 | - | 0.46 | 0.00065 | 0.00065 |
| BP, maximum | 2000 | 100.0 | 98.0 | 0.0041 | 0.0110 |
| BP, minimum | 867 | 99.7 | 96.5 | 0.0015 | 0.0077 |

Table 2: Experimental Results for a Five-Vowel Recognition Task for Various Methods of Determining Weights in a Three-Layer Network. MRA and MLM represent networks having 15 middle units with combinatorial internal representation designed using MRA and MLM. MRA+BP and MLM+BP represent networks to which error backpropagation was applied after weights were determined by MRA and MLM. BP shows networks with 15 hidden units trained by error backpropagation from 20 different sets of random initial weights. Asterisks (*) show the number of matrix calculations needed to determine the weights. The average time of the matrix calculation per count is almost the same as that of one BP iteration in these experiments.
The MLM network performs better than the MRA network. This may be because the nonlinear function of each unit is included in determining the weights by MLM. The MRA and MLM networks have lower recognition rates than the BP(random). However, the MRA and MLM networks require only 1/5 and 1/28 of the minimum number of iterations for BP(random). This indicates that MRA and MLM are good procedures for designing neural networks in a few computational steps without using BP. The recognition rate for the open test of MLM followed by BP (MLM+BP) is better than MRA followed by BP (MRA+BP) and is similar to the average rate for BP(random). The total iterations for MLM+BP (115 + 268) and MRA+BP (20 + 301) are less than half the minimum iterations for BP(random). Therefore, these combinations can effectively reduce the iterations required.

3.4 Introduction of Knowledge. Knowledge or information about data characteristics can be easily introduced into networks by MRA or MLM. It is inefficient to use $2^{(10-1)} - 1$ (= 511) middle units to accommodate possible combinations of five male and five female vowels. Relationship knowledge for the formant frequencies of male and female vowels is introduced to determine unit functions (Fig. 2). Five new units were added to the middle layer in this experiment. The function of each unit is to separate the complete set of male and female vowels into two groups, as illustrated by the thick solid lines in the figure. The functions of the middle units (MU) are

• MU1 - MU15 as shown in Figure 1c.
• MU16 discriminates male from female.
• MU17 discriminates male /a,o/ and female /o/ from others.
• MU18 discriminates female /i/ from others.
• MU19 discriminates male /i/ and female /i,e/ from others.
• MU20 discriminates female /a,i,e/ from others.

Figure 2: The relationship between the first and the second formant frequencies (F1 and F2) of Japanese vowels. The vertices of the solid and dashed polygons are located at average formant frequencies of male and female vowels. The thick solid lines represent the functions of the middle units (MU), which separate the five male and five female vowels into two groups. (Plot not reproduced; horizontal axis: F1 (Hz), 200 to 1000; vertical axis: F2 (Hz), 1000 to 3000.)
Table 3 shows the results for networks with the 20 middle layer units. The recognition rates for MRA and MLM networks are slightly higher than for the 15 middle unit networks in the open test. The rates for the MLM networks with and without BP training were always better than MRA. The MLM networks with BP training reached a recognition rate of 98.2%. Results of a t test between MLM+BP and BP(random) show that MLM+BP performance was significantly better than the average rate for BP(random). Just one or two of the 20 BP(random) networks performed better than MLM+BP in the open rate or squared error. Several trials of BP training from different random initial weights may be necessary to obtain the best network, thus the total number of iterations becomes much larger for BP training than for MLM+BP. Careful selection of pattern distribution knowledge may improve the performance of MLM+BP networks.

| Method | Iterations | Recognition rate (%), closed | Recognition rate (%), open | Squared error, closed | Squared error, open |
|---|---|---|---|---|---|
| MRA | 25* | 97.3 | 95.6 | 0.0091 | 0.0135 |
| MLM | 138* | 98.7 | 96.2 | 0.0065 | 0.0118 |
| MRA+BP | 343 | 100.0 | 97.6 | 0.0026 | 0.0082 |
| MLM+BP | 495 | 100.0 | 98.2 | 0.0022 | 0.0084 |
| BP, average | 1442 | 99.99 | 97.47 | 0.0020 | 0.0089 |
| BP, SD | 478 | - | 0.47 | 0.00057 | 0.00050 |
| BP, maximum | 2000 | 100.0 | 98.3 | 0.0038 | 0.0099 |
| BP, minimum | 432 | 99.7 | 96.7 | 0.0015 | 0.0079 |
| BP, number | 1 | - | 1 | 14 | 2 |
| t test, t(19) | 8.86 | - | -6.88 | -1.07 | 4.50 |
| t test, probability | 0.0001 | - | 0.0001 | 0.297 | 0.0002 |

Table 3: Experimental Results for a Five-Vowel Recognition Task for Various Methods of Determining Weights in a Three-Layer Network Having 20 Middle Units. Middle units have combinatorial internal representation and introduced knowledge representation when using MRA and MLM. MRA+BP, MLM+BP, and BP are the same as in the previous table. Number shows the number of these BP(random) networks that exceeded MLM+BP in the performance of each column. The t test rows show t values, t(19), and t test probabilities between average BP(random) and MLM+BP results.

4 Conclusion
Statistical network design methods using a multiple logistic model (MLM) and multiple regression analysis (MRA) stably gave weights with relatively few computational steps. However, they did not always yield better performance in determining weights between two layers than error backpropagation (BP) with random initial weights. These methods are easily applied to introduce input pattern information into networks. Our experiments showed that the weight values given by MLM with combinatorial internal representation and knowledge introduction effectively reduced computational iterations and provided good initial weights for error backpropagation training. The development of selection criteria for middle layer units that avoid a combinatorial explosion of all possible combinations of two proper subsets of the output categories requires further study.
Acknowledgments The authors wish to thank Kazuhiko Kakehi, Masaaki Honda, and Shin Suzuki for their valuable advice as well as the members of our laboratory for their discussions.
References
Anderson, J. A. 1982. Logistic discrimination. In Handbook of Statistics, P. R. Krishnaiah and L. N. Kanal, eds., Vol. 2, pp. 169-191. North-Holland, Amsterdam.
Asoh, H., and Otsu, N. 1989. Nonlinear data analysis and multilayer perceptrons. Proc. IJCNN 89, II-411-415.
Ashton, W. D. 1972. The Logit Transformation. Charles Griffin, London.
Baum, E. B., and Wilczek, F. 1987. Supervised learning of probability distributions by neural networks. Proc. IEEE Neural Information Processing Syst., pp. 52-61, American Institute of Physics.
Bourlard, H., and Wellekens, C. J. 1988. Links between Markov models and multilayer perceptrons. Manuscript M 263, Philips Research Lab., Brussels.
Gallinari, P., Thiria, S., and Soulie, F. F. 1988. Multilayer perceptrons and data analysis. Proc. IEEE ICNN 88, I-391-399.
Irino, T., and Kawahara, H. 1989. A method for designing neural networks using non-linear multivariate analysis - application to speaker-independent vowel recognition. Tech. Rep. Inst. Elec. Info. Comm. Eng. (IEICE), SP88-123 [in Japanese].
Irino, T., and Kawahara, H. 1988. Vowel-feature extraction from cochlear vibration using neural networks. Proc. Int. Neural Network Soc.
Rao, C. R. 1973. Linear Statistical Inference and Its Applications, 2nd ed., Wiley, New York.
Ri, J., and Itakura, F. 1988. Adaptation of multiple regression coefficients and recognition of Chinese four tones. Proc. Spring Meeting Jpn. Acoust. Soc., pp. 39-40 [in Japanese].
Rumelhart, D. E., and McClelland, J. L. (eds.) 1986. Parallel Distributed Processing, Chap. 8, pp. 457-458, Chap. 11, pp. 319-362. MIT Press, Cambridge, MA.
Solla, S. A., Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks. Complex Syst. 2, 625-640.
Yanai, H., and Takagi, H. (eds.) 1986. Handbook of Multivariate Analysis, pp. 160-162, 279. Gendai Sugaku Sha, Tokyo [in Japanese].
Received 16 October 1989; accepted 2 May 90.
NOTE
Communicated by Geoffrey Hinton
How to Solve the N Bit Encoder Problem with Just Two Hidden Units
Leonid Kruglyak
Department of Physics, University of California at Berkeley, Berkeley, CA 94720 USA

I demonstrate that it is possible to construct a three-layer network of the standard feedforward architecture that can solve the N-input N-output encoder problem with just two hidden units.
The encoder problem is a simple problem that has been used to illustrate backpropagation learning (Rumelhart et al. 1986). The network consists of N input units, N output units, and M hidden units. One input unit is turned on while all others are off; the task is to turn on the corresponding output unit. In typical backpropagation simulations, $\log_2 N$ hidden units are used to solve the problem. I will show that a network can be constructed that performs the task with only two hidden units, hence illustrating the limitations of backpropagation and setting a goal for other learning algorithms.¹

The network is of the standard feedforward architecture with graded response hidden units and linear threshold output units. The outputs of the two hidden units, each running from 0 to 1, form a unit square, the hidden unit space. Each output unit then divides this space into two sections by drawing a straight line across it. Points on one side are above its threshold and points on the other below. The task is to separate the unit square into regions such that only one output unit is on for hidden unit values within each region. If we can then restrict the values of the hidden units to a different region for every input pattern, the output units will be able to separate the input patterns.

The construction is illustrated in Figure 1. The case of eight input and output units is drawn, but the principle is the same for any number of units. H1 and H2 are the output values of the two hidden units. Each line represents an output unit; everything below a line is above threshold for that unit. It is clear that each of the small triangles formed by three consecutive lines lies below one and only one line. If each input pattern produces hidden unit values in a different triangle, which is easily arranged, the problem is solved. It should be clear that the construction of Figure 1 can be extended to any number of lines - the triangles will shrink but remain of finite area. I have verified this by explicitly constructing the required weights and thresholds for arbitrary N (L. Kruglyak, 1990, unpublished).

By showing that a three-layer feedforward network can solve the encoder problem with just two hidden units, I have provided, at least in a simple case, the answer to the common question of the minimum number of hidden units necessary to perform a task. Similar considerations may provide answers to this question for other problems. I have also given an example of a problem for which a network solution exists but most likely cannot be learned by backpropagation. This example underscores the limitations of backpropagation learning - unlike perceptron learning, not all possible solutions are reachable. It also provides a challenge for designers of learning algorithms - invent an algorithm that enables the N-2-N network with large N to learn the encoder problem.

Figure 1: Division of (H1, H2) space by eight output units. (Axes: H1 horizontal and H2 vertical, each from 0.0 to 1.0.)

¹It was recently brought to my attention that a modified version of backpropagation can solve the 8-2-8 encoder, but it is doubtful that it could solve the 1000-2-1000 version.

Neural Computation 2, 399-401 (1990) © 1990 Massachusetts Institute of Technology
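The existence claim can be checked numerically with a small sketch. The construction below is not the author's (whose weights are unpublished) but a hypothetical variant in the same spirit: the N hidden-unit targets are placed on a circle inside the unit square, and each linear threshold output unit draws a line that cuts off only its own point. The names `build_encoder`, `run_encoder`, and the `radius` parameter are assumptions made for illustration, and hidden-unit biases are taken to be zero.

```python
import math

def logit(p):
    """Inverse of the logistic function; gives the net input that yields output p."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Graded (logistic) response of a hidden unit."""
    return 1.0 / (1.0 + math.exp(-x))

def build_encoder(n, radius=0.4):
    """Weights for a hypothetical n-2-n encoder (n >= 2).

    Input k drives the hidden pair to the point on a circle of the given
    radius around (0.5, 0.5) at angle 2*pi*k/n. Output unit k fires when the
    hidden point lies beyond a line perpendicular to that direction.
    """
    delta = 2.0 * math.pi / n
    hidden_weights, output_units = [], []
    for k in range(n):
        a, b = math.cos(k * delta), math.sin(k * delta)
        # Input-to-hidden weights: chosen so the sigmoid reproduces the target point.
        hidden_weights.append((logit(0.5 + radius * a), logit(0.5 + radius * b)))
        # Threshold halfway between the projection of point k (radius) and
        # that of its nearest neighbours (radius * cos(delta)).
        threshold = 0.5 * (a + b) + 0.5 * radius * (1.0 + math.cos(delta))
        output_units.append((a, b, threshold))
    return hidden_weights, output_units

def run_encoder(k, hidden_weights, output_units):
    """Turn on input k only and return the list of binary output values."""
    w1, w2 = hidden_weights[k]
    h1, h2 = sigmoid(w1), sigmoid(w2)
    return [1 if a * h1 + b * h2 > t else 0 for a, b, t in output_units]
```

For example, `run_encoder(3, *build_encoder(8))` turns on output unit 3 only. The margin between a unit's own point and its neighbours shrinks roughly as 1/N², so for very large N the limit is numerical precision rather than the geometry.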
Acknowledgments I thank Terry Regier and Fred Rieke for helpful comments and the Fannie and John Hertz Foundation for fellowship support.
References
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, Chap. 8, p. 318. MIT Press, Cambridge, MA.
Received 27 August 90; accepted 19 September 90.
NOTE
Communicated by Eric Baum
Small Depth Polynomial Size Neural Networks
Zoran Obradovic*
Department of Computer Science, The Pennsylvania State University, University Park, PA 16802 USA
Peiyuan Yan
Mathematics Department, Lycoming College, Williamsport, PA 17701 USA
For polynomially bounded weights and polylogarithmic precision, analog neural networks of polynomial size and depth 3 are strictly more powerful than those of polynomial size and depth 2.
1 Introduction

For many applications of neural networks it is desirable to have an architecture of as few layers as possible. Recently, a number of results (for references see, for example, Hornik et al. 1989) concerning the representational power of analog neural networks with one or two hidden layers have been obtained. Unfortunately, some significant results concerning neural networks with a single hidden layer seem to require unreasonably large numbers of computing units. Here we impose polynomial bounds on both the number of layers and the size of a neural network with respect to the length of the input. For binary neural networks Hajnal et al. (1987) showed that depth-two networks (one hidden layer) must have exponential size to compute certain functions, and as a consequence showed that binary networks of polynomial size and depth three (two hidden layers) are strictly more powerful than those of polynomial size and depth two. We discuss a similar problem for limited precision analog neural networks. More precisely, we will show that polynomially bounded k-ary neural networks of depth three are more powerful than those of depth two. An earlier work (Obradovic and Parberry 1990a,b) showed that polynomially bounded k-ary neural networks are equivalent to limited precision analog networks.
*On leave from the Mathematical Institute, Belgrade, Yugoslavia.
Neural Computation 2, 402-404 (1990) © 1990 Massachusetts Institute of Technology
2 The Model and the Result
Here neural networks are weighted directed acyclic graphs of processors. The depth of the neural network is the number of layers, and the size is the number of processors in the network. We will not count the input processors in either the depth or the size. Current literature mostly discusses binary and analog neural networks. In both models, processors compute functions of the form $f(w_1, \ldots, w_n): \mathbb{R}^n \to S$, where $S \subseteq \mathbb{R}$, $w_i \in \mathbb{R}$ for $1 \le i \le n$, and

$$f(w_1, \ldots, w_n)(x_1, \ldots, x_n) = g\left(\sum_{i=1}^{n} w_i x_i\right)$$

for some output function $g: \mathbb{R} \to S$. In the binary model, $S = \{0, 1\}$ and $g$ is a linear threshold function, defined by $g(x) = 1$ iff $x \ge t$. In the analog model, $S = [0, 1]$ and $g$ is a continuous nondecreasing function. For practical purposes the restriction of limited precision on analog neural networks appears to be a reasonable assumption (provided that the precision is not too small). It was shown earlier (Obradovic and Parberry 1990a) that an analog neural network of limited precision can be represented as a k-ary neural network. In such networks processors compute functions of the form $f(w_1, \ldots, w_n): Z_k^n \to Z_k$ (where $Z_k = \{0, 1, \ldots, k-1\}$) with $w_i \in Z$ for $1 \le i \le n$, and

$$f(w_1, \ldots, w_n)(x_1, \ldots, x_n) = g(t_1, \ldots, t_{k-1})\left(\sum_{i=1}^{n} w_i x_i\right)$$

for $g(t_1, t_2, \ldots, t_{k-1}): \mathbb{R} \to Z_k$ being defined as $g(x) = i$ iff $t_i \le x < t_{i+1}$, where $t_i \in \mathbb{R}$ are monotone increasing and $t_0 = -\infty$, $t_k = +\infty$. We define $NN_d^k$ as the class of languages $L \subseteq Z_k^n$ that can be computed by depth $d$, k-ary neural networks with $k$, size, and weights polynomially bounded in the input length. The result of this note can be expressed as follows.

Theorem 1. $NN_3^k$ strictly includes $NN_2^k$.

This theorem is a generalization of the result of Hajnal et al. (1987) for $k > 2$. The proof of the theorem is omitted here and is given in Obradovic and Yan (1990). Analyzing k-ary neural networks, we are actually reasoning about the behavior of analog neural networks of precision limited to $O(\log k)$ bits. So, the result can be expressed as the separation of polynomial size analog neural networks of depth 3 from those of depth 2 for polynomially bounded weights and polylogarithmic precision. To conclude, we wish to observe that it is an interesting open problem whether the same is true for all depths $i + 1$ and $i$, or whether the hierarchy collapses at depth 3.
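To make the k-ary processor definition concrete, here is a minimal sketch of a single processor: an integer-weighted sum passed through a multilevel threshold function with k - 1 finite thresholds. The function name `kary_processor` and its argument layout are assumptions for illustration; the note itself gives only the mathematical definition.

```python
from bisect import bisect_right

def kary_processor(weights, thresholds):
    """Build a k-ary processor with k = len(thresholds) + 1 output levels.

    `thresholds` holds t_1 < t_2 < ... < t_{k-1}; t_0 = -inf and t_k = +inf
    are implicit. The processor returns i iff t_i <= sum(w*x) < t_{i+1}.
    """
    k = len(thresholds) + 1
    if any(lo >= hi for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")

    def f(inputs):
        if any(not 0 <= x < k for x in inputs):
            raise ValueError("inputs must lie in {0, ..., k-1}")
        z = sum(w * x for w, x in zip(weights, inputs))
        # bisect_right counts the thresholds that are <= z, which is the level i.
        return bisect_right(thresholds, z)

    return f
```

With `thresholds=[2]` and `weights=[1, 1]` this reduces to a binary linear threshold unit computing AND; with two thresholds it becomes a ternary unit, which is the k = 3 case of the model.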
References
Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. 1987. Threshold circuits of bounded depth. In Proceedings IEEE 28th Annual Symposium on Foundations of Computer Science, 99-110.
Hornik, K. M., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Obradovic, Z., and Parberry, I. 1990a. Analog neural networks of limited precision I: Computing with multilinear threshold functions. In Advances in Neural Information Processing Systems II, D. S. Touretzky, ed., pp. 702-709. Morgan Kaufmann, San Mateo, CA.
Obradovic, Z., and Parberry, I. 1990b. Learning with discrete multi-valued neurons. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 392-399.
Obradovic, Z., and Yan, P. 1990. Lower bounds for limited precision analog neural networks. Tech. Rep. CS-90-28, Dept. of Computer Science, Pennsylvania State University.
Received 29 August 90; accepted 19 September 90.
NOTE
Communicated by Eric Baum
Guaranteed Learning Algorithm for Network with Units Having Periodic Threshold Output Function
Mark J. Brady
3M Software and Electronics Resource, St. Paul, MN 55144-2000 USA
1 Introduction
Although it has been shown that multiple layer networks can, potentially, approximate any continuous function (Carrol and Dickinson 1989; Cybenko 1989; Funahashi 1989), automated learning of an arbitrary training set has not been demonstrated. Backpropagation algorithms, for example, are prone to local minima problems. This paper will discuss a model for a network unit that can be trained to compute any discrete function $p: \mathbb{R}^n \to \{0, 1\}$. That is, given a set of $m$ training pairs $\{(\mathbf{v}_1, b_1), \ldots, (\mathbf{v}_m, b_m)\}$, the unit can be configured to output $b_i$ whenever $\mathbf{v}_i$ is the input, where the components of vector $\mathbf{v}_i$ are in $\mathbb{R}$ and $b_i$ is in $\{0, 1\}$.

2 The Unit Function as a Composite Function
$p$ is representable as a composite function

$$p(\mathbf{v}) = T\{\sin[(2\pi/\lambda) \cdot \sigma[a(\mathbf{v})]]\} = b$$

where $a(\,)$ is the usual linear activation function given by computing the inner product $\mathbf{w} \cdot \mathbf{v}$, $\mathbf{w}$ being a weight vector. $\sigma(\,)$ maps from $\mathbb{R}$ to $\mathbb{R}$ and will be explained later. For now it can be considered to be the identity. $\sin(\,)$ constitutes the periodic portion of the output function. Other periodic functions could be used here but $\sin(\,)$ will act as a convenient representative. $\lambda$ is the wavelength. $T(\,)$ is a threshold function defined as

$$T(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{else} \end{cases}$$

The reader may prefer to consider $T[\sin(\,)]$ to be a square wave function.

Neural Computation 2, 405-408 (1990) © 1990 Massachusetts Institute of Technology
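As a concrete reading of the composite unit function, the following sketch evaluates a thresholded sine of the scaled projection of the input onto the weight vector. The function name `periodic_unit` and the default identity for the redistribution function are assumptions for illustration.

```python
import math

def periodic_unit(v, w, wavelength, redistribute=lambda x: x):
    """Square-wave unit: 1 where the sine of the scaled projection is positive.

    v, w         -- input and weight vectors (equal-length sequences)
    wavelength   -- period of the sine along the projection axis
    redistribute -- the R -> R map applied to the projection (identity here)
    """
    projection = sum(wi * vi for wi, vi in zip(w, v))  # inner product w . v
    phase = 2.0 * math.pi / wavelength * redistribute(projection)
    return 1 if math.sin(phase) > 0 else 0
```

With `w = [1.0]` and `wavelength = 4.0`, the inputs 1.0 and 5.0 fall on maxima of the sine and give output 1, while 3.0 falls on a minimum and gives output 0.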
3 Selection of Weights
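It must be shown that a suitable $a(\,)$ can be chosen. This is equivalent to showing that we can choose a suitable weight vector $\mathbf{w}$. In this model, it will be shown that only one condition can disqualify $\mathbf{w}$ as a valid weight vector. That condition is as follows: if there exist two input vectors $\mathbf{v}_i$ and $\mathbf{v}_j$ such that $\mathbf{w} \cdot \mathbf{v}_i = \mathbf{w} \cdot \mathbf{v}_j$ yet $b_i \ne b_j$, then $\mathbf{w}$ is not an acceptable weight vector. Such a $\mathbf{w}$ is unacceptable because $\mathbf{w} \cdot \mathbf{v}_i = \mathbf{w} \cdot \mathbf{v}_j$ implies $a(\mathbf{v}_i) = a(\mathbf{v}_j)$, which implies $p(\mathbf{v}_i) = p(\mathbf{v}_j)$ or $b_i = b_j$. This contradicts $b_i \ne b_j$. To find an acceptable $\mathbf{w}$, let us ensure that whenever $b_i \ne b_j$, $\mathbf{w} \cdot \mathbf{v}_i \ne \mathbf{w} \cdot \mathbf{v}_j$. Notice that

$$\mathbf{w} \cdot \mathbf{v}_i = \mathbf{w} \cdot \mathbf{v}_j$$

is the same as

$$\mathbf{w} \cdot \mathbf{v}_i - \mathbf{w} \cdot \mathbf{v}_j = 0$$

or

$$\mathbf{w} \cdot (\mathbf{v}_i - \mathbf{v}_j) = 0$$

An $\mathbf{s}_k = (\mathbf{v}_i - \mathbf{v}_j)$ can be defined for each case where $b_i \ne b_j$. $\mathbf{w}$ must then be adjusted to some $\mathbf{w}'$, so that $\mathbf{w}' \cdot \mathbf{s}_k \ne 0$. This can be done by setting

$$\mathbf{w}' = \mathbf{w} + \varepsilon \mathbf{s}_k$$

For $\mathbf{s}_l$ satisfying $\mathbf{w} \cdot \mathbf{s}_l \ne 0$, $\mathbf{w}$ should not be disturbed so as to result in the condition

$$(\mathbf{w} + \varepsilon \mathbf{s}_k) \cdot \mathbf{s}_l = 0 \qquad (3.1)$$

because if $\mathbf{w}$ is acceptable with respect to $\mathbf{s}_l$ we desire that it should remain so. Since for each $\mathbf{s}_l$ there is at most one $\varepsilon$ that satisfies equation 3.1 and there are finitely many $\mathbf{s}_l$ for which $\mathbf{w}$ must not be incorrectly disturbed, a suitable $\varepsilon$ can be found. The algorithm proceeds by adjusting $\mathbf{w}$ as described above for each $\mathbf{s}_k$ where $\mathbf{w} \cdot \mathbf{s}_k = 0$.

4 Selection of $\lambda$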
The function $a(\,)$ projects each input vector onto $\mathbf{w}$. The length of this projection is in $\mathbb{R}$. $\lambda$ must be set so that $a(\mathbf{v}_i)$ lies under a portion of the sine function that is greater than zero whenever $b_i = 1$, and $a(\mathbf{v}_i)$ is above a portion of the sine function that is less than zero whenever $b_i = 0$.

Ideally, one would like a maximum of the sine curve to occur at $a(\mathbf{v}_i)$ when $b_i = 1$ and a minimum to occur when $b_i = 0$. In other words, we would like to have

$$(2\pi/\lambda) \cdot a(\mathbf{v}_i) = \pi/2 + 2\pi n$$

for some integer $n$ when $b_i = 1$ and

$$(2\pi/\lambda) \cdot a(\mathbf{v}_i) = 3\pi/2 + 2\pi n$$

for some integer $n$ when $b_i = 0$. Starting with some arbitrary wavelength $\lambda_0$, one can adjust the wavelength for each projection $a(\mathbf{v}_i)$. The $a(\mathbf{v}_i)$ can first be reordered with $i > j$ implying $a(\mathbf{v}_i) > a(\mathbf{v}_j)$. $\lambda_i$ will be defined to be the wavelength after adjustment for $a(\mathbf{v}_i)$. In general $\sin[a(\mathbf{v}_i)] \ne 1$ or $-1$ as desired, for arbitrary $\lambda$. However, there are values $r$ such that $\sin[a(\mathbf{v}_i) + r] = 1$ or $-1$ as desired. Define

$$r_i = \text{the smallest such } r$$

The number of wavelengths between $a(\mathbf{v}_i) + r_i$ and zero will determine, in part, how much $\lambda_{i-1}$ will need to be adjusted to form $\lambda_i$. Define

$$w_i = [\text{the number of wavelengths between } a(\mathbf{v}_i) + r_i \text{ and zero}] = [a(\mathbf{v}_i) + r_i]/\lambda_{i-1}$$

Let $\Delta\lambda_{i-1}$ be the desired adjustment to $\lambda_{i-1}$ in forming $\lambda_i$:

$$\lambda_i = \lambda_{i-1} + \Delta\lambda_{i-1}$$

From the definitions given so far one can deduce that

$$\Delta\lambda_{i-1} = -r_i/w_i = -r_i \lambda_{i-1}/[a(\mathbf{v}_i) + r_i]$$

After the wavelength is adjusted for $a(\mathbf{v}_i)$, subsequent adjustments will disturb the wavelength that is ideal for $a(\mathbf{v}_i)$. Training pair $i$ can afford to have the wavelength disturbed by at most $\lambda_i^2/4a(\mathbf{v}_i)$, the change that shifts the phase at $a(\mathbf{v}_i)$ by a quarter of a period. Therefore, the condition

$$\lambda_i^2/4a(\mathbf{v}_i) > |\Delta\lambda_i + \Delta\lambda_{i+1} + \Delta\lambda_{i+2} + \cdots + \Delta\lambda_{m-1}| = \sum_{j=i}^{m-1} r_{j+1}\lambda_j/[a(\mathbf{v}_{j+1}) + r_{j+1}] \qquad (4.1)$$

must be satisfied. Since $r_{j+1} < \lambda_j$ and the $\lambda_j$ are decreasing, (4.1) will hold if $a(\mathbf{v}_i) \ll a(\mathbf{v}_k)$ when $i < k$. This can be achieved by applying a function to the result of $a(\,)$, which redistributes the projection values prior to determining $\lambda$. This is the role of $\sigma(\,)$.
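A single wavelength adjustment step can be sketched as follows. The sketch assumes a positive projection value, and it interprets the offset r in terms of the full phase 2π(a + r)/λ, which the text leaves implicit; the function name `adjust_wavelength` is hypothetical. It finds the smallest non-negative r that puts the desired extremum at a + r, then shrinks the wavelength by rλ/(a + r) so that the same extremum lands exactly on a.

```python
import math

def adjust_wavelength(a, wavelength, b):
    """One adjustment step for projection value a > 0 and target bit b.

    Returns (new_wavelength, r), where r is the smallest non-negative offset
    such that sin(2*pi*(a + r)/wavelength) is +1 (b == 1) or -1 (b == 0).
    """
    if a <= 0 or wavelength <= 0:
        raise ValueError("this sketch assumes a > 0 and wavelength > 0")
    frac = 0.25 if b == 1 else 0.75          # position of the extremum within a period
    cycles = a / wavelength                  # periods between zero and a
    n = math.ceil(cycles - frac)             # first extremum at or beyond a
    r = (n + frac - cycles) * wavelength     # distance from a to that extremum
    delta = -r * wavelength / (a + r)        # the adjustment -r/w with w = (a + r)/wavelength
    return wavelength + delta, r
```

For instance, with `a = 1.0`, `wavelength = 0.7`, and `b = 1`, the step returns a wavelength of 1/2.25, so that exactly 2.25 periods fit between zero and the projection and the sine reaches its maximum there.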
5 Determining $\sigma(\,)$

Let $m_\Delta = \min\{|a(\mathbf{v}_i) - a(\mathbf{v}_k)|$ for all pairs $i, k\}$
Defining o( ) as o[a(v)]= @V)/TI?Af'
will cause the terms on the right-hand side of (4.1) to decrease rapidly enough so that (4.1) holds.
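As an illustration of the weight selection procedure of Section 3, the sketch below uses exact rational arithmetic so that the tests for a zero inner product are not blurred by rounding. It assumes that no two identical input vectors carry different target bits (otherwise no weight vector can work). The names `select_weights` and `dot`, and the rule of halving the step until it avoids the finitely many forbidden values, are choices made for this sketch rather than details given in the note.

```python
from fractions import Fraction
from itertools import combinations

def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def select_weights(pairs, w=None):
    """Return w such that w.v_i != w.v_j whenever b_i != b_j.

    pairs -- list of (input_vector, bit) training pairs
    w     -- optional starting weight vector (defaults to all zeros)
    """
    vs = [[Fraction(x) for x in v] for v, _ in pairs]
    bits = [b for _, b in pairs]
    w = [Fraction(0)] * len(vs[0]) if w is None else [Fraction(x) for x in w]
    # One difference vector per pair of inputs with different target bits.
    diffs = [[p - q for p, q in zip(vs[i], vs[j])]
             for i, j in combinations(range(len(vs)), 2) if bits[i] != bits[j]]
    for s_k in diffs:
        if dot(w, s_k) != 0:
            continue  # already acceptable with respect to s_k
        # Step sizes that would zero out a currently nonzero w.s_l (at most one per s_l).
        forbidden = {Fraction(0)}
        for s_l in diffs:
            d = dot(s_k, s_l)
            if dot(w, s_l) != 0 and d != 0:
                forbidden.add(-dot(w, s_l) / d)
        eps = Fraction(1)
        while eps in forbidden:
            eps /= 2
        w = [wi + eps * si for wi, si in zip(w, s_k)]
    return w
```

On the four XOR patterns, starting from the zero vector, this returns a weight vector whose projections separate every pair of inputs with different target bits, which is all that the later wavelength selection requires.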
References
Carrol, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pp. I-607-I-611, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals 2(4), 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Received 10 April 90; accepted 30 July 90.
Communicated by Andrew Barto
Active Perception and Reinforcement Learning
Steven D. Whitehead
Dana H. Ballard
Department of Computer Science, University of Rochester, Rochester, NY 14627 USA

This paper considers adaptive control architectures that integrate active sensorimotor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. A new decision system that overcomes these difficulties is described. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its attention in order to collect necessary sensory information.

1 Introduction

Recently there has been a resurgence of interest in intelligent control architectures that are based on reinforcement learning methods (RLM) (Barto et al. 1990a; Holland 1986; Miller et al. 1990; Sutton 1988; Watkins 1989; Wilson 1987). These architectures are appealing because they are both situated and adaptive. Nevertheless, they have been applied only to simple tasks, such as pole balancing (Barto et al. 1983), simplified navigation (Barto and Sutton 1981; Watkins 1989; Wilson 1987), and easy manipulation games (Anderson 1989). One problem that has prevented these architectures from being applied to more complex control tasks has been the inability of reinforcement learning algorithms to deal with limited sensory input. That is, these learning algorithms depend on having complete access to the state of the task environment.
Since realistic sensory and effector systems must inevitably act on a portion of the task environment, the need for a complete representation is extremely limiting. To overcome this limitation, we have taken an active approach to perception and action, based on indexical representations (Agre and Chapman 1987). Our main result shows that integrating indexical representations (and active perception, in general) and reinforcement learning into a single control architecture is nontrivial because the use of indexical representations results in internal states that are ambiguous with respect to the state of the external world. We term this phenomenon perceptual aliasing and show that it severely interferes with the system's ability to learn an adequate control policy. A new decision system has been developed that overcomes these difficulties. The result is an adaptive architecture that learns both the overt physical action needed to solve a problem and where to focus its attention in order to disambiguate the current situation with respect to the task. These ideas are illustrated in a system that learns a simple block manipulation task.

Neural Computation 2, 409-419 (1990) © 1990 Massachusetts Institute of Technology

2 Foundations

2.1 Embedded Learning Systems. Our formal model for describing embedded learning systems is shown in Figure 1. The world is modeled as a deterministic automaton whose state changes depend on the actions of an agent. The world is formally described by the triple $(S_E, A_E, W)$, where $S_E$ is the set of possible world states, $A_E$ is the set of possible physical actions executable by the agent, and $W$ is the state transition function mapping $S_E \times A_E$ into $S_E$. Our model of the agent is slightly more complex, consisting of three components: a sensorimotor subsystem, a reward center, and a decision subsystem. The sensorimotor subsystem implements three functions: (1) a perceptual function $P$, (2) an internal configuration function $I$, and (3) a motor function $M$. On the sensory side, the world is transformed into the agent's internal representation. Since perception is active, this mapping is dynamic and depends on the configuration of the sensorimotor apparatus (i.e., $P: S_E \times C \to S_I$).
On the motor side, the decision system has a set of internal commands $A_I$ that affect the model in two ways: overt actions change the state of the external world (by being translated into external actions, $A_E$), and perceptual actions change the configuration of the sensorimotor subsystem, $C$.¹ As with perception, the configuration of the sensorimotor subsystem modulates the effects of internal commands. Thus, $M: A_I \times C \to A_E$ and $I: A_I \times C \to C$. The second subsystem in our model is the reward center. It implements a reward function $R$, which maps external states $S_E$ into real-valued rewards $\mathbb{R}$. Rewards are used by the decision subsystem to improve performance.
¹Here $C$ is the set of possible configurations of the agent's sensorimotor system. A particular configuration might specify the direction of gaze, the position of a manipulator, etc. In the indexical-functional sensorimotor system described below, the configuration of the sensorimotor subsystem is described completely by the position of the agent's markers.
Figure 1: A formal model for embedded learning systems with active sensorimotor subsystems.

The final component of the agent is the decision subsystem. The decision subsystem does not have access to the state of the external world, but only the agent's sensed internal representation. On the motor side, the decision subsystem generates internal action commands that are interpreted by the sensorimotor system (i.e., $B: S_I \times \mathbb{R} \to A_I$). The objective of the subsystem is to learn a control policy that maximizes its return, which is defined as a discounted sum of the reward received over time:

$$\text{return} = \sum_{n=0}^{\infty} \gamma^n r_{t+n}, \qquad (2.1)$$
where $r_t$ is the reward received at time $t$, and $\gamma$ is a discount factor between 0 and 1 (Watkins 1989).

2.2 Indexical Representations. The central premise underlying indexical representations is that a system need not name and describe every object in the domain, but instead should register information only about objects that are relevant to the task at hand. Further, those objects should be indexed according to the function they play in the current behavior. Two important implications of this approach are (1) it leads to compact and limited scope internal representations since at any moment the sensory system registers only the features of a few key objects; and (2) it leads to systems that actively control their perceptual apparatus - actively manipulate the binding between objects in the world and internal representational structures. This active approach to sensorimotor design is realized with markers. Markers are analogous to variables, implemented by the sensorimotor system; they get dynamically bound to objects in the world and remain bound to those objects until being bound to other objects. Changing a marker's binding is accomplished by executing explicit actions specifically targeted for that marker. Sensory inputs register features of and relationships between marked objects. Markers also play an important role in motor control since overt actions are predominantly specified with respect to markers. In this case, a marker's binding acts to establish the reference frame in which the action is performed. A key feature of markers is the constraint that there are only a limited number of them, say fewer than 10. Figure 2 shows a sensorimotor subsystem, used by the block stacking program described below, which has two markers. The small number of markers and the limited number of features associated with each marker keep both the internal representation and the number of possible actions much smaller than is possible with conventional representations. If an object in the world is not bound to a marker, then it is invisible to the system (except for the effects it registers in peripheral inputs).

2.3 Reinforcement Learning. The task faced by the decision subsystem is a classic decision problem: given a description of the current state, a set of possible actions, and previous experience, choose the best next action.
We have focused on a representative reinforcement learning method known as Q-learning (Watkins 1989); however, our results regarding the interactions between RLMs and active perception apply to virtually all critic-based reinforcement learning algorithms (Barto et al. 1990b). A decision system based on Q-learning maintains two interdependent functions: an action-value function $Q$ and a policy function $\pi$. The action-value function, $Q$, represents the system's estimate of the relative merit of making a given decision. That is, $Q(x, a)$ is the return the system expects to receive given that it executes action $a$ in state $x$ and follows its regular policy ($\pi$) thereafter. The policy function, $\pi$, is the system's current estimate of the optimal policy. This function maps internal states ($x \in S_I$) into action commands ($a \in A_I$) and is defined in terms of the action-value function as follows:

$$\pi(x) = a \text{ such that } Q(x, a) = \max_{b \in A_I}[Q(x, b)] \qquad (2.2)$$
Figure 2: The specification for the indexical sensorimotor system used by a program (to be described later) that learns to solve a simple block manipulation task. The system has two markers, the action frame and the attention frame. The action frame is used for both perception and action, while the attention frame is used only for perception. Each marker has a set of local aspects associated with it; these are the color and shape of the marked object, the height of the stack the marked object belongs to, whether or not the marked object is sitting on the table, and whether or not the marked object is held by the robot. The system has two relational aspects: one for recording vertical alignment between the two markers and one for recording horizontal alignment. Peripheral aspects include inputs for detecting the presence of red, green, and blue objects in the scene, and for detecting whether the hand is currently gripping an object. The internal motor commands for the system are shown on the right. All told the sensorimotor system has a 20-bit input vector (4 bits of peripheral aspects, 14 bits of local aspects, 2 bits of relational aspects) and 14 actions (8 overt, 6 perceptual).
Initially, the action-value function may be erroneous. However, a learning rule for incrementally improving $Q$ is given by

$$Q_{t+1}(x_t, a_t) = (1 - \alpha) Q_t(x_t, a_t) + \alpha [r_{t+1} + \gamma U_t(x_{t+1})] \qquad (2.3)$$

where $Q_t$ is the action-value function at time $t$, $r_{t+1}$ is the reward received at time $t + 1$, $\alpha$ is a constant that affects the learning rate, and $U_t$ is the state utility function. For Markov decision problems, Watkins (1989) has shown that when this rule is embedded in a control procedure that (1) visits each state infinitely often and that (2) decreases the learning rate over time (i.e., $\alpha \to 0$), it is guaranteed to converge on the optimal decision policy. The learning rate can be improved substantially by incorporating/learning a domain model and updating the decision system based on both actual experience (trial-and-error learning) and hypothetical experience (planning) (Sutton 1990).

A simple (nonoptimal) decision subsystem implementing this rule works as follows. The first step is to select the next action. Ninety percent of the time the system selects the action specified by its control policy $\pi(x)$; the remaining 10% of the time it chooses an action at random. The action is then executed and the subsequent state and reward are noted. Once the effects of the action are known, the error in the action-value function for the current decision can be computed and used to update $Q$. Finally, $\pi(x)$ and $U(x)$ are updated to reflect changes in $Q$. The reason the decision system does not always select the action specified by its policy is that the action-value of a decision is updated only when that decision is executed. Occasionally selecting a random action ensures that each decision will be evaluated periodically. [In our experiments we implemented a slightly more complex rule that uses a weighted sum of n-step errors (Sutton 1988; Watkins 1989).]
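The simple decision subsystem described above can be sketched in a few lines. This is a minimal tabular version of standard 1-step Q-learning with 10% random exploration, not the authors' implementation (which used a weighted sum of n-step errors). The five-state chain world `chain_step`, the episode structure, the random tie-breaking, and all parameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(step, start, actions, episodes=200, alpha=0.5, gamma=0.9,
               explore=0.1, max_steps=100, seed=0):
    """Tabular 1-step Q-learning; returns the learned action-value table."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initially zero

    def utility(x):
        # State utility: the best action-value available in state x.
        return max(Q[(x, b)] for b in actions)

    def policy(x):
        # Greedy policy, with ties broken at random.
        best = utility(x)
        return rng.choice([b for b in actions if Q[(x, b)] == best])

    for _ in range(episodes):
        x = start
        for _ in range(max_steps):
            # Follow the policy 90% of the time, otherwise act at random.
            a = rng.choice(actions) if rng.random() < explore else policy(x)
            x_next, reward, done = step(x, a)
            # Terminal states are given zero utility in this sketch.
            target = reward + (0.0 if done else gamma * utility(x_next))
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = x_next
            if done:
                break
    return Q

def chain_step(x, a, goal=4):
    """Hypothetical world: states 0..goal on a line, reward 1 on reaching goal."""
    x_next = min(max(x + a, 0), goal)
    return x_next, (1.0 if x_next == goal else 0.0), x_next == goal
```

Calling `q_learning(chain_step, 0, (-1, 1))` yields a table in which moving toward the goal has the higher action-value in every nonterminal state. In this world every state is distinguishable, so the learned utility rises monotonically toward the goal.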
3 Perceptual Aliasing
The straightforward integration of indexical representations and RLM decision systems leads to undesirable interactions that prevent the decision system from learning an optimal control strategy. These interactions arise because the mapping between world states and internal states is many to many, depending on the configuration of the sensorimotor system. We call this overlap between world and internal state spaces perceptual aliasing and say an internal state is perceptually ambiguous if it can represent multiple world states with different utilities. Figure 3 shows how perceptual aliasing can arise in a very simple state space. In the figure, the mapping between internal states and world states is one to one for all states except external states $s_2$ and $s_5$, which map to internal state $s_a^i$.² After running the 1-step Q-learning algorithm for many trials, we find the following. First, since the state utility and action-value functions represent expected returns, for $s_a^i$ they take on values somewhere between the corresponding values for $s_2$ and $s_5$ that would have been obtained had the learning component had direct access to the world state. That is, $U(s_2) \le U(s_a^i) \le U(s_5)$ and $Q(s_2, a_r) \le Q(s_a^i, a_r) \le Q(s_5, a_r)$, where $a_r$ is the optimal action associated with $s_a^i$. Second, the state utility and action-values do not monotonically increase as the system approaches the goal. Instead a local maximum in the utility function arises at $s_a^i$ that is potentially devastating to the development of an optimal policy.

² In this discussion, internal states are denoted with the superscript "i". Thus, $s_t^i$ denotes the internal state corresponding to the world state $s_t$. $s_a^i$ is the exception, corresponding to both $s_2$ and $s_5$.

Figure 3: An example of the effect of perceptual aliasing on the utility function. The top graph shows a state transition diagram for a simple problem. In this case, the system receives reward only on entering state G and the utility function monotonically increases as the distance from the goal decreases. The bottom graph shows the internal state space for the problem when the sensory system confounds states $s_2$ and $s_5$. In this case, the utility function is no longer monotonic and the optimal decision policy is unstable. The mapping from world state space to internal state space is 1-1 except for $s_2$ and $s_5$, which both map to $s_a^i$.
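The aliasing effect described above can be reproduced numerically. The following is a minimal sketch, not the state diagram of Figure 3 itself: it assumes a hypothetical six-state chain ending in the goal G, a fixed policy that always steps toward the goal, equal visitation of the two aliased world states, and illustrative values for the discount and reward. The function name `aliased_utilities` and the label `"a"` for the shared internal state are inventions for this example.

```python
def aliased_utilities(n_states=6, aliased=(2, 5), gamma=0.9, reward=1.0, sweeps=200):
    """Evaluate the 'always step toward the goal' policy on a chain when the
    world states in `aliased` share one internal state, labeled "a"."""
    # Hypothetical sensor model: world state -> internal state label.
    internal = {s: ("a" if s in aliased else s) for s in range(1, n_states + 1)}
    utility = {label: 0.0 for label in internal.values()}
    for _ in range(sweeps):
        backups = {}
        for s in range(1, n_states + 1):
            if s == n_states:
                target = reward  # the next step enters the goal G
            else:
                target = gamma * utility[internal[s + 1]]
            backups.setdefault(internal[s], []).append(target)
        # An internal state's utility is the average over the world states it represents.
        utility = {label: sum(ts) / len(ts) for label, ts in backups.items()}
    return utility


u = aliased_utilities()
# Utilities that would be learned with direct access to the world state.
true_u = {s: 0.9 ** (6 - s) for s in range(1, 7)}
print(true_u[2] <= u["a"] <= true_u[5])  # aliased value lies between the two true values
print(u["a"] > u[3])                     # state 3 is nearer the goal than state 2, yet valued lower
```

Under these assumptions both checks print `True`: the shared internal state takes a value between the true utilities of the two world states it represents, and it forms a local maximum relative to the unambiguous state that follows it.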
4 Dealing with Perceptual Aliasing

Our main result is a decision system based on reinforcement learning that can cope with perceptual ambiguity. The new algorithm has two phases: a perceptual phase and a decision phase. During the perceptual phase, a series of perceptual actions are executed, and during the decision phase a single overt action is executed. Perceptual actions are actions that change the configuration of the sensorimotor system (e.g., shift the direction of gaze), but do not affect the state of the external world. They are denoted by the set $A_p$. Overt actions are actions that change the state of the world (e.g., manipulating an object). They are denoted by the set $A_o$. During the perceptual phase, the system executes a sequence of perceptual actions in order to collect a set of internal representations (views) of the current world state. Denote the set of internal states obtained during the perceptual phase as $S_t$. During the decision phase, the system chooses an internal state, called the lion state, that takes the lion's share of the responsibility (credit or blame) for the outcome of the current decision. If the system is following its policy then the lion is defined as lion $= s_l$ such that $\exists a_l \{Q(s_l, a_l) = \max_{s \in S_t, a \in A_o} [Q(s, a)]\}$. That is, the lion corresponds to the state, among $S_t$, that has the maximal action-value. When the system chooses a random action, $a_{random}$, the lion is defined as lion $= s_l$ such that $Q(s_l, a_{random}) = \max_{s \in S_t} [Q(s, a_{random})]$. In this case, the lion is the state, among $S_t$, with the maximal action-value consistent with the action $a_{random}$. The idea underlying the use of a lion is that the lion state should be an internal state that unambiguously represents the current world state, and once such a state is found it is used to direct all actions associated with the world state it represents. Perceptually ambiguous lions are detected and suppressed as follows.
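The lion selection rule, together with the overestimation test that the next paragraph describes, can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the table layout, the function names `select_lion` and `update_lion`, the default discount and learning rate, and the choice to pass the next state's utility in as a number are all assumptions. The reduced-rate update of nonlion states is left out.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple

State = Hashable
Action = Hashable
QTable = Dict[Tuple[State, Action], float]


def select_lion(q: QTable, views: Iterable[State], overt_actions: Iterable[Action],
                forced_action: Optional[Action] = None) -> Tuple[State, Action]:
    """Return the (lion, action) pair with maximal action-value among the views.

    If `forced_action` is given (e.g., a randomly chosen action), the lion is
    the view with the maximal action-value for that action only.
    """
    actions = [forced_action] if forced_action is not None else list(overt_actions)
    pairs = [(s, a) for s in views for a in actions]
    return max(pairs, key=lambda sa: q.get(sa, 0.0))


def update_lion(q: QTable, lion: State, action: Action, reward: float,
                next_utility: float, gamma: float = 0.9, alpha: float = 0.5) -> bool:
    """Suppress an overestimating lion; otherwise apply a 1-step Q-learning update.

    Returns True when the lion was suspected of ambiguity and reset to 0.0.
    """
    one_step_return = reward + gamma * next_utility
    current = q.get((lion, action), 0.0)
    if current > one_step_return:
        q[(lion, action)] = 0.0  # suspected ambiguous: suppress
        return True
    q[(lion, action)] = current + alpha * (one_step_return - current)
    return False


# Hypothetical views and overt actions for illustration only.
q = {("view1", "pick"): 5.0, ("view2", "pick"): 2.0, ("view2", "place"): 7.0}
lion, action = select_lion(q, ["view1", "view2"], ["pick", "place"])
suppressed = update_lion(q, lion, action, reward=0.0, next_utility=1.0)
```

Resetting to 0.0 follows the example given in the text; any other suppression value would fit the same structure.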
If the action-value associated with the lion, $Q(s_l, a_t)$, is greater than the estimated return obtained after one step, $r_t + \gamma U(s_{t+1})$, then the lion is suspected of being ambiguous and the action-value associated with it is suppressed (e.g., reset to 0.0). Actively reducing the action-values of lions that are suspected of being ambiguous gives other (possibly unambiguous) internal states an opportunity to become lions. If the lion does not overestimate the return, it is updated using the standard 1-step Q-learning rule. To prevent ambiguous states from climbing back into contention, the estimates for nonlion states (i.e., $s \in S_t$ and $s \neq$ lion) are updated at a lower learning rate and only in proportion to the error in the lion's estimate. The observation that allows this algorithm to work is that ambiguous states will eventually (one time or another) overestimate action-values; consequently, they will eventually be suppressed. An ambiguous state overestimates because its utility value is an average, over the world states it represents, of the return received. When a world state with a return lower than the average is encountered, the lion overestimates [e.g., in Section 3, $U(s_2) < U(s_a^i)$]. On the other hand, it can be shown that an unambiguous lion is stable (i.e., will not overestimate its action-value) if every state between the lion and the goal also has an unambiguous lion. Thus, ambiguous states are unstable with respect to lionhood, while unambiguous states eventually become stable.

[Figure 4 plot: Average Solution Time (0 to 70 steps) versus Number of Trials (0 to 450).]

Figure 4: (Top) Shown is a sequence of world states in a typical solution path for the block manipulation task. Depending on the placement of the attention frame (not shown), states 1, 3, 4, 5, and 6 may be represented ambiguously. (Bottom) Shown is a plot of the average number of steps required to solve the block manipulation task as a function of the trials seen by the agent. Initially the number of steps required is high, near the maximum of 75, since the robot thrashes around randomly searching for reinforcement. However, as the robot begins to solve a few problems its experience begins to accumulate and it develops a general strategy for obtaining reward. By the end of the experiment, the time required to solve the problem is close to optimal. The system's performance does not converge to optimal since 10% of the time it chooses a random action.

5 An Example
To test our ideas we have implemented a system that learns a very simple block manipulation task. An agent is presented with a pile of blocks on a conveyor belt. The agent can manipulate the pile by picking and placing blocks. When the agent arranges the blocks in certain goal configurations, it receives a fixed reward of 5000 units. Otherwise it receives no reward. When the agent solves the puzzle the pile immediately disappears and a new pile comes down the belt. If the agent fails to solve the puzzle after 75 steps, the pile falls off the end of the conveyor and a new pile appears at the front. A pile contains a random number of blocks (maximum 50), which are arranged into random stacks (maximum height 4). A block can be any one of three colors: red, green, or blue. The robot's sensorimotor system is the indexical system described in Figure 2. The particular task we studied rewards the agent whenever it picks up a green block. Although this task is simple, it foils conventional reinforcement learning algorithms since improper placement of the system's markers leads to perceptually ambiguous internal states (Fig. 4, top). Nevertheless, our new algorithm learns an adequate policy despite the existence of these states (Fig. 4, bottom).

Acknowledgments

This work was supported in part by NSF research Grant DCR-8602958 and in part by NSF research Grant IN-8903582.
Received 5 June 90; accepted 30 July 90.
Communicated by Antonio Damasio
Structure from Motion with Impaired Local-Speed and Global Motion-Field Computations

Lucia M. Vaina
Intelligent Systems Laboratory, College of Engineering, Boston University, Boston, MA 02215 USA, and Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Norberto M. Grzywacz
Center for Biological Information Processing, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Marjorie LeMay
Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115 USA
Humans can recover the three-dimensional structure of moving objects from their changing two-dimensional retinal image, in the absence of other cues to three-dimensional structure (Wallach and O'Connell 1953; Braunstein 1976). In this paper, we describe a patient, A.F., with bilateral lesions involving the visual cortex who is severely impaired on computing local-speed and global-motion fields, but who can recover structure from motion. The data suggest that although possibly useful, global-motion fields are not necessary for deriving structure from motion. We discuss these results from the perspective of theoretical models for this computation.
1 Introduction
The recovery of structure from motion is the ability of humans to perceive the three-dimensional structure of objects from motion cues. This ability is one of the building blocks of our perception of three dimensions and has been extensively studied both psychophysically (Wallach and O'Connell 1953; Johansson 1975; Braunstein 1976; Todd 1984; Braunstein and Shapiro 1987; Grzywacz et al. 1987; Ramachandran et al. 1988; Siegel and Andersen 1988; Dosher et al. 1989; Husain et al. 1989; Hildreth et al. 1990) and theoretically (Ullman 1979, 1984; Longuet-Higgins and Prazdny 1981; Bobick 1983; Bruss and Horn 1983; Koenderink and van Doorn 1986; Grzywacz and Hildreth 1987). Perhaps the most important lesson from these studies is that to recover structure from motion the brain assumes that moving objects are rigid (Ullman 1979) or quasirigid (Ullman 1984; Koenderink and van Doorn 1986; Grzywacz and Hildreth 1987).

Neural Computation 2, 420-435 (1990) © 1990 Massachusetts Institute of Technology

What computations does the visual system use in order to obtain structure from motion? The most natural candidate for such computations is the velocity field. Theoretically, one can usually obtain the structure of a rigid object if its velocity field is known (Longuet-Higgins and Prazdny 1981; Bruss and Horn 1983). This is possible because, under the rigidity assumption, the distance between the object's features is constant, imposing simple geometric constraints on the velocities. Psychophysical studies of two- and three-dimensional structures from brief motion observations (Johansson 1975; Dosher et al. 1989) support the use of velocity in the recovery of structure from motion. It has been shown, however, that instantaneous velocity measurements alone may not be sufficient, since it takes long observation periods to perceive structure from motion accurately (Grzywacz et al. 1987; Siegel and Andersen 1988; Husain et al. 1989; Hildreth et al. 1990). Alternative forms of computations on the image might also be used to recover structure from motion. One such computation involves the combination of feature positions over time without explicit use of velocity information (Ullman 1979, 1984; Grzywacz and Hildreth 1987). This computation also works by exploiting rigidity: The constancy of the distance between an object's features geometrically constrains the structures consistent with a set of its views.
Other computations that possibly underlie the recovery of structure from motion include the use of geometric characteristics of the image, such as luminance boundaries (Ramachandran et al. 1988) and axis of symmetry (Braunstein and Shapiro 1987), which allow the use of partial velocity information, such as motion direction (Bobick 1983).

An important insight into which of the above computations are necessary for the recovery of structure from motion may come from studying the performance of patients with focal lesions involving the visual cortex. Selective visual motion deficits in patients who can recover structure from motion would indicate which motion measurements are not necessary for this computation.

2 The Patient
In this paper, we present data from a patient whose performance on psychophysical motion tasks indicates that the recovery of structure from motion does not necessarily rely on the computation of the velocity field.¹ The patient, A.F., a 60-year-old left-handed man, was studied following an acute hemorrhagic infarct in the posterior right hemisphere. The only significant aspect of his clinical background is a history of untreated hypertension. A magnetic resonance imaging (MRI) study, performed 3 months after the stroke, demonstrated a large, new, hemorrhagic infarction in the posterior right hemisphere involving the region of the temporal-parietal-occipital junction. The MRI study also revealed an old, smaller lesion in the left hemisphere involving the same anatomical areas as in the right hemisphere. Figure 1 shows A.F.'s visual fields, and the anatomical loci of his bilateral lesions both in axial MRI and on a schematic representation of the lateral view of a human brain. On neuroophthalmological examination, he had good letter acuity (20/30 in both eyes), normal contrast sensitivity at spatial frequencies ranging from 0.2 to 9 cycles/degree, normal saccadic eye movements to static targets, but impaired saccades to moving targets, and the optokinetic nystagmus was absent in all directions. He also substituted saccades for smooth pursuits when he followed a smoothly moving target. This occurred for motions both to the right and to the left. On the neuropsychological evaluation with the Wechsler Adult Intelligence Scale-Revised, he obtained an average Verbal IQ of 104 and a severely depressed Performance IQ of 68. Reading, writing, and oral calculations were not impaired. He showed no deficits on visual semantic tasks. Color (Farnsworth-Munsell 100-Hues test), form (Efron 1968), and texture discrimination (Julesz 1984) were excellent. His drawing was poor, disorganized, and lacked perspective. Stereopsis, tested with Julesz stereograms (Julesz 1971), was lost at all disparities tested, which distressed him enormously, since previously he enjoyed anaglyphs and geometric puzzles.
Monocular depth, tested informally, was good. We followed the patient on a regular basis for 18 months, and his performance on the psychophysical tasks and the results of the neuroophthalmological examinations remained consistent throughout the period.
¹ The stimuli were generated and presented, and responses collected and analyzed, using a Macintosh IIcx computer with an extended 8-bit video card. Stimuli were presented in the center of the Macintosh standard RGB monitor with a resolution of 640 x 480 pixels and a vertical retrace interrupt frequency of 66.7 Hz. The contrast linearity of the display was measured and found to hold up to 98% contrast. In all the psychophysical tasks reported here, the display consisted of dynamic random dot patterns, which were presented for 2 sec in each trial. The viewing distance was 65 cm and the dot size was 1.8 x 1.8 arcmin. The background in the display was black and the dots were white. The room illumination was maintained at the low photopic level and the subjects viewed the display binocularly. Most of the normal subjects were naive observers.
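The display parameters in this footnote can be converted into physical quantities with elementary trigonometry. The sketch below is only illustrative arithmetic: the helper names `visual_angle_to_cm` and `frame_duration_ms` are hypothetical, and the viewing distance and refresh rate are the only values taken from the text.

```python
import math


def visual_angle_to_cm(angle_arcmin: float, viewing_distance_cm: float = 65.0) -> float:
    """Extent on the screen (cm) that subtends `angle_arcmin` at the eye."""
    angle_rad = math.radians(angle_arcmin / 60.0)
    return 2.0 * viewing_distance_cm * math.tan(angle_rad / 2.0)


def frame_duration_ms(refresh_hz: float = 66.7) -> float:
    """Duration of one video frame (ms) at the given vertical refresh rate."""
    return 1000.0 / refresh_hz


dot_side_cm = visual_angle_to_cm(1.8)   # side of one 1.8 x 1.8 arcmin dot at 65 cm
frame_ms = frame_duration_ms()          # one retrace period at 66.7 Hz
print(round(dot_side_cm, 4), round(frame_ms, 2))
```

At these settings a dot side is roughly a third of a millimeter on the screen and one retrace period is about 15 msec.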
3 Experiments

The patient showed deficits in local speed discrimination and perception of motion coherence. The speed discrimination task (Vaina 1988, 1989; Vaina et al. 1988, 1989) is illustrated in Figure 2A. The display consisted of two rectangles (4 x 2.5 degrees²), each containing 20 dots² moving in randomly distributed directions. In each rectangle, for a given trial, the speed of all the dots was constant, forming the basis for a grouping into a global, coherent speed field. This was a two-alternative forced-choice task in which the subject had to determine in which rectangle the dots were moving faster. The faster speed, presented at random in either rectangle, was always 4.9 degrees/sec and the speed ratios ranged from 1.1 to 5.5. Data from A.F. (20 trials per speed ratio) and normal controls (26 subjects with 10 trials per speed ratio) are shown in Figure 2B. Although the control subjects performed well (above 75% correct) at the 1.47 speed ratio, A.F. failed on this task up to a ratio of 5.5. He was also dramatically impaired (Fig. 2D) on a motion coherence task similar to that of Newsome and Paré (1988) (Fig. 2C). As in their task, each trial presented a dynamic random dot field with a given percentage, $p$, of dots moving in a prespecified direction. The display was cyclic with a period of about 50 msec. In each cycle, a new set of dots was presented for 16 msec, after which it was erased and a new set plotted. (There were 15 dots in a 100 degrees² square aperture.) The probability that any given dot in a frame would be displaced (9 arcmin) in the chosen direction was $p$. This means that the probability for a given dot to move in the same direction for $n + 1$ frames decreased as $p^n$. The task was a four-alternative forced-choice in which the subject was asked to determine the direction of motion, which could be up, down, left, or right.
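The $p^n$ decay stated above follows from treating each frame-to-frame displacement as an independent draw with probability $p$. A minimal sketch, with hypothetical function names and an independence assumption that the text implies but does not spell out, compares the closed form with a small simulation:

```python
import random


def same_direction_run_probability(p: float, n: int) -> float:
    """Probability that a dot is displaced in the signal direction on n
    consecutive frame transitions (i.e., across n + 1 frames)."""
    return p ** n


def simulate_run_fraction(p: float, n: int, n_dots: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, assuming independent draws."""
    rng = random.Random(seed)
    hits = sum(all(rng.random() < p for _ in range(n)) for _ in range(n_dots))
    return hits / n_dots


# Illustrative coherence level, not a condition from the experiment.
analytic = same_direction_run_probability(0.5, 3)
estimate = simulate_run_fraction(0.5, 3)
```

Because long same-direction runs become rare so quickly at low coherence, the direction signal in this task has to be recovered by pooling across many dots rather than by tracking any single one.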
A staircase procedure was used to obtain the threshold of motion signal (percentage coherence) required to reliably determine the direction of motion. The threshold was obtained separately for presentations in left and right visual fields. The performance of A.F. was significantly below that of the normal controls (z test; p < 0.001) for presentations in either visual field (Fig. 2D).³ Furthermore, inspection of details of the data indicates that the probability that the performance was above chance for percentages of coherence equal to or lower than 17% was less than 0.01. (This percentage of coherence was the largest value in the staircase procedure that was below threshold for both visual fields.) The highest degree of coherence set by the staircase procedure for which reversals occurred was 78% (even in the worst normal subjects, reversals did not occur beyond 10% coherence).

² Dot density did not appear to play a significant role in A.F.'s performance in speed-related tasks. First, A.F. obtained similar results on the same task as in Figure 2A, but with dot density 50% lower. Second, A.F. scored at chance on a task of boundary detection by relative speed, in which the dot density was very high (50% light and 50% dark; Vaina et al. 1990). This deficit cannot be explained by an impaired ability to detect boundaries. The patient scored in the normal range on a task of boundary discrimination by relative directions and in tasks where the random-dot background was static (Vaina et al. 1990).

³ The subjects were instructed to maintain fixation on a mark placed 2 degrees to the right or left from the border of the aperture. Eye movements were informally controlled by the examiner, and those trials in which the subject failed to maintain fixation were discarded. However, we believe that controlled fixation would not modify the conclusions of this paper, because in an earlier similar test without fixation point, A.F. was still impaired.

Figure 1: Visual fields and four axial magnetic resonance images (MRI) of the patient's brain, and their localization on a schematic lateral view of a human brain. The left and right visual fields are illustrated in A and B, respectively. The patient has a congruous loss of the left inferior visual field bilaterally and a minimal loss in the upper visual field. (C, D, E, F) The relevant slices in axial view of T2 weighted MRI (TR 2000/TE 80). [Although A.F. is left handed (Oldfield-Geshwind questionnaire score of -50) the MRI scans show the anterior portion of the right hemisphere and the posterior portion of the left hemisphere to be wider than their counterparts, which is the type of asymmetry seen in the majority of normal right-handed individuals (LeMay and Kido 1978).] Labeling: bl, body of the lateral ventricle; h, hemosiderin; fl, frontal lobe; oh, occipital horn of the lateral ventricle; pl, parietal lobe; PO, parietal-occipital sulcus; sf, Sylvian fissure; ssc, suprasellar cistern; t, trigone of the lateral ventricles; t1, superior temporal gyrus; t2, medial temporal gyrus; ptof, parietotemporo-occipital fossa (see G). (C) The picture shows the temporal horns of the lateral ventricles and some patchy hyperintensities (labeled as ptof) at their margins, which appear to be more intense in the right hemisphere. (D) The picture shows local tissue loss in the right hemisphere at the site of the hemorrhagic stroke in the parietal lobe just medial to the occipital horn of the lateral ventricle and anterior to the parietal occipital sulcus. There is also some patchy hyperintensity in the temporal lobe along the lateral margin of the right occipital horn. Localized hyperintense areas are seen in the left temporal lobe adjacent to the lateral proximal margin of the occipital horn. Small areas of increased signal are scattered in the basal ganglia and in the deep white matter by the bodies of the lateral ventricles, which may be associated with small vessel ischemic changes. Images E and F show bilaterally larger patchy areas of hyperintensity posterior, medial, lateral, and above the trigones and at the margins of the bodies of the lateral ventricles. Tortuous narrow bands of hemosiderin (labeled h) are seen in both images at the site of the recent hemorrhage. (G) A schematic drawing of the adult human brain showing the major gyri and sulci in lateral view. Labeling: IFg, inferior frontal gyrus; ITg, inferior temporal gyrus; MFg, middle frontal gyrus; MTg, middle temporal gyrus; PoCg, postcentral gyrus; PrCg, precentral gyrus; SFg, superior frontal gyrus; STg, superior temporal gyrus; STS, superior temporal sulcus; MTS, middle temporal sulcus; PO, parietal occipital sulcus; SYL, Sylvian fissure; PTOF, parietotemporo-occipital fossa (see Figure 271 of Polyak 1957); several authors suggested that it corresponds to the human homologue of macaque MT. The horizontal lines correspond to the levels of the axial scans C, D, E, and F.

In spite of these deficits, A.F. performed well in a structure-from-motion task. Although in certain conditions his performance was not as good as that of normal subjects, it always remained well above chance. This was surprising, since our previous studies showed that patients with more extensive right parietal lesions failed this task (Vaina 1988, 1989; Vaina et al. 1988, 1989). They never saw a single object, but instead they described the display as "birds flying," "ants crawling," "snow blown by the wind," or "just dots moving."

The display in this task showed two dynamic random-dot fields (3 x 3 degrees²), each defined by 128 dots⁴ with finite point lifetime. At the end of its lifetime, a point disappeared and was replotted in a new random location (within the boundary of each display) and began a new trajectory. One of the fields portrayed the orthographic projection of a hollow cylinder rotating at 30 degrees/sec around its vertical axis. The average displacement of the dots between two frames was 5.6 arcmin. The rotation was simulated by 50 frames defined by dots lying at random locations on the cylinder's transparent surface (Figure 2E). The other random dot field contained an unstructured stimulus generated by randomly shuffling the velocity vectors present in the structured display and thereby destroying the spatial relationship between vectors. Both the structured and unstructured displays contained 50 frames and the duration of each frame was 33 msec. The spatial positions (left and right) of the structured and unstructured fields were randomly assigned. In one version of the task, the point lifetime was held constant at 400 msec and the amount of structure varied by shuffling an a priori determined percentage of the velocity vectors. In the second version of the task, the structure was held constant (100%) and the point lifetime of the dots in both the structured and unstructured displays varied. For both versions, there were 20 trials for each condition. We used a two-alternative forced-choice task in which the subject was asked to judge which of the two random-dot fields represented a better cylinder.
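The structured stimulus described above, dots on the surface of a rotating cylinder seen in orthographic projection, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' stimulus code: the cylinder radius and height are guessed from the 3 x 3 degree field, point lifetimes are ignored, and the function names are hypothetical. The velocity shuffle is shown for a single frame transition only.

```python
import math
import random


def cylinder_frames(n_dots=128, n_frames=50, radius=1.5, height=3.0,
                    rotation_deg_per_s=30.0, frame_ms=33.0, seed=0):
    """Orthographic projections (x, y), in degrees, of dots on a cylinder
    rotating about its vertical axis. Point lifetime is not modeled."""
    rng = random.Random(seed)
    # Each dot has a fixed angular position on the surface and a fixed height.
    dots = [(rng.uniform(0.0, 2.0 * math.pi), rng.uniform(-height / 2, height / 2))
            for _ in range(n_dots)]
    step = math.radians(rotation_deg_per_s * frame_ms / 1000.0)  # rotation per frame
    # Orthographic projection keeps x = r * sin(angle) and y; depth is discarded.
    return [[(radius * math.sin(theta + k * step), y) for theta, y in dots]
            for k in range(n_frames)]


def shuffled_velocities(frames, seed=0):
    """Displacement vectors between the first two frames, randomly reassigned
    among dots, which destroys the spatial relationship between vectors."""
    rng = random.Random(seed)
    velocities = [(x1 - x0, y1 - y0)
                  for (x0, y0), (x1, y1) in zip(frames[0], frames[1])]
    rng.shuffle(velocities)
    return velocities


frames = cylinder_frames()
unstructured = shuffled_velocities(frames)
```

In the projection the horizontal speed is greatest near the center of the cylinder and falls toward zero at its edges. Shuffling keeps the same set of velocity vectors but removes that spatial arrangement, which is what distinguishes the two fields.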
⁴ The results of normal subjects and A.F. were not statistically significantly different when the dot density was 50% lower.

Figure 2: Psychophysical motion tasks and data from A.F. and normal subjects. (A, C, E) Illustrate the paradigms used; for details of the experimental design see text. (A) A schematic representation of the stimuli employed in the local speed-discrimination task. (B) Results of the speed-discrimination task. The graph plots the percentage of correct answers as a function of the speed ratios between the two rectangles. The data for the normals are presented as a shaded area representing the mean ±1 standard deviation. A.F.'s data points present the means and the standard errors. For all ratios equal to or larger than 1.47, A.F.'s performance was significantly impaired compared to the normals. (C) A schematic representation of the stimuli employed in the motion-coherence task. (D) Results of the motion-coherence task. The graph plots the thresholds of percentage coherence required by normal subjects and A.F. to discriminate the net direction of motion. A.F.'s threshold is significantly elevated compared to the normals. (E) A schematic representation of the stimuli employed in the structure-from-motion task. (F, G) Results of the structure-from-motion task. For plotting the data, we followed the same procedures as in B. (F) The graph plots the percentage of correct answers as a function of the fraction of structure in the rotating cylinder. The arrow represents the mean results for normal subjects with the task modified such that the frames in the cylinder were presented in random order. A.F.'s performance was in the normal range for percentages of structure in the cylinder equal to or above 65%, but was significantly below that of the normal subjects in the trials where the structure was equal to or below 42%. As indicated by the arrow, the subjects were probably not using texture cues to solve this task. (G) The graph plots the percentage of correct answers as a function of the point lifetime of the dots. A.F.'s performance was in the normal range for the lifetime equal to 400 msec, but was significantly below that of the normal subjects in the trials with lifetimes equal to 200 and 100 msec.
The data presented in Figure 2F and G show that A.F.'s performance on this task was well above chance for all conditions. Even in the presence of noise, when the percentage of structure was reduced (Fig. 2F), his performance was in the normal range for percentages above 42%. A.F.'s performance was significantly reduced for percentages of structure equal to or lower than 42% (χ² test, p < 0.007). On these percentages of structure his scores were significantly lower than those of the worst normal subjects. Similarly, his performance was also dependent on the value of the point lifetime (Fig. 2G). Statistical analysis (Cochran-Mantel-Haenszel test, p < 0.03) indicates that A.F.'s performance decreases as the point lifetime decreases. In addition, although his performance was in the normal range for a point lifetime of 400 msec, he was significantly impaired for 200 and 100 msec. That A.F. could perform the task, even though he had a severe deficit on the motion coherence task, suggests that the ability to compute global motion fields is not necessary for deriving structure from motion.⁵ For example, at 11% structure he still performed well above chance (Fig. 2F), whereas in Newsome and Paré's task (Fig. 2C), the probability that he could obtain the global motion field at a coherence of 17% or less was very small (p < 0.01). However, since his structure-from-motion performance decayed with reduced structure, it is likely that global motion fields are used if available. His dependence on the point lifetime indicates that he needs longer temporal integration than normal subjects for a good recovery of structure from motion. A possible explanation for this dependence is that A.F. requires a longer integration time to measure speed (McKee and Welch 1985), and that such measurement might be useful for computing structure from motion. The same explanation might account for his poor performance on the local-speed discrimination task, because in that task, the dot lifetimes were short (133 msec). It is conceivable that for longer dot lifetimes his ability to discriminate speed might have been better.

4 Discussion
It is particularly interesting that A.F. was selectively impaired on localspeed and motion-coherence discrimination tasks, but not on structurefrom-motion tasks, because the anatomical evidence suggests that his lesions may involve the neural circuitry supporting the human homologue of the monkey's middle temporal area (MT). Physiological and behavioral studies indicate that in monkeys MT is critical for all these 'In structure-from-motion tasks (Siege1and Andersen 1988; Vaina 1988,1989; Vaina et al. 1988,1989; Husain et al. 1989) like in the one presented here, one might be suspicious that texture cues contribute to the perception of three dimensionality. To control for this possibility, we repeated the task with the frames of the display containing the cylinder presented in random order. If texture cues were available, then the subjects should have been able to use them in making their choice. As Figure 2F shows, the normal subjects performed at chance under this condition
430
L. M. Vaina, N. M. Grzywacz, and M. LeMay
computations (Zeki 1969; Maunsell and van Essen 1983). The conjecture that A.F.’s lesions bilaterally disconnect the human homologue of MT is supported by three sets of arguments. First, studies of changes in cerebral blood flow, monitored with positron emission tomography (PET) in a human observing low-contrast moving stimuli and fast flicker (Miezin et al. 1987) and dynamic random-dot patterns (Lueck et al. 1989), showed significantly increased activity in the region of the fundus of the temporal-occipital-parietal fossa, PTOF (Polyak 1957). This region was proposed as the possible correspondent of MT in humans (Miezin et al. 1987; Allman 1988; Lueck et al. 1989; Zeki 1990). (Although the resolution of PET equipment is limited compared to MRI, PET can localize the center of mass of the activity distribution well and is thus adequate for localizing MT; Allman, personal communication.) Second, comparison of the myeloarchitecture patterns of human and monkey brains indicates that Flechsig area 16 (Flechsig 1920), which is the most myelinated area in the occipital-parietal cortex, might correspond to the heavily myelinated MT in the monkey (Allman 1977; Thurston et al. 1988). It is believed that Flechsig area 16 and PTOF correspond anatomically. As discussed in Figure 1, A.F.’s lesions appear to have disrupted the pathways to and from PTOF. Third, as discussed above, A.F. lacked smooth pursuit eye movements, and it has been shown that MT is involved in the control of smooth pursuit (Komatsu and Wurtz 1988; Thurston et al. 1988).

From the putative disconnection of the human homologue of MT by A.F.’s lesions and his good performance on structure-from-motion tasks, we conjecture that MT is not necessary for structure from motion. This hypothesis seems to be at odds with Siegel and Andersen’s report (1986) that lesions of MT with ibotenic acid apparently disrupt the recovery of structure from motion in monkeys. (It is possible, however, that A.F. might have recovered his ability to derive structure from motion, since we first gave him the task 3 months after the stroke.) There is no evidence, however, that in Siegel and Andersen’s experiment the monkey performed a structure-from-motion task. The animal was required to detect a change from an unstructured to a structured motion velocity field, which portrayed a rotating surface of a cylinder covered with short-lived dots. The monkey might have responded, for example, to the ratio between the speeds at the center and corners of the display, which is higher for the cylinder than for the unstructured movie. Alternatively, the monkey could have just noticed the increase in speed of dots at the center of the display when it became structured. Although these tricks were available to A.F. as well, we have direct evidence that he could recover structure from motion: Prior to the structure-from-motion task described here, we gave him a control display, which consisted of a dynamic random-dot field portraying a rotating cylinder, and asked him to describe verbally what he saw. The only instruction we gave him was that dots
portrayed a moving object. Without any hesitation, A.F. reported seeing a rotating cylinder.

What types of computations may underlie A.F.’s ability to recover structure from motion? Four possibilities come to mind. First, we have seen that complete deterioration of the ability to compute global-motion fields for low fractions of coherence does not imply the destruction of the ability to obtain structure from motion under similar noise conditions. However, the presence of noise in the structure-from-motion task appears to have produced a deterioration of A.F.’s performance (but his scores remain consistently above chance). Thus, though not necessary, the computation of global-motion fields may help to recover structure from motion. Second, as suggested by several computational studies, local-speed measurements may also underlie this computation. The patient’s poor performance on the local-speed discrimination task rules out theories that require precise speed measurements (Longuet-Higgins and Prazdny 1981; Bruss and Horn 1983). However, rough speed estimates may help the perception of three-dimensionality, since there is a large speed ratio between the center and edges of the cylinder in our structure-from-motion task. Such a cue was available to A.F., as this ratio is larger in the cylinder than is his speed-discrimination threshold (Fig. 2B). As we speculated above, the partial deterioration of his performance with reduced point lifetime is consistent with local-speed cues being used to recover three-dimensional structure from motion. Third, another strategy for this computation is combining the positions of image features over time, which may not require any explicit instantaneous motion measurement (Ullman 1979, 1984; Grzywacz and Hildreth 1987). However, all position-based strategies that have been proposed so far would break down for finite point lifetime experiments (Todd 1984; Siegel and Andersen 1988; Husain et al. 1989). This implies that A.F. did not use any such strategy to perceive the rotating cylinder. A recent algorithm that combines a position-based strategy with surface interpolation was shown to perform well under finite point lifetime conditions (Hildreth and Ando, personal communication), and it may provide a plausible strategy. Fourth, several studies speculated that specific geometric characteristics of the image, such as boundaries (Ramachandran et al. 1988) or axes of symmetry (Braunstein and Shapiro 1987), may contribute to successful structure-from-motion processes. For example, a computational study showed that if the axis of rotation of a rotating object is known, then measuring the direction of motion and the projected distance from the axis is sufficient for recovering the structure of the object (Bobick 1983). The information necessary to apply this strategy might have been available to A.F. The axis of rotation was easy to locate in our structure-from-motion task, because the axis was always in the middle of the display. Also, the only two directions of motion in the display were 180 degrees apart. The discrimination of these directions was probably an easy task for A.F., since indirect evidence indicates that A.F. could
discriminate perfectly motion directions that differed by 37 degrees or more. This evidence comes from his performance on Hildreth’s task of boundary localization on the basis of direction of motion (Hildreth 1984).

The patient described in this paper has rather specific motion deficits (for another example of a motion-impaired patient see Zihl et al. 1983 and Hess et al. 1989). In particular, he is severely impaired on the discrimination and use of local speed and on the perception of coherent motion. Thus, his good performance on structure-from-motion tasks is intriguing, since those are fundamental motion measurements, which furthermore are exclusively used by many computational theories of structure from motion.

5 Acknowledgments

We are indebted to Ellen Hildreth, Bill Newsome, and Richard Andersen for critically reading this paper. Also, thanks are due to Thomas Kemper for useful comments on the neuroanatomical findings. We thank the Neuropsychology and the Young Stroke Units of the New England Rehabilitation Hospital for referring the patient to us and for allowing us to use the results of the neuropsychological and clinical examinations. We also thank Don Bienfang for performing the neuroophthalmological evaluation. Al Choi programmed the software for the psychophysical tasks. L. M. V. and M. L. were supported by NEI Grant EY07861-01 and by the Boston University Seed Grants for Biomedical Research, 7712-5ENG-909. N. M. G. was supported by NSF Grant BNS-8809528, by the Sloan Foundation, by a grant to Tomaso Poggio and Ellen Hildreth from the Office of Naval Research, Cognitive and Neural Systems Division, and by a grant to Tomaso Poggio, Ellen Hildreth, and Edward Adelson from the NSF.

References

Allman, J. M. 1977. Evolution of the visual system in early primates. In Progress in Psychology, Physiology, and Psychiatry, J. Sprague and A. N. Epstein, eds., pp. 1-53. Academic Press, New York.
Allman, J. M. 1988. The search for area MT in the human visual cortex. In Proceedings of the European Brain and Behavior Winter Conference: Segregation of Form and Motion, Tübingen, West Germany, J. Rauschecker and G. Moon, eds., University of Tübingen, Germany.
Bobick, A. 1983. A hybrid approach to structure from motion. In Proceedings of the ACM Interdisciplinary Workshop Motion: Representation and Perception, pp. 91-109. Association for Computing Machinery, New York.
Braunstein, M. L. 1976. Depth Perception Through Motion. Academic Press, New York.
Braunstein, M. L., and Shapiro, L. R. 1987. Detection of rigid motion in fixed-axis and variable-axis rotations. Invest. Ophthalmol. Vis. Sci. 28, 300.
Bruss, A., and Horn, B. K. P. 1983. Passive navigation. Comput. Vision Graph. Image Proc. 21, 3-20.
Dosher, B. A., Landy, M. S., and Sperling, G. 1989. Kinetic depth effect and optic flow - I. 3D shape from Fourier motion. Vision Res. 29, 1789-1813.
Efron, R. 1968. What is perception? In Boston Studies in the Philosophy of Science, Vol. 4, R. S. Cohen and M. W. Wartofsky, eds., pp. 137-173. D. Reidel, Dordrecht, The Netherlands.
Flechsig, P. 1920. Anatomie des Menschlichen Gehirns und Rückenmarks auf Myelogenetischer Grundlage. G. Thieme, Leipzig, Germany.
Grzywacz, N. M., and Hildreth, E. C. 1987. The incremental rigidity scheme for recovering structure from motion: Position vs. velocity based formulations. J. Opt. Soc. Am. A 4, 503-518.
Grzywacz, N. M., Hildreth, E. C., Inada, V. K., and Adelson, E. H. 1987. The temporal integration of 3-D structure from motion: A computational and psychophysical study. In Organization of Neural Networks, W. von Seelen, W. G. Shaw, and U. M. Leinhos, eds., pp. 239-259. VCH Publishers, Weinheim, FRG.
Hess, R. H., Baker, C. L., and Zihl, J. 1989. The "motion-blind" patient: Low-level spatial and temporal filters. J. Neurosci. 9, 1628-1640.
Hildreth, E. C. 1984. The Measurement of Visual Motion. MIT Press, Cambridge, MA.
Hildreth, E. C., Grzywacz, N. M., Inada, V. K., and Adelson, E. H. 1990. Percept. Psychophys., in press.
Husain, M., Treue, S., and Andersen, R. A. 1989. Surface interpolation in three-dimensional structure from motion. Neural Comp. 1, 324-333.
Johansson, G. 1975. Visual motion perception. Sci. Am. 232, 76-88.
Julesz, B. 1971. Foundations of Cyclopean Perception. University of Chicago Press, Chicago.
Julesz, B. 1984. Toward an axiomatic theory of preattentive vision. In Dynamic Aspects of Neocortical Functions, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 585-612. John Wiley, New York.
Koenderink, J. J., and van Doorn, A. J. 1986. Depth and shape from differential perspective in the presence of bending deformations. J. Opt. Soc. Am. A 3, 242-249.
Komatsu, H., and Wurtz, R. H. 1988. Relation of cortical areas MT and MST to pursuit eye movements. I. Localization and visual properties of neurons. J. Neurophysiol. 60, 580-603.
LeMay, M., and Kido, D. K. 1978. Asymmetries of the cerebral hemispheres on computed tomograms. J. Comput. Assist. Tomogr. 2, 471-476.
Longuet-Higgins, H. C., and Prazdny, K. 1981. The interpretation of moving retinal images. Proc. R. Soc. Lond. B 208, 385-397.
Lueck, C. J., Zeki, S., Friston, K. J., Deiber, M. P., Cope, P., Cunningham, V. J., Lammertsma, A. A., Kennard, C., and Frackowiak, R. C. J. 1989. The colour center in the cerebral cortex of man. Nature (London) 340, 386-389.
Maunsell, J. H. R., and van Essen, D. C. 1983. Functional properties of neurons
in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiol. 49, 1127-1147.
McKee, S. P., and Welch, L. 1985. Sequential recruitment in the discrimination of velocity. J. Opt. Soc. Am. A 2, 243-251.
Miezin, F. M., Fox, P. T., Raichle, M. E., and Allman, J. M. 1987. Localized responses to low contrast moving random dot patterns in human visual cortex monitored with positron emission tomography. Neurosci. Abst. 13, 631.
Newsome, W. T., and Paré, E. B. 1988. A selective impairment of motion perception following lesions of the middle temporal visual area (MT). J. Neurosci. 8, 2201-2211.
Polyak, S. 1957. The Vertebrate Visual System. University of Chicago Press, Chicago.
Ramachandran, V. S., Cobb, S., and Rogers-Ramachandran, D. 1988. Perception of 3-D structure from motion: The role of velocity gradients and segmentation boundaries. Percept. Psychophys. 44, 390-393.
Siegel, R. M., and Andersen, R. A. 1986. Motion perceptual deficits following ibotenic acid lesions of the middle temporal area (MT) in the behaving rhesus monkey. Neurosci. Abst. 12, 1183.
Siegel, R. M., and Andersen, R. A. 1988. Perception of three-dimensional structure from two-dimensional visual motion in monkey and man. Nature (London) 331, 259-261.
Thurston, S. E., Leigh, R. J., Crawford, T., Thompson, A., and Kennard, C. 1988. Two distinct deficits of visual tracking caused by unilateral lesions of cerebral cortex in humans. Ann. Neurol. 23, 266.
Todd, J. T. 1984. The perception of three-dimensional structure from rigid and nonrigid motion. Percept. Psychophys. 36, 97-103.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Ullman, S. 1984. Maximizing rigidity: The incremental recovery of 3-D structure from rigid and rubbery motion. Perception 13, 255-274.
Vaina, L. M. 1988. Effects of right parietal lobe lesions on visual motion analysis in humans. Invest. Ophthalmol. Vis. Sci. 29, 434.
Vaina, L. M. 1989. Selective impairment of visual motion interpretation following lesions of the right occipital parietal area in humans. Biol. Cybernet. 61, 347-359.
Vaina, L. M., LeMay, M., Naili, S., Amarillio, P., Bienfang, D., Montgomery, C., and Thomazeau, Y. 1988. Deficits of visual motion analysis after posterior right hemisphere lesions. Neurosci. Abst. 14, 458.
Vaina, L. M., LeMay, M., Choi, A., Kemper, T., and Bienfang, D. 1989. Visual motion analysis with impaired speed perception: Psychophysical and anatomical studies in humans. Neurosci. Abst. 15, 1256.
Vaina, L. M., LeMay, M., Bienfang, D. C., Choi, A. Y., and Nakayama, K. 1990. Intact "biological motion" and "structure from motion" perception in a patient with impaired motion mechanisms. Vis. Neurosci., in press.
Wallach, H., and O'Connell, D. N. 1953. The kinetic depth effect. J. Exp. Psychol. 45, 205-217.
Zeki, S. M. 1969. Representation of central visual fields in prestriate cortex of monkey. Brain Res. 14, 271-291.
Zeki, S. M. 1990. The form vision of achromatopsic patients. In "The Brain," Abstracts of the Cold Spring Harbor Symposium on Quantitative Biology, p. 16.
Zihl, J., von Cramon, D., and Mai, N. 1983. Selective disturbance of movement vision after bilateral brain damage. Brain 106, 313-340.
Received 16 March 90; accepted 30 July 90.
Communicated by Michael Jordan
Learning Virtual Equilibrium Trajectories for Control of a Robot Arm
Reza Shadmehr
Department of Computer Science, University of Southern California, Los Angeles, CA 90089-0782 USA
The cerebellar model articulation controller (CMAC) (Albus 1975) is applied for learning the inverse dynamics of a simulated two-joint, planar arm. The actuators were antagonistic muscles, which acted as feedback controllers for each joint. We use this example to demonstrate some limitations of the control paradigm used in earlier applications of the CMAC (e.g., Miller et al. 1987, 1990): the CMAC learns the dynamics of the arm and not those of the feedback system. We suggest an alternate approach, one in which the CMAC learns to manipulate the feedback controller’s input, producing a virtual trajectory, rather than the controller’s output, which is torque. Several experiments are performed that suggest that the CMAC learns to compensate for the dynamics of the plant, as well as the controller.
Neural Computation 2, 436-446 (1990) © 1990 Massachusetts Institute of Technology

1 Introduction

Flash (1987) has shown that in the human arm, for moderate speed movements, the spring-like behavior of the neuromuscular system is such that by manipulating an equilibrium point model of the arm, the CNS can produce relatively accurate movements even without considering the dynamics of the moving limb. In order to produce a precise trajectory, however, the applied neuromuscular activity should take into account the dynamics of the skeleton, as well as those of the attached muscles and the segmental feedback system. In this paper we show how one might learn to produce such a virtual equilibrium trajectory (Hogan 1984a). The learning paradigm is based on the cerebellar model articulation controller (CMAC) as proposed by Albus (1975), and demonstrated in the works of Miller et al. (1987, 1990) and Kraft and Campagna (1990). The CMAC is a coarse-coding technique that is implemented as a look-up table for approximating a piecewise continuous function with multiple input and output variables. In Miller et al. (1987), for example, the function was the inverse dynamics of a robot, the input variables described
the desired state of the robot, the output variables were joint torques, and the coding was done by layers of perceptrons that mapped the immense input space into a much smaller output table. The output of the network (torque) was then added to the output of a fixed-gain, error-feedback controller.

In the application that we have considered here, the feedback controllers are the antagonistic muscles attached to the joints. This example will illustrate a limitation of the control scheme proposed by Miller et al. (1987, 1990): If the controller’s response depends on something other than the error in the plant’s output, then the CMAC will never be able to compensate for the dynamics of the controller, leading to persistent errors in the plant’s behavior. We propose that this limitation can be addressed if the CMAC learns to modify the input to the controller rather than the controller’s output. In effect, the CMAC will learn the dynamics of the skeleton, as well as the muscles that act on it.
1.1 Arm Dynamics. Consider a two-joint arm, with a pair of antagonistic muscles attached to each joint (Fig. 1). In the idealized case, shoulder and elbow torques, $T = [T_s \; T_e]$, can be written as a function of
Figure 1: Schematic of the arm with the muscle-like actuators. End-effector positions in the text refer to a Cartesian coordinate system centered at the shoulder. The lengths of the upper arm and forearm are 0.25 and 0.35 m, respectively.
joint position $\theta = [\theta_s \; \theta_e]$, velocity $\dot\theta$, and acceleration $\ddot\theta$, where $m$, $l$, $s$, $r$, and $b$ represent the mass, link length, distance from the center of mass to joint, rotary inertia, and viscosity of the joint; see equations (1.1) and (1.2).
We modeled the arm described in (1.1-1.2) using parameter values in Uno et al. (1989). Forward dynamics were calculated by solving (1.1-1.2) for $\ddot\theta$. Given a torque vector $T(t)$ at some $\theta(t)$ and $\dot\theta(t)$, the resulting $\ddot\theta(t)$ was integrated to specify $\dot\theta(t + \Delta)$ and $\theta(t + \Delta)$. We assumed that the force generated by a muscle can be essentially represented by considering its dependence on muscle length, velocity of contraction, and activation rate (Hogan 1984a). The torque generated by a pair of muscles acting on the shoulder joint, for example, was defined as $T_s = K(\phi_s - \theta_s) - B\dot\theta_s$, where $K$ is joint stiffness, $B$ is the joint’s viscous coefficient, and $\phi_s$ is the equilibrium position of the joint (Flash 1987).
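The spring-damper actuator and the integration step described above can be illustrated with a minimal one-joint sketch. It is not the paper's coupled two-link model: the constant `inertia`, the step size `dt`, the number of `steps`, and the function names are assumptions chosen only for illustration, while the default `K` and `B` are the muscle parameters quoted in Section 1.2.

```python
def muscle_torque(phi, theta, theta_dot, K=40.0, B=2.0):
    """Torque of an antagonistic muscle pair: K * (phi - theta) - B * theta_dot."""
    return K * (phi - theta) - B * theta_dot


def simulate_joint(phi, theta0=0.0, inertia=0.1, dt=0.001, steps=3000):
    """Drive a single joint toward a fixed equilibrium position phi.

    inertia, dt and steps are illustrative assumptions; the arm in the
    paper is a coupled two-link system that this sketch does not reproduce.
    """
    theta, theta_dot = theta0, 0.0
    for _ in range(steps):
        torque = muscle_torque(phi, theta, theta_dot)
        theta_ddot = torque / inertia   # forward dynamics: solve for acceleration
        theta_dot += theta_ddot * dt    # integrate acceleration into velocity
        theta += theta_dot * dt         # integrate the updated velocity into position
    return theta, theta_dot


final_theta, final_velocity = simulate_joint(phi=0.5)
```

With these assumed values the joint is underdamped, so it overshoots slightly before settling at `phi`, which is qualitatively the kind of behavior the feedback-only controller shows before learning.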
1.2 Trajectory Generation. Hogan (1984b) has suggested that for reaching movements, a trajectory is planned in which the change in acceleration of the hand (jerk) over the period of movement is minimized. For our case, in moving from an initial hand position $(x_i, y_i)$ at $t = 0$ to $(x_f, y_f)$ at $t = a$, the function to be minimized is given by equation (1.3).
If motion begins and ends with zero velocity and acceleration, then it can be shown that the hand trajectory always follows a straight line, and is described by equations (1.4) and (1.5), where $\tau = t/a$. Applying inverse kinematics to (1.4-1.5) leads to a trajectory in joint coordinates. As an example, consider the case where the muscle parameters are set as follows: $K = 40$ N·m/rad and $B = 2$ N·m/sec/rad, and the objective is to move the hand from (-0.3, 0.2) to (0.1, 0.5) in $a = 0.7$ sec (see
Figure 2: Desired and observed joint trajectories before and after learning. (A) Performance of the system before learning began (RMSE = 0.1146 rad). (B) Performance of the system after the tenth learning iteration (RMSE = 0.073 rad).
Fig. 1). The desired and observed trajectories are plotted in Figure 2A. The RMS error (RMSE), averaged over 1.5 sec, was 0.1146 rad for this movement. Therefore, by simply manipulating the equilibrium of the antagonist muscles, a reasonably accurate movement was accomplished. Our objective is to minimize this error.

2 CMAC and Adaptive Control
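Since the trajectory generator of Section 1.2 supplies the equilibrium position used throughout this section, a short sketch of it may be useful. It uses the standard quintic closed form for a minimum-jerk path with zero velocity and acceleration at both ends, which is assumed here to correspond to (1.4-1.5). The function names, the elbow-angle convention (elbow angle measured relative to the upper arm, positive solution), and the sampling grid are assumptions; the link lengths come from the Figure 1 caption and the movement is the example given in Section 1.2.

```python
import numpy as np


def minimum_jerk(start, goal, duration, t):
    """Hand position on a straight-line minimum-jerk path at time(s) t."""
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    # Normalized time tau = t / a, clamped so the hand stays at the goal afterwards.
    tau = np.clip(np.asarray(t, dtype=float) / duration, 0.0, 1.0)
    # Quintic blend with zero velocity and acceleration at both ends.
    s = np.asarray(10 * tau**3 - 15 * tau**4 + 6 * tau**5)
    return start + s[..., np.newaxis] * (goal - start)


def inverse_kinematics(x, y, l1=0.25, l2=0.35):
    """Shoulder and elbow angles (rad) that place the hand at (x, y).

    Assumes a planar two-link arm with the elbow angle measured relative
    to the upper arm, and picks the solution with a positive elbow angle.
    """
    c = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    elbow = np.arccos(np.clip(c, -1.0, 1.0))
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow), l1 + l2 * np.cos(elbow))
    return shoulder, elbow


# Example movement from the text: (-0.3, 0.2) to (0.1, 0.5) in 0.7 sec.
times = np.linspace(0.0, 0.7, 8)
hand = minimum_jerk((-0.3, 0.2), (0.1, 0.5), 0.7, times)
shoulder, elbow = inverse_kinematics(hand[:, 0], hand[:, 1])
```

Sampling `hand` at a finer time grid and passing it through `inverse_kinematics` gives a joint-space equilibrium trajectory of the kind the actuators are asked to follow.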
In learning the inverse dynamics of a manipulator, the CMAC has generally been used in control structures similar to Figure 3A: Here the CMAC maps a joint trajectory into torques, and this torque is then added to the controller's output (Miller et al. 1987, 1990). Our results will show the limitations of this approach, and an alternate approach will be presented where the CMAC learns a virtual equilibrium trajectory rather than joint torques (Fig. 3B). Consider the control system of Figure 3A. At time $t$, the arm is at an observed state $Q_o(t) = [\theta(t) \; \dot\theta(t) \; \ddot\theta(t)]$. We would like the arm to be at $\phi(t)$, which is the equilibrium position generated by the minimum jerk trajectory. Actuators compare $\phi(t)$ to the currently observed position $\theta(t)$, and produce a torque $T_m(t) = K[\phi(t) - \theta(t)] - B\dot\theta(t)$. Next, a desired state $Q_d(t) = [\theta(t) \; \dot\theta(t) \; \ddot\theta_d(t)]$ is produced and given to the CMAC, which produces a torque $T_c(t)$. $T_m(t) + T_c(t)$ is applied to the arm. $\ddot\theta_d(t)$ is calculated as follows: From our current position $\theta(t)$ and velocity $\dot\theta(t)$,
[Figure 3 (panels A and B): block diagrams of the two control schemes, with the labels Trajectory Generator, Muscle-like Actuators ($T_m = K(\phi + \psi - \theta) - B\dot\theta$), and arm.]
we wish to accelerate at a rate of $\ddot\theta_d(t)$ for 50 msec to reach $\phi(t)$. A second-order polynomial approximation method was used to solve for $\ddot\theta_d(t)$. Next, the effects of the applied torques are calculated from the forward dynamics, and $Q_o(t + \Delta)$ is observed. $Q_o(t + \Delta)$ is given to the CMAC and the output $T_c(t + \Delta)$ is compared to $T_c(t) + T_m(t)$. Finally, the contents of the CMAC’s activated output cells are updated by an amount proportional to $T_c(t) + T_m(t) - T_c(t + \Delta)$.

The feedback loop serves as the teacher to the CMAC in Figure 3A: It provides reasonably appropriate torques as the equilibrium trajectory $\phi(t)$ deviates from $\theta(t)$. But since the output of this controller never vanishes (due to the $B\dot\theta$ term), the inverse dynamics map $Q_o \to T_c$ that is learned by the CMAC will always be "corrupted" by the controller’s output. In Experiment 1 we will see that after learning, the CMAC’s performance can be improved further if the feedback controller is disconnected.

The control scheme in Figure 3B is an alternate approach where the CMAC learns to control the controller, rather than augmenting its output. Here the output of the CMAC is not a torque, but a joint position vector $\psi$. $\phi + \psi$ is the virtual equilibrium position, and $\phi + \psi - \theta$ is the virtual error in position that is given to the controller. Experiments 2-4 are examples where the CMAC learns a $Q_o \to \psi$ mapping. In these experiments, the CMAC compensates for the dynamics of the plant as well as those of the controller.
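For the Figure 3A scheme, the desired acceleration and the teaching signal can be sketched as follows. The closed form below is one reading of the "second-order polynomial approximation": it assumes a constant acceleration over the 50 msec horizon, so that $\theta + \dot\theta h + \tfrac{1}{2}\ddot\theta_d h^2 = \phi$. Both function names and the numerical example are hypothetical.

```python
def desired_acceleration(theta, theta_dot, phi, horizon=0.05):
    """Constant acceleration that carries the joint from theta to phi in `horizon` sec.

    Solves theta + theta_dot * h + 0.5 * acc * h**2 = phi for acc
    (an assumed reading of the second-order polynomial approximation).
    """
    return 2.0 * (phi - theta - theta_dot * horizon) / horizon**2


def torque_teaching_error(t_cmac, t_muscle, t_cmac_next):
    """Figure 3A update signal: applied torque minus the CMAC output for the observed state."""
    return t_cmac + t_muscle - t_cmac_next


# Hypothetical state: joint at 0.2 rad moving at 1 rad/sec, equilibrium at 0.3 rad.
acc = desired_acceleration(theta=0.2, theta_dot=1.0, phi=0.3)
```

The activated output cells would then be adjusted in proportion to `torque_teaching_error(...)`, evaluated with the CMAC output for the observed state $Q_o(t + \Delta)$.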
2.1 Learning Inverse Dynamics. For the arm in Figure 1, the state vector $Q$ consists of six parameters: shoulder and elbow position (bounded within -1 and 3 rad), velocity (bounded within -20 and 20 rad/sec), and acceleration (bounded within -100 and 100 rad/sec$^2$). CMAC (Albus 1975) is one method for mapping this six-dimensional space onto a two-dimensional space, which after learning will represent joint torque in the case of the control system in Figure 3A, and virtual joint position in the case of Figure 3B.

2.1.1 Network Architecture. In the first layer of our CMAC, 400 input sensors encoded each parameter (so the input space was quantized into $400^6$ segments and, for example, joint position was encoded at a resolution of 0.01 rad). Each input sensor had an 80-unit-wide receptive field, and each segment within this input space mapped onto a second layer of cells that acted as state sensors (there were $1.25 \times 10^6$ state sensors). For each input a unique set of 80 state sensor cells would be activated. Although this encoding reduced the memory requirements for representing the input space by a factor of $10^9$, nevertheless, if each state sensor pointed to a memory location containing two real numbers, that is, a torque for each joint, then the required memory space would be 10 MB, exceeding the author’s available machine memory. Following Miller
et al.’s (1987) approach, a second mapping was done to overcome this problem: Initially, an output table containing 31,250 cells (each holding two real numbers) was constructed. Then a many-to-one mapping was performed from the state sensors to the output cells using a hashing function (an output cell had about 40 state sensors mapping onto it). For an input $Q$, the contents of all activated output cells were summed to form $T_c$.
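The two-stage mapping described above can be sketched as a small hashed look-up table. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the class and method names are hypothetical, the diagonal one-step offset between layers is an assumed layout, and Python's built-in `hash` on integer tuples stands in for the unspecified hashing function. With 400 quantization steps and 80-step-wide blocks there are about 5 blocks per dimension per layer, so 80 layers over six inputs give roughly $80 \times 5^6 = 1.25 \times 10^6$ blocks, consistent with the state-sensor count quoted above.

```python
import numpy as np


class CMAC:
    """Hashed coarse-coding table (illustrative sketch; names are hypothetical)."""

    def __init__(self, lows, highs, resolution=400, n_active=80,
                 table_size=31250, n_outputs=2):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.resolution = resolution    # quantization steps per input
        self.n_active = n_active        # receptive-field width = cells active per input
        self.table_size = table_size    # number of output cells
        self.weights = np.zeros((table_size, n_outputs))

    def _active_cells(self, x):
        # Quantize each input into `resolution` steps.
        frac = (np.asarray(x, dtype=float) - self.lows) / (self.highs - self.lows)
        q = np.clip((frac * self.resolution).astype(int), 0, self.resolution - 1)
        cells = []
        for k in range(self.n_active):
            # Layer k tiles the input with blocks n_active steps wide, shifted by k steps.
            block = tuple(int(v) for v in (q + k) // self.n_active)
            # Many-to-one hash from the (layer, block) state sensor to an output cell.
            cells.append(hash((k,) + block) % self.table_size)
        return cells

    def predict(self, x):
        # The output is the sum of the contents of all activated output cells.
        return self.weights[self._active_cells(x)].sum(axis=0)

    def update(self, x, target, rate=0.5):
        # Spread a fraction of the output error evenly over the activated cells.
        cells = self._active_cells(x)
        error = np.asarray(target, dtype=float) - self.weights[cells].sum(axis=0)
        np.add.at(self.weights, cells, rate * error / self.n_active)
        return error


# Toy usage with two inputs and one output.
net = CMAC(lows=[-1.0, -1.0], highs=[3.0, 3.0], n_outputs=1)
net.update([1.0, 1.0], [2.0], rate=1.0)
nearby = net.predict([1.015, 1.015])   # shares most cells with the trained input
far = net.predict([-0.9, 2.9])         # shares cells only through hash collisions
```

Because neighboring inputs share most of their active cells, a single update generalizes locally, while distant inputs are affected only by occasional hash collisions.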
2.1.2 Experiment 1. Here the CMAC attempted to learn the torque sequence necessary for moving the arm along the same trajectory as that shown in Figure 2A. We began with the output cells all set to zero. At this stage, the RMSE was 0.1146 rad. On the first attempt with the learning scheme in place, the RMSE was reduced to 0.0810. By the tenth such attempt, the RMSE was at 0.073 (Fig. 2B). Note that the overshoot and oscillatory behavior of the feedback controller have been eliminated, yet the arm still lags the desired trajectory. To see the contribution of the CMAC as compared to the controller for this movement, we plotted the shoulder torque generated by each system before (Fig. 4A) and after (Fig. 4B) learning. In Figure 4A, the applied torque is the response of a damped spring, and the CMAC’s contribution is zero because its contents have been initialized to zero. After the tenth iteration (Fig. 4B), the CMAC’s output has begun to dominate the output of the feedback controller. At this point, the controller was removed
Figure 4: Torque contribution of the controller (solid line) and CMAC (dotted, more "noisy" line) to the shoulder joint during movement of Figure 2, using the control scheme of Figure 3A. (A) Before learning: All of the torque is due to the controller. (B) After the tenth learning iteration: Most of the torque is due to the CMAC.
from the loop and the same trajectory was repeated. This improved the performance of the CMAC by a factor of 5 (RMSE = 0.013 rad). We concluded that although the CMAC had essentially learned the inverse dynamics of the plant, it could not compensate for the effects of the controller because it had not witnessed the input/output behavior of the lumped system.

2.2 Learning Virtual Trajectories. We investigated the utility of the control structure depicted in Figure 3B by performing three experiments. In all cases, the CMAC was identical to the one used in Experiment 1. The learning scheme was as follows: At time $t$, for a virtual position error $\phi(t) + \psi(t) - \theta(t)$, an observed state vector $Q_o(t + \Delta)$ was produced. $Q_o(t + \Delta)$ was given to the CMAC, which produced $\psi(t + \Delta)$, and the activated output cells were updated by an amount proportional to $\psi(t) + \phi(t) - \theta(t) - \psi(t + \Delta)$.
2.2.1 Experiment 2. The movement in Figure 2A was repeated with the control scheme of Figure 3B. Initially, the RMSE was 0.1146 rad. After the first iteration, RMSE fell to 0.0704. Following the tenth iteration, RMSE was at 0.0086 rad. We have plotted the virtual equilibrium trajectory $\psi(t) + \phi(t)$ along with the equilibrium trajectory $\phi(t)$ in Figure 5. Intuitively, one would expect that in order to accelerate the arm along a particular trajectory, a larger than observed positional error would have to be presented to the spring-like muscles in order to start the movement. Using the same analogy, an early reversal in joint positional error terms needs to be implemented in order to brake the system at some position, and not have it oscillate there. Note that in Figure 5, the virtual trajectory is leading the equilibrium trajectory, and then braking and finally clamping it at the goal position.

Figure 5: The equilibrium trajectory (solid lines) and virtual equilibrium trajectory (dotted line, the "noisy" signal) for the shoulder and elbow after the tenth learning iteration of the CMAC with the control structure of Figure 3B.
2.2.2 Experiment 3. Here we tested the system on a much faster movement. In Figure 6A we have plotted the response of the system before learning began (RMSE = 0.1571 rad). The performances of the CMAC after the first, tenth, and fortieth iterations are plotted in Figures 6B, C, and D, respectively. The RMSEs associated with these iterations were 0.1481, 0.0312, and 0.0089 rad.

2.2.3 Experiment 4. Can the system of Figure 3B learn the dynamics of the arm for a wide range of movements in a nonrepeating protocol? To investigate this, the initial position of the hand was fixed at (0.0, 0.3), and the target position was randomly selected along the perimeter of a circle with radius of 0.25. The movement time was randomly selected within the range of 0.5 to 1.0 sec. After 500 such center-out movements, the average RMSE for the next 10 movements was 0.0089 rad, compared to an average RMSE of 0.1029 rad for the case where the CMAC was a tabula rasa.

3 Conclusions
In this work we have been concerned with the problem of learning inverse dynamics of a plant and its controller. We used the example of a robot arm with muscle-like actuators to illustrate the limitations of the learning/control system of Figure 3A. In Experiment 1 it was shown that after learning, the performance of the CMAC could be further improved if the control loop was disconnected and the CMAC allowed to run in a feedforward mode. It was suggested that instead of learning to modify the controller’s output, the CMAC should be set up to augment the controller’s input, therefore learning a virtual equilibrium trajectory (Experiments 2-4), rather than joint torques. The basic principle is to control a controlled system by supervised learning on the lumped system’s input/output relationship. Recent results on use of the CMAC have shown it to be a particularly useful approach for rapid learning of nonlinear functions (as compared to backpropagation, for example). This is because (1) the network uses local
Figure 6: Learning a fast movement using the control scheme of Figure 3B. (A) Before learning began (RMSE = 0.1571 rad). (B) The first learning iteration (RMSE = 0.1481). (C) The tenth iteration (RMSE = 0.0312). (D) The fortieth iteration (RMSE = 0.0089).
representation of the input space, thus requiring evaluation and modification of only a few output cells, and (2) the learning is a quadratic optimization procedure, avoiding the problem of local minima. The mapping from the input space onto the output cells is the key to the CMAC: The idea is to not only activate a unique set of output cells for each input vector, but to do so in such a way that similar input states share a large number of output cells, while far-away input states share no output cells. Recently, Moody (1989) has suggested two improvements to this basic network architecture. These include the use of a neighborhood function with graded response to overcome the potential problem of response discontinuity over state boundaries, and a multiresolution
interpolation scheme where a hierarchy of CMACs work in parallel to provide high resolution along with good generalization abilities.

Acknowledgments

The author is supported by an IBM Graduate Fellowship in Computer Science. This work has been supported by a Grant-in-Aid of Research from the Sigma-Xi Foundation, and the NIH Grant 1R01-NS24926 (Prof. Michael Arbib, Principal Investigator). I am most grateful for the help of Prof. Tom Miller in constructing the CMAC.

References

Albus, J. S. 1975. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. ASME J. Dynamic Syst. Meas. Contr. 97, 220-227.
Flash, T. 1987. The control of hand equilibrium trajectory in multi-joint arm movements. Biol. Cybernet. 57, 257-274.
Hogan, N. 1984a. Adaptive control of mechanical impedance by coactivation of antagonist muscles. IEEE Trans. Autom. Contr. AC-29(8), 681-690.
Hogan, N. 1984b. An organizing principle for a class of voluntary movements. J. Neurosci. 4(11), 2745-2754.
Kraft, L. G., and Campagna, D. P. 1990. A comparison between CMAC neural network control and two traditional adaptive control systems. IEEE Control Syst. Magazine 10(3), 36-43.
Miller, W. T., Glanz, F. H., and Kraft, L. G. 1987. Application of a general learning algorithm to the control of robotic manipulators. Int. J. Robotics Res. 6(2), 84-98.
Miller, W. T., Hewes, R. P., Glanz, F. H., and Kraft, L. G. 1990. Real-time dynamic control of an industrial manipulator using a neural-network-based learning controller. IEEE Trans. Robotics Automation 6(1), 1-9.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., pp. 29-39. Morgan Kaufmann, San Mateo, CA.
Uno, Y., Kawato, M., and Suzuki, R. 1989. Formation and control of optimal trajectory in human multijoint arm movement: Minimum torque change model. Biol. Cybernet. 61, 89-101.
Received 1 February 90; accepted 6 August 90.
Communicated by James McClelland
Discovering the Structure of a Reactive Environment by Exploration
Michael C. Mozer
Department of Computer Science and Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0430 USA
Jonathan Bachrach
Department of Computer and Information Science, University of Massachusetts, Amherst, MA 01003 USA
Consider a robot wandering around an unfamiliar environment, performing actions and observing the consequences. The robot's task is to construct an internal model of its environment, a model that will allow it to predict the effects of its actions and to determine what sequences of actions to take to reach particular goal states. Rivest and Schapire (1987a,b; Schapire 1988) have studied this problem and have designed a symbolic algorithm to strategically explore and infer the structure of "finite state" environments. The heart of this algorithm is a clever representation of the environment called an update graph. We have developed a connectionist implementation of the update graph using a highly specialized network architecture. With backpropagation learning and a trivial exploration strategy - choosing random actions - the connectionist network can outperform the Rivest and Schapire algorithm on simple problems. Our approach has additional virtues, including the fact that the network can accommodate stochastic environments and that it suggests generalizations of the update graph representation that do not arise from a traditional, symbolic perspective.
1 Introduction
Neural Computation 2, 447-457 (1990) © 1990 Massachusetts Institute of Technology
Consider a robot wandering around an unfamiliar environment, performing actions and observing the consequences. With sufficient exploration, the robot should be able to construct an internal model of the environment, a model that will allow it to predict the effects of its actions and to determine what sequences of actions must be taken to reach particular goal states. We describe a connectionist network that accomplishes this task for environments that can be characterized by a finite-state automaton (FSA). In each environment, the robot has a set of discrete actions it can execute to move from one environmental state to another. At each environmental state, a set of binary-valued sensations can be detected by the robot. To illustrate the concepts and methods in our work, we use as an extended example a simple environment, the n-room world (Fig. 1).
Figure 1: The three-room world. This environment consists of three rooms arranged in a circular chain. Each room is connected to the two adjacent rooms. In each room is a light bulb and light switch. The robot can sense whether the light in the room where it currently stands is on or off. The robot has three possible actions: move to the next room down the chain (D), move to the next room up the chain (U), and toggle the light switch in the current room (T).
2 Modeling the Environment
If the FSA corresponding to the n-room world is known, the sensory consequences of any sequence of actions can be predicted. Further, the FSA can be used to determine a sequence of actions required to obtain a certain goal state. Although one might try developing an algorithm to learn the FSA, Schapire (1988) presents several arguments against doing so. Most important is that the FSA often does not capture structure
inherent in the environment. Rather than trying to learn the FSA directly, Rivest and Schapire (1987a,b; Schapire 1988) suggest learning an alternative representation called an update graph. The advantage of the update graph is that in environments with regularities, the number of nodes in the update graph can be much smaller than in the FSA (e.g., 2n versus 2^n for the n-room world). Rivest and Schapire's formal definition of the update graph is based on the notion of tests that can be performed on the environment, and the equivalence of different tests. In this section, we present an alternative, more intuitive view of the update graph that facilitates a connectionist interpretation. To model the three-room world, the essential knowledge required is the status of the lights in the current room (CUR), the next room up from the current room (UP), and the next room down (DOWN). Assume the update graph has a node for each of these environmental variables, and that each node has an associated value indicating whether the light in the particular room is on or off. If we know the values of the variables in the current environmental state, what will their new values be after taking some action, say U? The new value of CUR becomes the previous value of UP; the new value of DOWN becomes the previous value of CUR; and in the three-room world, the new value of UP becomes the previous value of DOWN. As depicted in Figure 2a, this action thus results in shifting values around in the three nodes. This makes sense because moving up does not affect the status of any light, but it does alter the robot's position with respect to the three rooms. Figure 2b shows the analogous flow of information for the action D. Finally, the action T should cause the status of the current room's light to be complemented while the other two rooms remain unaffected (Fig. 2c).
In Figure 2d, the three sets of links from Figure 2a-c have been superimposed and labeled with the associated action. One final detail: the Rivest and Schapire update graph formalism does not make use of the "complementation" link. To avoid it, each node may be split into two values, one representing the status of the light in a room and the other its complement (Fig. 2e). Toggling thus involves exchanging the values of CUR and $\overline{\mathrm{CUR}}$. Just as the values of CUR, UP, and DOWN must be shifted for the actions U and D, so must their complements. Given the update graph in Figure 2e and the value of each node for the current environmental state, the result of any sequence of actions can be predicted simply by shifting values around in the graph. Thus, as far as predicting the input/output behavior of the environment is concerned, the update graph serves the same purpose as the FSA. A defining and nonobvious (from the current description) property of an update graph is that each node has exactly one incoming link for each action. We call this the one-input-per-action constraint. For example, CUR gets input from $\overline{\mathrm{CUR}}$ for the action T, from UP for U, and from DOWN for D.
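The value-shifting view of the update graph described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the node names `nCUR`, `nUP`, `nDOWN` (standing for the complement nodes), the `ThreeRoomWorld` class, and the convention that "up" means room index +1 are all assumptions introduced here. The sketch checks that shifting values in the graph predicts the same sensation as a direct simulation of the environment.

```python
import random

# Six nodes of the three-room update graph; the "n" prefix marks a
# complement node (a naming assumption made for this sketch).
NODES = ["CUR", "UP", "DOWN", "nCUR", "nUP", "nDOWN"]

# SOURCE[action][node] is the single node whose old value becomes the new
# value of `node` (the one-input-per-action constraint).
SOURCE = {
    "U": {"CUR": "UP", "UP": "DOWN", "DOWN": "CUR",
          "nCUR": "nUP", "nUP": "nDOWN", "nDOWN": "nCUR"},
    "D": {"CUR": "DOWN", "UP": "CUR", "DOWN": "UP",
          "nCUR": "nDOWN", "nUP": "nCUR", "nDOWN": "nUP"},
    "T": {"CUR": "nCUR", "nCUR": "CUR", "UP": "UP", "DOWN": "DOWN",
          "nUP": "nUP", "nDOWN": "nDOWN"},
}


def step_graph(values, action):
    """Shift node values along the links labeled with `action`."""
    src = SOURCE[action]
    return {node: values[src[node]] for node in NODES}


class ThreeRoomWorld:
    """Direct simulation of the environment (hypothetical helper class)."""

    def __init__(self, lights, position=0):
        self.lights = list(lights)   # 1 = light on, 0 = light off
        self.position = position     # index of the current room

    def act(self, action):
        """Perform an action and return the sensed light in the current room."""
        if action == "U":
            self.position = (self.position + 1) % 3
        elif action == "D":
            self.position = (self.position - 1) % 3
        elif action == "T":
            self.lights[self.position] = 1 - self.lights[self.position]
        else:
            raise ValueError("unknown action: %r" % action)
        return self.lights[self.position]


def graph_values_from_world(world):
    """Read the true node values off the environment state."""
    p = world.position
    cur = world.lights[p]
    up = world.lights[(p + 1) % 3]
    down = world.lights[(p - 1) % 3]
    return {"CUR": cur, "UP": up, "DOWN": down,
            "nCUR": 1 - cur, "nUP": 1 - up, "nDOWN": 1 - down}


def check_random_walk(n_steps=200, seed=0):
    """Return True if the graph predicts every sensation on a random walk."""
    rng = random.Random(seed)
    world = ThreeRoomWorld([1, 0, 0])
    values = graph_values_from_world(world)
    for _ in range(n_steps):
        action = rng.choice("UDT")
        values = step_graph(values, action)
        if values["CUR"] != world.act(action):
            return False
    return True
```

Storing one source node per (action, node) pair makes the one-input-per-action constraint hold by construction; a dictionary is used here instead of a weight matrix purely for readability.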
Figure 2: (a) Links between nodes indicating the desired information flow on performing the action U. CUR represents the status of the lights in the current room, UP the status of the lights in the next room up, and DOWN the status of the lights in the next room down. (b) Links between nodes indicating the desired information flow on performing the action D. (c) Links between nodes indicating the desired information flow on performing the action T. The "-" on the link from CUR to itself indicates that the value must be complemented. (d) Links from the three separate actions superimposed and labeled by the action. (e) The complementation link can be avoided by adding a set of nodes that represents the complements of the original set. This is the update graph for the three-room world.
3 The Rivest and Schapire Algorithm
Rivest and Schapire have developed a symbolic algorithm (hereafter the RS algorithm) to strategically explore an environment and learn its update graph representation. The RS algorithm formulates explicit hypotheses about regularities in the environment and tests these hypotheses one or a relatively small number at a time. As a result, the algorithm may not
make full use of the environmental feedback obtained. We thus felt it worthwhile to consider an alternative approach using a connectionist network, called SLUG, that performs subsymbolic learning of update graphs. In the remainder of this paper, we summarize our work on SLUG. For further details and additional results, see Mozer and Bachrach (1990).
4 Viewing the Update Graph as a Connectionist Network
What should SLUG look like following training if it is to behave as an update graph? Start by assuming one unit in SLUG for each node in the update graph. The activity level of the unit represents the truth value associated with the update graph node. Some of these units serve as "outputs" of SLUG. For example, in the three-room world, SLUG's output is the unit that represents the status of the current room. In other environments, there may be several sensations, in which case several output units are required. What is the analog of the labeled links in the update graph? The labels indicate that values are to be sent down a link when the corresponding action occurs. Thus, we require a set of links - or connection weights - for every possible action. To predict the consequences of a particular action, the weights for that action are inserted into SLUG and activity is propagated through the connections. Thus, SLUG is dynamically rewired contingent on the current action. The effect of activity propagation should be that the new activity of a unit is the previous activity of some other unit. A linear activation function is sufficient to achieve this:
$$x(t) = W_{a(t)}\, x(t-1)$$
where $a(t)$ is the action selected at time $t$, $W_{a(t)}$ is the weight matrix associated with this action, and $x(t)$ is the activity vector that results from taking action $a(t)$. To satisfy the one-input-per-action constraint on the update graph, each row of each weight matrix should have connection strengths of zero except for one value which is 1. Assuming this is true, the activation rule will cause activity values to be copied around the network.
5 Training SLUG to Be an Update Graph
Having described a connectionist network that can behave as an update graph, we turn to the procedure used to learn its connection strengths. For expository purposes, assume that the number of units in the update graph is known in advance. (This is not necessary, as we discuss below.) SLUG starts off with this many units, s of which are set aside to represent the sensations. These s units are the "output" units of the network; the remainder are "hidden" units. A set of weight matrices, $\{W_a\}$, is
constructed - one per action - and initialized to random values; at the completion of learning, these matrices will represent the update graph connectivity. To train SLUG, random actions are performed on the environment and the resulting sensations are compared with those predicted by SLUG. The mismatch between observed and predicted sensations provides an error measure, and the weight matrices can be adjusted to minimize this error. Performing gradient descent in the weight space directly is inadequate, however, because the resulting weights are unlikely to satisfy the one-input-per-action constraint. We achieve this property by performing gradient descent not in the $\{W_a\}$, but in an underlying parameter space, $\{V_a\}$, from which the weights are derived using a normalized exponential transform:
$$w_{aij} = \frac{e^{v_{aij}/T}}{\sum_k e^{v_{aik}/T}}$$
where $w_{aij}$ is the strength of connection to unit $i$ from unit $j$ for action $a$, $v_{aij}$ is the corresponding underlying parameter, and $T$ is a constant. This approach permits unconstrained gradient descent in $\{V_a\}$ while constraining the $w_{aij}$ to nonnegative values and
$$\sum_j w_{aij} = 1.$$
(This approach was suggested by the work of Bridle 1990, Durbin 1990, and Rumelhart 1989.) By gradually lowering $T$ over time, the solution can be further constrained so that all but one incoming weight to a unit approach zero. In practice, we have found that lowering $T$ is unnecessary because solutions discovered with a fixed $T$ always satisfy the one-input-per-action constraint. At each time step $t$, the training procedure consists of the following sequence of events:
1. An action, $a(t)$, is selected at random.
2. The weight matrix for that action, $W_{a(t)}$, is used to compute the activities at $t$, $x(t)$, from the previous activities $x(t-1)$.
3. The selected action is performed on the environment and the resulting sensations are observed.
4. The observed sensations are compared with the sensations predicted by SLUG (i.e., the activities of units chosen to represent the sensations) to compute a measure of error.
5. The backpropagation "unfolding-in-time" procedure (Rumelhart et al. 1986) is used to compute the derivative of the error with respect to weights at the current and earlier time steps, $W_{a(t-i)}$, for $i = 0, \ldots, \tau - 1$.
6. The error gradient in terms of the $\{W_a\}$ is transformed into a gradient in terms of the $\{V_a\}$, and the $\{V_a\}$ are updated.
7. The temporal record of unit activities, $x(t-i)$ for $i = 0, \ldots, \tau$, which is maintained to permit backpropagation in time, is updated to reflect the new weights. This involves recomputing the forward flow of activity from time $t - \tau$ to $t$ for the hidden units. (The output units are unaffected because their values are forced, as described in the next step.)
8. The activities of the output units at time $t$, which represent the predicted sensations, are replaced by the observed sensations. This implements a form of teacher forcing (Williams and Zipser 1990).
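The normalized exponential transform and the gradient conversion of step 6 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names `weights_from_params` and `grad_wrt_params` are hypothetical, a single action's matrix is handled at a time, and the chain-rule expression $\partial E/\partial v_{ij} = w_{ij}\,(g_{ij} - \sum_k g_{ik} w_{ik})/T$, with $g = \partial E/\partial W$, is derived here from the transform rather than quoted from the paper.

```python
import math


def weights_from_params(V, T=1.0):
    """Row-wise normalized exponential:
    w[i][j] = exp(v[i][j] / T) / sum_k exp(v[i][k] / T).
    """
    W = []
    for row in V:
        m = max(row)  # subtracting the row maximum leaves the ratio unchanged
        exps = [math.exp((v - m) / T) for v in row]
        total = sum(exps)
        W.append([e / total for e in exps])
    return W


def grad_wrt_params(W, dE_dW, T=1.0):
    """Convert a gradient with respect to W into one with respect to V.

    Chain rule for the transform above:
    dE/dv[i][j] = w[i][j] * (g[i][j] - sum_k g[i][k] * w[i][k]) / T
    """
    dE_dV = []
    for w_row, g_row in zip(W, dE_dW):
        weighted_mean = sum(g * w for g, w in zip(g_row, w_row))
        dE_dV.append([w * (g - weighted_mean) / T
                      for g, w in zip(g_row, w_row)])
    return dE_dV
```

Each row of the result of `weights_from_params` is positive and sums to 1, and a small `T` pushes each row toward a single dominant entry, which is the limiting behavior the one-input-per-action constraint requires.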
One parameter of training is the amount of temporal history, $\tau$, to consider. We have found that, for a particular problem, error propagation beyond a certain critical number of steps does not improve learning performance. In the results described below, we set $\tau$ for a particular problem to what appeared to be a safe limit: one less than the number of nodes in the update graph solution of the problem. To avoid the issue of selecting a $\tau$, the on-line recurrent network training algorithm of Williams and Zipser (1989) could be used.
6 Results
Figure 3 shows the weights learned by SLUG for the three-room world at three stages of learning. The "step" refers to the number of actions SLUG has taken. By step 2000, the connectivity pattern is identical to that of the update graph of Figure 2e. The rightmost three units form a counterclockwise loop for the action U, and clockwise for D; the leftmost three units do the reverse. For the action T, the inner two units exchange values while the outer four units hold their current values. In addition to learning the update graph connectivity, SLUG has simultaneously learned the correct activity values associated with each node for the current state of the environment. Armed with this information, SLUG can predict the outcome of any sequence of actions. Because the learned weights and activities are boolean, SLUG can predict infinitely far into the future with no degradation in performance. (This is in striking contrast to other connectionist approaches to learning finite-state automata, e.g., Servan-Schreiber et al. 1988.) Now for the bad news: SLUG does not converge for every set of random initial weights, and when it does, it requires on the order of 2000 steps. However, when the weights are unconstrained (i.e., $W_a = V_a$), SLUG converges without fail and in about 300 steps. We consider why constraining the weights is harmful and suggest several remedies in Mozer and Bachrach (1990). Without constraints, solutions discovered by SLUG do not satisfy the RS update graph formalism. The primary
Figure 3: SLUG's weights at three stages of learning (steps 0, 1000, and 2000) for the three-room world. Each large diagram (with a light gray background) represents the weights corresponding to one of the three actions. Each small diagram contained within a large diagram (with a dark gray background) represents the connection strengths feeding into a particular unit for a particular action. There are six units, hence six small diagrams within each large diagram. The units have been manually arranged for this example so that when learning reaches completion, the six units correspond to the six nodes of Figure 2e. A white square in a particular position of a small diagram represents the strength of connection from the unit in the homologous position in the large diagram to the unit represented by the small diagram. The area of the square is proportional to the connection strength.
Environment            RS Algorithm    SLUG
Little Prince World    200             91
Three-Room World       408             298
Four-Room World        1,388           1,308
Car Radio World        27,695          8,167
Checkerboard World     96,041          8,192
32-Room World          52,436          Fails
5 x 5 Grid World       583,195         Fails
Table 1: Number of Steps Required to Learn Update Graph.
disadvantage of these solutions is that they are not readily interpreted: the weight matrices contain a collection of positive and negative weights of varying magnitudes. In the case of the three-room world, analysis shows that SLUG has discovered the notion of complementation links of the sort shown in Figure 2d. With the use of complementation links, only three units are required, not six. Consequently, the three unnecessary units are either cut out of the solution or encode information redundantly. Table 1 compares the performance of SLUG without constrained weights and the RS algorithm for several environments. Performance is measured by the median number of actions the robot must take before it is able to predict the outcome of subsequent actions. In simple environments, SLUG can outperform the RS algorithm. This result is quite surprising when considering that the action sequence used to train SLUG is generated at random, in contrast to the RS algorithm, which involves a strategy for exploring the environment. We conjecture that SLUG does as well as it does because it considers and updates many hypotheses in parallel at each time step. In complex environments, however, SLUG does poorly. By "complex," we mean that the number of nodes in the update graph is quite large and the number of distinguishing environmental sensations is relatively small (e.g., a 32-room world). An intelligent exploration strategy seems necessary in this case: random actions will take too long to search the state space. Our ongoing work addresses this issue.¹
¹ One approach we are considering is to have SLUG select actions or action sequences that will result in maximal uncertainty in the prediction - i.e., predictions as distant from boolean states as possible. This approach is based on the work of Cohn et al. (1990).
Beyond the potential speedups offered by connectionist learning algorithms, the connectionist approach has other benefits.
• Our studies have shown that SLUG is insensitive to prior knowledge of the number of nodes in the update graph being learned. As long as SLUG is given at least the minimal number of units required, the presence of additional units does not impede learning. SLUG either disconnects the unnecessary units from the graph or uses multiple units to encode information redundantly. In contrast, the RS algorithm requires an upper bound on the update graph complexity, and performance degrades significantly if the upper bound is not tight.
• SLUG is able to accommodate environments with unreliable sensations. Like most connectionist systems, SLUG's performance degrades gracefully in the presence of noise. While the original RS algorithm was unable to handle noise, recent variants have overcome this limitation (Schapire, personal communication).
• During learning, SLUG continually makes predictions about what sensations will result from a particular action, and these predictions improve with experience. The RS algorithm cannot make predictions until learning is complete; it could perhaps be modified to do so, but there would be an associated cost.
• Treating the update graph as matrices of connection strengths has suggested generalizations of the update graph formalism that do not arise from a more traditional analysis. First, there is the fairly direct extension of allowing complementation links. Second, because SLUG is a linear system, any rank-preserving linear transform of the weight matrices will produce an equivalent system, but one that does not have the local connectivity of the update graph. SLUG's linearity also allows us to use tools of linear algebra to analyze the resulting connectivity matrices.
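The equivalence under a rank-preserving transform mentioned in the last item can be made explicit with a short worked derivation. The symbols $Q$ (any invertible matrix) and $y(t)$ (the transformed activity vector) are assumptions introduced here for illustration; they do not appear in the original text.

```latex
% Assumption: Q is any invertible (rank-preserving) matrix and y(t) = Q x(t).
\begin{align*}
x(t) &= W_{a(t)}\, x(t-1), \\
y(t) &= Q\, x(t) = Q\, W_{a(t)}\, x(t-1)
      = \bigl(Q\, W_{a(t)}\, Q^{-1}\bigr)\, y(t-1).
\end{align*}
% The transformed system uses weights W'_a = Q W_a Q^{-1}; the original
% activities, and hence the predicted sensations, are recovered as
% x(t) = Q^{-1} y(t).
```

Because $Q W_a Q^{-1}$ is in general a dense matrix, the transformed network no longer simply copies values from one unit to another, which is the loss of local connectivity noted above.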
These benefits indicate that the connectionist approach to the environment-modeling problem is worthy of further study. We do not wish to claim that the connectionist approach supersedes the impressive work of Rivest and Schapire. However, it offers complementary strengths and alternative conceptualizations of the learning problem.
Acknowledgments
Our thanks to Clayton Lewis, Rob Schapire, Paul Smolensky, and Rich Sutton for helpful discussions and comments. This work was supported by National Science Foundation Grants IRI-9058450 and ECS-8912623, Grant 90-21 from the James S. McDonnell Foundation, and Grant AFOSR89-0526 from the Air Force Office of Scientific Research, Bolling AFB.
References
Bridle, J. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Cohn, D., Atlas, L., Ladner, R., El-Sharkawi, M., Marks II, R., Aggoune, M., and Park, D. 1990. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 566-573. Morgan Kaufmann, San Mateo, CA.
Durbin, R. 1990. Principled competitive learning in both unsupervised and supervised networks. Poster presented at the conference on Neural Networks for Computing, Snowbird, UT.
Mozer, M. C., and Bachrach, J. 1990. SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learn., accepted for publication.
Rivest, R. L., and Schapire, R. E. 1987a. Diversity-based inference of finite automata. Proc. Twenty-Eighth Ann. Symp. Foundations Comput. Sci., pp. 78-87.
Rivest, R. L., and Schapire, R. E. 1987b. A new approach to unsupervised learning in deterministic environments. Proc. Fourth Int. Workshop Machine Learn., pp. 364-375.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume I: Foundations, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press/Bradford Books, Cambridge, MA.
Rumelhart, D. E. 1989. Specialized architectures for backpropagation learning. Paper presented at the conference on Neural Networks for Computing, April, Snowbird, UT.
Schapire, R. E. 1988. Diversity-based inference of finite automata. Unpublished master's thesis, Massachusetts Institute of Technology, Cambridge, MA.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J. L. 1988. Encoding sequential structure in simple recurrent networks. Tech. Rep. CMU-CS-88-183. Carnegie-Mellon University, Department of Computer Science, Pittsburgh, PA.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Williams, R. J., and Zipser, D. 1990. Gradient-based learning algorithms for recurrent connectionist networks. In Backpropagation: Theory, Architectures, and Applications, Y. Chauvin and D. E. Rumelhart, eds. Erlbaum, Hillsdale, NJ.
Received 9 February 90; accepted 5 July 90.
Communicated by Christoph von der Malsburg
Computing with Arrays of Coupled Oscillators: An Application to Preattentive Texture Discrimination
Pierre Baldi
Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA
Ronny Meir
Division of Chemistry, California Institute of Technology, Pasadena, CA 91125 USA
Recent experimental findings (Gray et al. 1989; Eckhorn et al. 1988) seem to indicate that rapid oscillations and phase-lockings of different populations of cortical neurons play an important role in neural computations. In particular, global stimulus properties could be reflected in the correlated firing of spatially distant cells. Here we describe how simple coupled oscillator networks can be used to model the data and to investigate whether useful tasks can be performed by oscillator architectures. A specific demonstration is given for the problem of preattentive texture discrimination. Texture images are convolved with different sets of Gabor filters feeding into several corresponding arrays of coupled oscillators. After a brief transient, the dynamic evolution in the arrays leads to a separation of the textures by a phase labeling mechanism. The importance of noise and of long range connections is briefly discussed.
1 Introduction
Neural Computation 2, 458-471 (1990) © 1990 Massachusetts Institute of Technology
Most of the current wave of interest in oscillations and their possible role in neural computations stems from the recent series of experiments by Gray et al. (1989) and Eckhorn et al. (1988). In these experiments, performed on anesthetized and alert (Gray et al. 1989b) cats, moving light bars are presented as visual stimuli and neuronal responses are extracellularly recorded from several electrodes implanted in the first visual cortical areas (mostly area 17). The first observation is that groups of neurons, within a cortical column, tend to engage in stimulus specific oscillatory responses in the 40-60 Hz range. The second and most striking finding is the existence of transient intercolumnar zero phase-locking occurring over distances of several millimeters (at least up to 7 mm) and reflecting global stimulus properties. For instance, elongated or collinear moving light bars of specific orientation elicit zero phase-locked periodic
responses in separated columns with nonoverlapping receptive fields. In contrast, uncorrelated oscillations are observed in the case of similar but noncollinear stimuli. One possible interpretation of these results has been advanced in the form of the so-called labeling hypothesis. In the labeling hypothesis, temporal characteristics such as the phases (and/or frequencies) of pools of oscillating neurons are used to encode information, in particular to label various features of an object by synchronous activity of the corresponding feature extracting neurons (von der Malsburg 1981). Phase-lockings then serve to link associated features in different parts of the visual field (see also von der Malsburg and Schneider 1986) and, in particular, to represent the coherency of an object. These results and hypothesis suggest that large arrays of coupled oscillating elements, where computations are carried by the transient spatial organization of phase and frequency relationships, may be an important component of neural architectures. If so, one natural area where oscillator networks could be of use is in early vision processing tasks. Clearly, the two central issues are the origin of the distant synchronizations observed by Gray et al. and whether a useful role for oscillations exists (for instance in the form suggested by the labeling hypothesis), in natural as well as artificial neural systems. In what follows, we first describe how simple coupled oscillator models can be used to account for the experimental data and investigate the more general computational issues. We then apply these concepts to a specific problem in early vision: preattentive texture discrimination. We demonstrate how in a simple oscillator architecture, textures in an image can be separated by the synchronization of populations of oscillators corresponding to each texture.
2 Coupled Oscillator Models
A classical way of modeling coupled oscillators is based on the observation that, once relaxed to its limit cycle, one oscillator can be described by a single parameter: its phase $\theta_i$ along the cycle. The behavior of a population of $n$ interacting oscillators can then be approximated by the system
$$\frac{d\theta_i}{dt} = \omega_i(t) + f_i(\theta_1, \ldots, \theta_n) \tag{2.1}$$
where the variables $\omega_i(t)$ represent the internal frequencies and/or the external driving inputs, when their action is independent of the current phases. The functions $f_i$ take into account the coupling among the oscillators assuming that such effects depend only on the phases. The oscillators are located at the vertices of a graph of interactions and, typically, the coupling functions $f_i$ are symmetric of the form $\sum_{j \in V_i} f(\theta_j - \theta_i)$, where $V_i$ is the set of vertices $j$ adjacent to $i$ and $f$ is an odd periodic function such as $f(\theta) = \sin(\theta)$. Thus, assuming a constant coupling strength $K$, (2.1) typically becomes
$$\frac{d\theta_i}{dt} = \omega_i + K \sum_{j \in V_i} \sin(\theta_j - \theta_i) \tag{2.2}$$
For instance, Cohen et al. (1982) used a one-dimensional version of (2.2) with nearest neighbor coupling to analyze the lamprey locomotion. More general one-dimensional models have been investigated in great detail by Kopell and Ermentrout (1986, 1988), by studying under which conditions their solutions can be approximated by the solutions of the partial differential equation obtained in the continuum limit. When $\omega_i = 0$ (or a constant) and the graph of interactions is the two-dimensional square lattice, (2.2) yields the well known XY model of statistical mechanics (see, for instance, Kosterlitz and Thouless 1973). Fully interconnected versions of (2.2) have also been studied (Kuramoto and Nishikawa 1987). It should be kept in mind that, in general, the coupled limit cycle approach yields satisfactory results provided two conditions are satisfied: (1) the population of nonlinear oscillators should be fairly homogeneous; and (2) the oscillators should not operate in a regime that significantly perturbs the wave form and amplitude of their stable limit cycles, which implies that the coupling strengths (and the external inputs or noise, if any) should not be too strong. Simulations in two dimensions (and analytical results in one dimension) show that synchronization of oscillators, over distances consistent with the data, seldom occurs if only local nearest neighbor connections are used. To account for the distant phase-lockings, Kammen et al. (1989) assume the existence of a common feedback and consider a comparator model of the form
$$\frac{d\theta_i}{dt} = \omega_i + K \sin(\bar{\theta} - \theta_i) \tag{2.3}$$
where 8 is the average phase (C,O,/n).A different possibility we suggest is to assume that the synchronization of groups of neurons in remote columns rests on the coupling induced by the long horizontal connections, extending over several millimeters, between columns of similar orientation preference (see, for instance, Ts’o et al. 1986). To model the data of Singer and Gray along these lines, consider a caricature of the cortical surface as in Figure 1, where ocular dominance stripes p d orientation columns intersect orthogonally. The phase of a neuron or of a group of coherently oscillating neurons within a cortical column is represented by a variable 8i and satisfies (2.4)
Computing with Arrays of Coupled Oscillators
I
461
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 1: A caricature of the cortical surface (area 171, with two excited regions, one for each stimulating horizontal bar, and showing the ocular dominance columns intersecting the orientation stripes orthogonally. Long-range connections, over several millimeters, between columns with similar orientation are indicated. The combined receptive fields of the cells within one of the excited regions cover exactly the region of space corresponding to one of the moving bars. Within an excited region, activity is maximal for the cells with optimal orientation (dark stripes). where q 2 ( t ) is gaussian noise with a fixed variance. The frequency w is zero except in the regions that have been activated by the stimulus. For an excited region, corresponding to several columns with combined receptive fields covering the stimulus, the frequency w is set to some constant value. Now, within such an oscillating region, neural activity is variable, maximal for neurons corresponding to the optimal orientation and almost nonexistent for oscillators associated with the orthogonal direction (i.e., for columns where, most likely, the oscillation is mainly subthreshold). This aspect can be modeled by introducing an amplitude A,(t), which is zero everywhere except in an excited region where, to a first approximation,
$$A_i(t) = \begin{cases} a & \text{if } i \text{ has optimal or near optimal orientation} \\ b & \text{if the oscillations at } i \text{ are mostly subthreshold} \end{cases} \qquad (2.5)$$
with $a > b > 0$ (of course, one can consider a continuum of amplitudes). Finally, for the coupling strengths $K_{ij}$, different cases would need to be considered depending on the distance between $i$ and $j$ and also the amplitude of the activity at these sites. From a formal standpoint, such a description is only a discretized version of a model shown to fit the data by Sompolinsky et al. (1990) (although their interpretation of some
P. Baldi and R. Meir
of the variables seems to differ), which can be summarized by (2.4) and two additional equations:

$$A_i = \begin{cases} A(\Delta\theta) & \text{in an excited region} \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)$$

where $\Delta\theta$ represents the difference between the preferred orientation at $i$ and the orientation of the local stimulus and $A$ is, for instance, a gaussian function. In addition,

$$K_{ij} = A_i A_j K[d(i,j)] \qquad (2.7)$$
where $K[d(i,j)]$ takes into account all the connections: from the very short, presumably roughly isotropic, to the long ones, extending over several millimeters and dependent on similarity of orientation. Notice that (2.7) does not necessarily require postulating the existence of fast hebbian synapses. It only models the coupling strength between two oscillators as being dependent on the level of activity present in the two neural regions they represent. Finally, it appears from anatomical and physiological studies that the excitation due to intracortical connections by far exceeds the excitation mediated by thalamic afferents corresponding to the inputs. Thus, if (2.4) is to be used as a cortical model, the parameters should be chosen so that the dominant contribution arises from the coupling terms.

Synchronization issues in neural networks have also been addressed in Atiya and Baldi (1989) using continuous analog and integrate-and-fire model neurons and in Wilson and Bower (1990) using more detailed compartment models. In Lytton and Sejnowski (1990), compartment models are also used to demonstrate the synchronizing action that inhibitory interneurons such as basket or chandelier cells may have on cortical pyramidal cells. Although more experimental data are critically needed, a consensus seems to be emerging that phase-locking of remote groups of neurons results either from common feedback projections, or from the action of long-range horizontal connections, or, of course, from various possible combinations of both mechanisms. In any event, the experimental data suggest that it may be worth investigating the properties of a new type of "neural" network, consisting of large arrays of coupled oscillators where computations are carried out through rapid spatiotemporal self-organization of coherent regions of activity.
The coupled oscillator models described here are extremely simplistic and are not meant to closely fit what is presently known of cortical neuroanatomy and neurophysiology. Rather, a minimal set of assumptions is introduced and plenty of room exists for successively incorporating more realistic details and manipulating more complex dynamics. One advantage of the relative simplicity of this approach is that fairly large systems can be simulated on a digital computer. Since several models have already been proposed, it becomes even more important to try to assess whether there are any intrinsic computational advantages to oscillator networks and
whether they could be used for certain specific tasks. A possible natural domain of application for networks of oscillators seems to be early vision. As a specific example, we have attempted to apply the previous concepts to the problem of preattentive texture discrimination.

Figure 2: The basic architecture: the image is first convolved with several banks of Gabor filters, which, in turn, excite several arrays of coupled oscillators. [Diagram labels: IMAGE, FILTERS, ARRAYS OF OSCILLATORS.]
3 Texture Discrimination
The real world is seldom constructed of homogeneous objects whose boundaries are given by luminosity gradients. Many natural images contain regions composed of different microfeatures (texture elements) that repeat in some quasiperiodic manner to cover the surface. Examples of textured surfaces could be fabric, lawn, water bubbles, etc. In this context, it would be of interest to understand how the visual system manages to find boundaries between objects where no luminosity gradient exists. Over the past 15 years or so, much work in early vision has focused on how regions composed of different texture elements are segmented. In particular, Julesz (see, for example, Julesz 1984) has investigated several classes of artificial textures and found that certain pairs of textures can be preattentively discriminated, while others require serial search. Although the sharp division between preattentive and attentive texture segmentation has been questioned in recent years (Gurnsey and Browse 1987), it seems that many texture discrimination problems are indeed low level (i.e., parallel and bottom up).
A few algorithms have been described in the literature that seem to achieve texture discrimination abilities similar to those of humans. Two recent contributions are by Fogel and Sagi (1989) and Malik and Perona (1990). These algorithms are often motivated by analogies with the known neurophysiology (see Van Essen et al. 1989 and references therein). In particular, the first operation carried out in layer IV of primary visual cortex is believed to consist in part of a filtering of the image through a set of feature detectors of varying orientations and spatial frequencies. It would thus be reasonable to surmise that this initial filtering process is a necessary requirement for any model attempting biological plausibility. Since there is little experimental data to guide us in understanding what the visual system does with this raw filtered image, most current
Figure 3: (a) An example of a 64 x 64 texture used in our simulations. The size of the microfeatures is 4 x 4 pixels. (b) The receptive fields of 13 different even filters are represented on the left. On the right, the energies associated with the corresponding convolutions with the texture of (a). Intensity levels are coded by colors. The first filter (top left) is just a laplacian operator. The coefficients in equation 3.1 used to generate the remaining 12 Gabor filters are $\sigma = 3$, $\nu = 2\pi n/4$ with $n = 0.6$, 1.2, and 1.8, and $\alpha = 0$, $\pi/4$, $\pi/2$, $3\pi/4$, $\pi$.

Figure 4: (a) The temporal evolution of an oscillator array corresponding to one type of filter (shown at the top). Phases are coded by colors and the entire sequence corresponds to 2 oscillator cycles. Parameters, for all the simulations, are $\omega = 2$, $[E(\eta^2)]^{1/2} = 0.4$, $K = 60$, and $T = 0.02$, and boundary conditions are free. (b) Same as (a), but on a slower time scale. The entire sequence corresponds to 5 oscillator cycles.
algorithms diverge at this point. Assuming that the textured image consists of a pair of preattentively discriminable textures (such as in Fig. 3a), an early visual channel must exist that can discriminate between the two. The problem faced by any system at this stage, then, is to use the biased response of at least one of the filters to segment the image quickly and robustly. Fogel and Sagi (1989), for instance, proposed first smoothing the filtered image with a gaussian filter, applying some noise reduction techniques, and then using a laplacian operator to detect the boundaries. Malik and Perona (1990) used a half-wave rectification stage combined with lateral inhibition between different filters, and demonstrated its effectiveness in reproducing most of the known psychophysical data. In what follows, we address the problems of smoothing noisy response profiles and enhancing boundary formations between regions corresponding to different textures using oscillator architectures. Our main goal is to exploit the dynamics of the system to perform the computational task, without recourse to further filtering and smoothing operations which, at this stage, have not been found. Thus, although we apply our ideas to texture segmentation, we are mainly concerned with the dynamic aspects of such processing. We have considered simple but nontrivial texture images, of size 64 x 64 (see Figs. 3 and 5), constructed so as to avoid differences in luminosity among texture patterns. As already mentioned, oscillator arrays cannot be used alone but only as a module of a more complex system, a few processing stages away from the sensory interface. Thus, the images are first convolved with several 32 x 32 banks of Gabor filters (see, for instance, Daugman 1985) of even and odd symmetry with different orientations and spatial frequencies.
A filter with coordinate center $(a, b)$, spatial frequency $\nu$, width $\sigma$, and orientation $\alpha$ can be described by the convolution kernel

$$e^{-[(x-a)^2+(y-b)^2]/2\sigma^2}\,\sin[\nu(x-a)\cos\alpha - \nu(y-b)\sin\alpha + \phi] \qquad (3.1)$$
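As an illustration of equation (3.1), the following minimal sketch samples the kernel on a pixel grid and combines a quadrature pair ($\phi = 0$ and $\phi = \pi/2$) into an energy map by squaring and summing the two responses. The function names, the default kernel size of 19 pixels, and the use of FFT-based circular convolution are assumptions made for this example only; they are not taken from the simulations reported here.

```python
import numpy as np

def gabor_kernel(size, nu, sigma, alpha, phi, a=0.0, b=0.0):
    """Sample the kernel of equation (3.1) on a size x size grid centred on 0."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-((x - a) ** 2 + (y - b) ** 2) / (2.0 * sigma ** 2))
    carrier = np.sin(nu * (x - a) * np.cos(alpha)
                     - nu * (y - b) * np.sin(alpha) + phi)
    return envelope * carrier

def convolve_circular(image, kernel):
    """Circular 2-D convolution via the FFT (hypothetical helper)."""
    k = kernel.shape[0]
    padded = np.zeros(image.shape, dtype=float)
    padded[:k, :k] = kernel
    # Shift the kernel centre to index (0, 0) so the output is not displaced.
    padded = np.roll(padded, (-(k // 2), -(k // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def gabor_energy(image, nu, sigma, alpha, size=19):
    """Sum of squared responses of the odd (phi=0) and even (phi=pi/2) filters."""
    odd = convolve_circular(image, gabor_kernel(size, nu, sigma, alpha, 0.0))
    even = convolve_circular(image, gabor_kernel(size, nu, sigma, alpha, np.pi / 2))
    return odd ** 2 + even ** 2

# Toy input: a 64 x 64 grating that varies along x with period 8 pixels.
nu = np.pi / 4
grating = np.tile(np.sin(nu * np.arange(64)), (64, 1))
energy_matched = gabor_energy(grating, nu, sigma=3.0, alpha=0.0)
energy_orthogonal = gabor_energy(grating, nu, sigma=3.0, alpha=np.pi / 2)
```

On this toy grating the filter whose orientation matches the stripes returns a far larger mean energy than the orthogonal one, which is the kind of biased response a texture channel would pass on to the next stage.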
The responses of the odd ($\phi = 0$) and even ($\phi = \pi/2$) filters are squared and summed to give the energy at each orientation and spatial frequency (see Fogel and Sagi 1989). We emphasize that our interest lies not in finding the best filters or the most biologically plausible representation for them, but in the dynamic segmentation of the image. Our algorithm differs from currently existing ones by feeding the response of each bank of filters into a corresponding 32 x 32 array of oscillators satisfying equations (2.4) and (2.7) (see Figs. 2, 3b and 5b). The basic idea is to use the phase-locking properties of long-range coupled oscillators (Kuramoto and Nishikawa 1987) to achieve separation in time between the figure (one type of texture) and the ground (the other texture). Moreover, as is demonstrated in Figures 5 and 6, the same approach can be used to detect signals in noisy environments in a very fast and parallel fashion. Here, the amplitude $A_i(t)$, in a given array, is equal to the energy of the corresponding pair of Gabor filters and does not vary with time. The frequency $\omega$ is chosen
to be constant and identical for all the oscillators, and the $\eta_i$ are independent identical gaussian random variables. The connection strength pattern $K[d(i,j)]$ is zero everywhere except for two oscillators $i$ and $j$ belonging to the same array and contained in a square with side of length 8, in which case the coupling assumes a constant value $K$. In particular, at this stage, there are no connections between oscillators pertaining to different arrays associated with different banks of Gabor filters. In simulations, we have found that the prescription for the couplings $K_{ij}$ given by (2.7) is often insufficient to generate reasonable phase boundaries and therefore we have replaced it by the more general form
$$K_{ij} = A_i A_j K[d(i,j)]\, F(\bar{A}_i - T)\, F(\bar{A}_j - T) \qquad (3.2)$$
where $T$ is some threshold value, $\bar{A}_i$ is the average of the activity in a neighborhood of $i$ (taken here to be a square with side of length 5), and $F(x) = 1$ if $x > 0$ and 0 otherwise. Examples of the evolution, from a random initial state, of the phases of the oscillators in some of the arrays are shown in Figures 4 and 6. It can be seen that very rapidly, within a few oscillator cycles (typically 2 to 4), the figure texture oscillates coherently (against a random background) in one of the arrays. The separation between the regions is certainly not sharp. However, precise boundaries should not be expected in preattentive vision. Furthermore, sharper contours could easily be achieved by introducing additional mechanisms such as fast hebbian synapses coupled with the oscillator dynamics or finer threshold adjustments. No attempt has been made here to optimize any of the parameters or the connectivity. This work is intended only as a demonstration and many refinements seem possible. In particular, interactions between the different filters and/or oscillators should be investigated. Similarly, only a restricted set of textures has been generated for this study.

In conclusion, a demonstration has been given that, at least in principle, preattentive texture segmentation can be solved by a temporal phase coherence mechanism and implemented with simple arrays of coupled oscillators. The basic components of our algorithm are the phase-locking of strongly excited, distantly coupled oscillators corresponding to the figure and the decoupling of weakly excited, weakly coupled oscillators corresponding to the ground. Two critical parameters should be emphasized: the extent of the lateral connections and the active role played by the noise.
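The following minimal sketch shows how couplings of the form (3.2) might drive a small array toward a phase-locked figure on an incoherent ground. Several things are assumptions made for illustration: the phase dynamics are taken to be of the noisy Kuramoto type, $d\theta_i/dt = \omega + \sum_j K_{ij} \sin(\theta_j - \theta_i) + \eta_i(t)$, since equation (2.4) is not reproduced in this section; "contained in a square with side of length 8" is read as a Chebyshev distance below 8; the grid is 16 x 16 with binary amplitudes rather than 32 x 32 with Gabor energies; and $K = 1$, the time step, and the function names are all hypothetical choices.

```python
import numpy as np

def local_mean(A, side=5):
    """Average of A over a side x side square around each site (clipped at edges)."""
    h = side // 2
    out = np.empty(A.shape, dtype=float)
    for r in range(A.shape[0]):
        for c in range(A.shape[1]):
            out[r, c] = A[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1].mean()
    return out

def coupling_matrix(A, K=1.0, side=8, T=0.02):
    """K_ij = A_i A_j K[d(i,j)] F(Abar_i - T) F(Abar_j - T), as in (3.2)."""
    n_rows, n_cols = A.shape
    rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
    d_rows = np.abs(rows[:, None] - rows[None, :])
    d_cols = np.abs(cols[:, None] - cols[None, :])
    near = np.maximum(d_rows, d_cols) < side   # assumed meaning of K[d(i,j)]
    np.fill_diagonal(near, False)              # no self-coupling
    gate = (local_mean(A).ravel() > T).astype(float)  # F(Abar_i - T)
    gated = A.ravel() * gate
    return K * near * np.outer(gated, gated)

def simulate(A, omega=2.0, noise_std=0.4, K=1.0, T=0.02,
             dt=0.01, steps=400, seed=0):
    """Euler-Maruyama integration of the assumed phase equation."""
    rng = np.random.default_rng(seed)
    Kij = coupling_matrix(A, K=K, T=T)
    theta = rng.uniform(0.0, 2.0 * np.pi, A.size)   # random initial state
    for _ in range(steps):
        diff = theta[None, :] - theta[:, None]       # diff[i, j] = theta_j - theta_i
        drift = omega + (Kij * np.sin(diff)).sum(axis=1)
        noise = noise_std * np.sqrt(dt) * rng.standard_normal(A.size)
        theta = theta + dt * drift + noise
    return np.mod(theta, 2.0 * np.pi).reshape(A.shape)

def coherence(theta):
    """Modulus of the mean unit phasor: 1 for locked phases, near 0 for random ones."""
    return float(np.abs(np.exp(1j * theta).mean()))

# Toy amplitude map: an 8 x 8 "figure" of active sites on a silent ground.
A = np.zeros((16, 16))
A[4:12, 4:12] = 1.0
theta = simulate(A)
figure_coherence = coherence(theta[A > 0])
ground_coherence = coherence(theta[A == 0])
```

In this toy setting the ground has zero amplitude, so its oscillators are fully decoupled and keep their random phases, whereas in the text the ground is weakly excited and it is the noise that randomizes it. The sketch is therefore a caricature of the mechanism, not a reproduction of Figures 4 and 6.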
In simulations of the XY model with nearest neighbor couplings, we have observed that, starting from a random initial configuration, these systems tend to rapidly organize themselves into a patchy structure of phase-locked regions, a few oscillators in diameter or so. This in turn suggests that, if synchronizations are to be achieved over significantly greater distances, longer connections become a necessity. Analytically, it can be shown that fully interconnected systems tend to synchronize below a certain critical temperature (see Kuramoto and Nishikawa 1987). In the case of visual images, texture regions naturally
Figure 5: (a) Similar to Figure 3a, but with a different image. Only 9 filters are shown.
cover several degrees of visual field corresponding to several millimeters of primary visual cortex [the magnification factor, at eccentricity $E$, is roughly given by $M = (0.8 + E)^{-1.1}$ mm/deg (Van Essen et al. 1984)]. This is not inconsistent with the anatomical evidence for the existence of long-range connections within cortical areas and the fact that, in simulations, couplings over ranges roughly comparable to the size of a typical figure are needed for phase-locking of the oscillators within the figure region. The noise also plays an important role: it has a small effect on the figure where the effective couplings are large but is essential in decoupling the background by randomizing the phases. The same principles used here could perhaps be applied to other problems in vision such as, for instance, illusory conjunctions and contour filling type of experiments. It is too early and beyond the scope of this article to compare the performance of our algorithm to others or to try to match it with the available psychophysical data. The mechanisms described should be refined and
Figure 5: (b) Similar to Figure 3b, but with a different image. Only 9 filters are shown.

Figure 6: (a,b) Evolution of two arrays of oscillators corresponding to two different filters shown at the top of each picture. Each sequence corresponds to 2 oscillator cycles.
tested on a variety of different situations. Similarly, it is premature to try to assess whether oscillator architectures present any intrinsic computational advantages. Certainly, oscillator networks have remarkable properties of robust and very rapid self-organization and naturally utilize the temporal dimension. Hardwired implementations of coupled oscillators do not seem to pose, at least in principle, any conceptual obstacles. On the other hand, the same tasks described here could be achieved with stationary signals and it should be kept in mind that little information per se is contained in a group of coherently firing neurons. If coupled oscillators are to be used in a computation, significant additional machinery is required to process and route the information.
Acknowledgments

This work is supported by a McDonnell-Pew Grant and ONR Contract NAS7-100/918 to P. B., ONR Grant N00014-87-K-0377 (J. J. Hopfield, P. I.), and a Weizmann Fellowship to R. M.
References

Atiya, A., and Baldi, P. 1989. Oscillations and synchronizations in neural networks. An exploration of the labeling hypothesis. Int. J. Neural Syst. 1(2), 103-124.
Cohen, A. H., Holmes, P. J., and Rand, R. H. 1982. The nature of the coupling between segmental oscillators of the lamprey spinal generator for locomotion: A mathematical model. J. Math. Biol. 13, 345-369.
Daugman, J. G. 1985. Uncertainty relations for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. 2(7), 1160-1169.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboek, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybernet. 60, 121-130.
Fogel, I., and Sagi, D. 1989. Gabor filters as texture discriminators. Biol. Cybernet. 61, 103-113.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989a. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Gray, C. M., Raether, A., and Singer, W. 1989b. Stimulus-specific intercolumnar interactions of oscillatory neuronal responses in the visual cortex of alert cats. 19th Neuroscience Meeting, Phoenix, Arizona (Abstract).
Gurnsey, R., and Browse, R. A. 1987. Aspects of texture discrimination. In Computational Processes in Human Vision, Z. Pylyshyn, ed. Ablex, New York.
Julesz, B. 1984. Towards an axiomatic theory of preattentive vision. In Dynamic Aspects of Neocortical Function, G. Edelman, W. E. Gall, and W. Cowan, eds. John Wiley, New York.
Kammen, D. M., Holmes, P. J., and Koch, C. 1989. Cortical architecture and oscillations in neuronal networks: Feed-back versus local coupling. In Models of Brain Function, R. M. J. Cotterill, ed. Cambridge University Press, Cambridge, pp. 273-284.
Kopell, N., and Ermentrout, G. B. 1986. Symmetry and phaselocking in chains of weakly coupled oscillators. Commun. Pure Appl. Math. XXXIX, 623-660.
Kopell, N., and Ermentrout, G. B. 1988. Coupled oscillators and the design of central pattern generators. Math. Biosci. 90, 87-109.
Kosterlitz, J. M., and Thouless, D. J. 1973. Ordering, metastability and phase transitions in two-dimensional systems. J. Phys. C 6, 1181-1203.
Kuramoto, Y., and Nishikawa, I. 1987. Statistical macrodynamics of large dynamical systems. Case of a phase transition in oscillator communities. J. Stat. Phys. 49(3/4), 569-605.
Lytton, W. W., and Sejnowski, T. J. 1990. Inhibitory interneurons may help synchronize oscillations in cortical pyramidal neurons. Presented at the 1990 Conference on Neural Networks for Computing, Snowbird, Utah.
Malik, J., and Perona, P. 1990. Preattentive texture discrimination with early vision mechanisms. J. Opt. Soc. Am. A 7(5), 923-932.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Preprint.
Ts'o, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6(4), 1160-1170.
Van Essen, D. C., Newsome, W. T., and Maunsell, J. H. R. 1984. The visual field representation in striate cortex of the macaque monkey: Asymmetries, anisotropies and individual variability. Vision Res. 24, 429-448.
Van Essen, D. C., DeYoe, E. A., Olavarria, J. F., Knierim, J. J., Fox, J. J., Sagi, D., and Julesz, B. 1989. Neural responses to static and moving texture patterns in visual cortex of the macaque monkey. In Neural Mechanisms of Visual Perception. Proceedings of the Retina Research Foundation Symposium, Vol. 2, D. M. Lam and C. Gilbert, eds. Portfolio Publishing, Woodlands, TX.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Department of Neurobiology, Max Planck Institute for Biophysical Chemistry.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.
Wilson, M. A., and Bower, J. M. 1990. Computer simulation of oscillatory behavior in cerebral cortical networks. Proceedings of the 1989 NIPS Conference, Denver, Colorado. In Advances in Neural Information Processing Systems 2, edited by David S. Touretzky, Morgan Kaufmann, pp. 84-91.
Received 18 April 90; accepted 3 August 90.
Communicated by Richard Lippmann
Robust Classifiers without Robust Features

Alan J. Katz
Michael T. Gately
Dean R. Collins
Central Research Laboratories, Texas Instruments Incorporated, Dallas, TX 75265 USA
We develop a two-stage, modular neural network classifier and apply it to an automatic target recognition problem. The data are features extracted from infrared and TV images. We discuss the problem of robust classification in terms of a family of decision surfaces, the members of which are functions of a set of global variables. The global variables characterize how the feature space changes from one image to the next. We obtain rapid training times and robust classification with this modular neural network approach.

1 Introduction

Neural networks are finding increasing application in the area of pattern recognition and classification (DARPA 1988). Most pattern recognition problems of practical interest are so complex that they do not easily reduce to simple rule sets; moreover, the computational loads of many pattern recognition problems severely tax present-day serial computers. Neural networks offer not only the potential of parallel computation but an approach to pattern recognition and classification rooted in learning: the networks train on a set of example patterns and discover relationships that distinguish the patterns. But neural network classifiers constructed through training may fail to be robust: success with the training data may mean little when the classifier is tested on data on which it did not train. The success of a learning-based approach depends on how representative the training data are of the complete data set. If the training data are not representative, then the likelihood that the mapping encoded by the network extends to test data is small. How to ensure the robustness of the classifier - whether a conventional or neural network classifier - is an open and fundamentally important issue in pattern recognition and classification.
In this paper, we address the issue of robust neural network classification in the context of a particular, but representative, pattern recognition problem: the automatic target recognition (Roth 1990) problem of distinguishing targets from clutter.

Neural Computation 2, 472-479 (1990) © 1990 Massachusetts Institute of Technology
2 Robustness: A Framework
We define both targets and clutter in terms of a feature set $F = (f_1, f_2, \ldots, f_n)$. For any one image, $I_a$, targets and clutter are separable in the n-dimensional feature space. We can, therefore, build a neural network classifier to distinguish the two classes (we only consider multilayer, perceptron-type neural networks). As long as we train and test the neural network classifier with data points from $I_a$ or images closely related to $I_a$, we find that the network classifies target and clutter reliably. A problem usually arises when we test with data points from an unrelated image $I_b$. The classifier often fails to recognize the new target and clutter test points. Target and clutter points from $I_b$ still separate in the n-dimensional feature space, but we find little correlation between the set of decision surfaces defined by the data from $I_a$ and those defined by the data from $I_b$. As we go from image $I_a$ to image $I_b$, the clusters of target and clutter points (and, hence, the decision surfaces) shift and rotate in complex ways in the n-dimensional feature space (see Fig. 1). Though the features are nonrobust, we can still maintain some robustness if we can determine how the classification problem transforms as a function of some set of image parameters. The features then depend on the parameter set $P = (\lambda_1, \lambda_2, \ldots, \lambda_s)$, which varies from image to image. These parameters are any variables that affect the feature data (e.g., lighting, time-of-day, ambient temperature, and context of scene). For example, we can gain insight into $P$ from knowledge of the sensor and how it operates. Since the features depend differently on elements of $P$, variations in the parameter set $P$ lead to complicated transformations of
points in the feature space. Hence, we replace the original classification problem, which was a mapping from the feature space $F$ to some set of output classes (we consider only two output classes, targets and clutter), with a more complex classification problem, which involves a family of mappings $\mathcal{F}$, defined by the parameter set $P$. Usually, we will not have a deterministic model that relates mappings and the parameters of $P$, so we will have to derive the relationship between $P$ and $\mathcal{F}$ through training with relevant examples. We can incorporate the parameter set $P$ into the classifier in several ways. First, we can use $P$ to normalize the feature data prior to classification. Second, we can expand the original feature space $F$ from $n$ to $n + s$ dimensions and consider the mapping from an enlarged feature space $F'$ to the output space: $M_{F'} : \mathbb{R}^{n+s} \to \{0, 1\}$. Third, since $P$ defines a family of mappings, we can view the complex classification as a two-step problem: (1) the mapping $M_P$ from the s-dimensional parameter space $P$ to the function space $\mathcal{F} = \{M_F^1, M_F^2, \ldots, M_F^m\}$, and (2) the mappings $M_F^i$ from the n-dimensional feature space $F$ to the output space $\{0, 1\}$. Alternatively, the order of the mappings $M_P$ and $M_F^i$ can be reversed (Hampshire and Waibel 1990).

Figure 1: Target and clutter objects are separable within a given image. The statistical meanings of target and clutter, however, change from image to image and result in translations, rotations, and reflections of the decision surface. [Panel labels: Image Set #1, Image Set #2, Target, Clutter, Decision Surface from Image Set #1.]

Figure 2: The parameter settings $P$ drive the switch $NN_P$ to route feature data $F$ to a network $NN_i$, which maps the feature data to a binary output space (target or clutter). [Diagram labels: $P = (\lambda_1, \lambda_2, \ldots, \lambda_s)$; $F = (f_1, f_2, \ldots, f_n)$; $NN_1$ with $M_F^1$; $NN_2$ with $M_F^2$.]
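The two-step view, a switch $M_P$ that selects one of the mappings $M_F^i$, can be sketched as follows. This is only a forward-pass illustration under stated assumptions: the class name, the linear switch, the single-hyperplane second-stage classifiers, and every weight value are hypothetical and hand-set rather than learned by backpropagation.

```python
import numpy as np

class TwoStageClassifier:
    """Hypothetical two-stage router: M_P picks an index, M_F^i classifies."""

    def __init__(self, switch_weights, switch_bias, experts):
        self.W = np.asarray(switch_weights, dtype=float)  # shape (m, s)
        self.b = np.asarray(switch_bias, dtype=float)     # shape (m,)
        self.experts = experts                            # m pairs (w_i, b_i)

    def select(self, p):
        """M_P: map the global parameters p to the index of one M_F^i."""
        scores = self.W @ np.asarray(p, dtype=float) + self.b
        return int(np.argmax(scores))

    def classify(self, p, f):
        """Route the features f to the selected network; 1 = target, 0 = clutter."""
        w, b = self.experts[self.select(p)]
        return int(np.dot(w, np.asarray(f, dtype=float)) + b > 0)

# Toy setup: one global parameter (s = 1), two features (n = 2), two image sets
# whose decision surfaces are reflections of each other.
clf = TwoStageClassifier(
    switch_weights=[[-1.0], [1.0]],
    switch_bias=[0.5, -0.5],
    experts=[(np.array([1.0, 0.0]), 0.0),    # set 1: target when f_1 > 0
             (np.array([-1.0, 0.0]), 0.0)],  # set 2: target when f_1 < 0
)
same_features = [0.7, 0.1]
label_in_set_1 = clf.classify([0.2], same_features)
label_in_set_2 = clf.classify([0.9], same_features)
```

The same feature vector receives opposite labels depending on the global parameter, which is the situation a single fixed decision surface cannot handle when the surface is reflected from one image set to the next.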
Image set   Number of images   Location   Partially occluded targets
1           27                 G          No
2           20                 G          No
3           7                  S          No
4           5                  S          Yes
5           20                 N          Yes
Table 1: Overview of Image Sets.

Our neural network approach is shown in Figure 2. The top network in the figure performs the mapping $M_P$, which is a switch for selecting one of the $m$ classifiers in the second layer. The networks in the second layer are realizations of the mappings $M_F^i$. Modular neural network approaches (Waibel 1989), where the desired mappings are accomplished with several smaller neural networks, typically require less training time than approaches that utilize a single large network to carry out the mappings. Moreover, the modular approach is a means to train the system on specific subsets of the feature data and, therefore, to control the nature of the mapping learned by the networks.

3 Application to Automatic Target Recognition

3.1 Image Data. In this section, we apply a modular neural network approach to an automatic target recognition problem. The objective is to construct a classifier to distinguish various vehicles from background clutter. The data are five sets of bore-sighted infrared (IR) and daytime television (TV) images (256 x 256 pixels). Table 1 shows the number of images in each image set, the location codes, and which image sets have partially occluded targets. Each image set includes views of the targets from different viewing angles and at different distances. Locations S and G were mostly open fields with scattered trees, whereas location N was more heavily forested. Extracting features from the images requires considerable image preprocessing. We perform the preprocessing using conventional algorithms. The first preprocessing step identifies regions in the image of high contrast and with high probability of containing a target. This screening step drastically reduces the overall processing time. High contrast areas are found separately for the IR and TV images.
In the second preprocessing step, eight contrast-based and eight texture-based features are extracted from the screened regions (e.g., intensity contrast, intensity standard deviation, and signal-to-noise ratio). For each screened
area in an IR (TV) image, the corresponding set of pixels in the TV (IR) image is marked and features are extracted. We combine features from the same pixel locations in the IR and TV images to form a composite feature vector with 32 features. Fusing of information from the two sensors is crucial to distinguishing targets from clutter: using TV features or IR features alone makes the separation of targets and clutter more difficult. In addition to the 32 features, we extract 12 global features (e.g., total luminosity, maximum image contrast, and overall edge content) from each IR/TV image pair. These global features make up the parameter space $P$ and provide information that relates similar images and differentiates divergent images. We found that the feature space is nonrobust in the sense described in the preceding section. We trained five neural networks without hidden units, one for each of the five image sets, on the 32-component feature vectors. We demonstrated that the five decision boundaries defined by the neural networks are uncorrelated. The largest correlation between any two decision boundaries was only 0.61. We also cross-tested the data by training networks with images from one location and testing with images from the other locations. We found in one test, for example, that the false alarm rate increased by a factor of four (over a network trained on all locations) with no improvement in probability of detection.

3.2 Two-Stage, Modular Neural Network Classifier. To do the classification, we built a two-stage, modular neural network classifier. The first stage consists of an 8 x 5 feedforward neural network, with eight analog input neurons - we input eight of the 12 global features (the other four play a small role in the mapping) - and five output neurons, which designate the mappings $M_F^i$ for the five image sets. This network performs the mapping from the parameter space $P$ to the function space $\mathcal{F}$.
The second stage of the classifier contains five independent neural networks, which effect the mappings $M_F^i$. The structures of the five neural networks are 16 x 1, 6 x 1, 4 x 1, 9 x 1, and 14 x 4 x 1. The number of input neurons is smaller than 32 because we only include features that are relevant to the classification (we use the relative magnitudes of the weight vectors associated with each input neuron to determine the relevant features). All neural networks are trained with the backpropagation (Rumelhart et al. 1986) learning algorithm. Each of the five networks at the second stage trains on feature data from three-quarters of the images (randomly selected) from the respective image set: all the image sets are, therefore, represented in the training set. The neural network at the first stage trains on global features extracted from all images in the training set. The remaining one-quarter of the images serve as the test set. Test results (which are averages over 48 runs) for the modular system are shown in Table 2, Row A. The error bars represent the standard deviation for results from the various runs. The two-stage classifier generalizes well even though the input
set is noninvariant over the data domain. The generalization rate of the first-stage neural network is 0.96 and indicates that the first-stage neural network performs well as a switch to the appropriate network in the second stage. When no errors are made at the first stage, the test results for the classifier did not improve. To test how essential the global features are for proper generalization, we built a 32 x 16 x 1 neural network that did not include the global features, and we trained the network on three-quarters of the image data (chosen the same way as above). The number of training cycles and the overall training time were an order of magnitude greater for the single network than for the modular system (10 hr vs. 1 hr on a Texas Instruments Explorer II LISP machine with a conjugate-gradient descent version of backpropagation); the training and generalization results for the single network, which are shown in Table 2, Row B, were within the error bars of the results for the modular system. These results did not improve when we reduced either the number of input neurons or the number of hidden neurons. We conclude that the advantage of the two-stage, modular system over the single network for this data set is training efficiency.

We ran a second set of experiments to demonstrate the generalization advantages of the two-stage approach. The data were two single-sensor (IR) image sets, one with 30 images and the second with 90 images. Sixteen texture-based features were extracted from each high contrast region. Not only were the decision surfaces in the two sets uncorrelated (0.58), but there was considerable overlap between the target cluster in one image set and the clutter cluster in the second set. We expected that the absence of the global features would impair learning and generalization. The training sets for the two-stage, modular and the single networks
Test case   Probability of detection   Probability of false alarm
A           0.85 ± 0.06                0.18 ± 0.06
B           0.82 ± 0.03                0.20 ± 0.03
C           0.99 ± 0.01                0.00 ± 0.01
D           0.96 ± 0.01                0.05 ± 0.01
Table 2: Network results are expressed as P_d, the probability of detection, and P_fa, the probability of false alarm. P_d is the number of targets correctly identified as targets divided by the total number of targets. P_fa is the number of clutter objects incorrectly identified as targets divided by the total number of objects identified as targets. First set of experiments: (A) two-stage, modular system (numbers represent an average over 48 runs), (B) 32 x 16 x 1 network (numbers represent an average over six runs). Second set of experiments: (C) two-stage, modular system, (D) 16 x 32 x 1 network.
were identical. The results are shown in Table 2, Rows C and D. The generalization rates for the two-stage, modular system were nearly perfect (P_d = 0.99 ± 0.01 and P_fa = 0.00 ± 0.01). The two neural networks at the second stage both had the structure 16 x 1. The single neural network had the structure 16 x 32 x 1 (networks with fewer hidden units gave worse results, and the results did not improve with a network with 64 hidden units). The generalization rates for the single network were P_d = 0.96 ± 0.01 and P_fa = 0.05 ± 0.01. These results are worse than the results for the two-stage, modular system.

3.3 Discussion. A key parameter in the two-stage, modular network approach is the number of independent networks at the second stage. What determines this parameter is the number of diverse environments. To quantify the differences between environments, we construct a neural network for each new data set and measure the degree of correlation between it and the other existing networks. For the data set described in Table 1, for example, we trained five neural networks without hidden units, one for each image set. The network weights define a single hyperplane discriminant boundary between targets and clutter. We parameterize the hyperplanes in terms of their direction cosines and define the correlation between these hyperplanes as a function of the dot product of the direction-cosine vectors and their normal distances from the origin.¹ When the correlation between a given hyperplane and the others is small, we add the new network to the second stage. If the correlation is large, then we do not add a network unless the orientations of the surfaces are reversed (that is, the placement of targets and clutter is inverted).
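As a rough illustration of the comparison described above, the following Python sketch computes direction cosines and normal distances for two hyperplanes and combines them into one score. The combining function (the cosine of the angle between the normals, damped by an exponential of the gap between normal distances), the function names, and the `distance_scale` parameter are assumptions made for illustration; the text states only that the correlation is a function of the dot product and the normal distances.

```python
import numpy as np


def direction_cosines(weights):
    """Unit normal of the hyperplane w . x + b = 0."""
    w = np.asarray(weights, dtype=float)
    return w / np.linalg.norm(w)


def normal_distance(weights, bias):
    """Unsigned distance from the origin to the hyperplane w . x + b = 0."""
    w = np.asarray(weights, dtype=float)
    return abs(bias) / np.linalg.norm(w)


def hyperplane_similarity(w1, b1, w2, b2, distance_scale=1.0):
    """Assumed similarity score in [-1, 1] (not the authors' formula).

    Near +1: similar orientation and offset.
    Near -1: similar offset but reversed orientation.
    Near  0: orthogonal normals or very different offsets.
    """
    cos_angle = float(np.dot(direction_cosines(w1), direction_cosines(w2)))
    gap = abs(normal_distance(w1, b1) - normal_distance(w2, b2))
    return cos_angle * float(np.exp(-gap / distance_scale))


# Hypothetical weights of a network without hidden units.
w, b = [3.0, 4.0], 5.0
same = hyperplane_similarity(w, b, w, b)
reversed_plane = hyperplane_similarity(w, b, [-3.0, -4.0], -5.0)
orthogonal = hyperplane_similarity(w, b, [4.0, -3.0], 5.0)
```

Under this assumed score, a value near -1 flags the reversed-orientation case, which the text treats as a reason to add a network even though the surfaces are strongly correlated.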
4 Conclusions

We presented a two-stage, modular neural network for classifying objects in different environments. The first-stage neural network incorporates global environmental parameters and switches among different neural network classifiers at the second stage that are trained for specific environments. Inclusion of global environmental features provides for more robust classification. Moreover, the modular nature of our approach yields shorter training times than more monolithic network approaches.

Acknowledgments

We thank Bill Boyd and Jerry Herstein for providing the infrared and TV data. We also thank Virginia Fuller for creating the graphics.

¹More generally, we consider the correlations between the directions of maximum separability of the target points and the clutter points.
References

DARPA. 1988. Neural Network Study. AFCEA International Press, Fairfax, VA.
Hampshire, J. B., and Waibel, A. 1990. Connectionist architectures for multispeaker phoneme recognition. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 203-210. Morgan Kaufmann, San Mateo, CA.
Roth, M. W. 1990. Survey of neural network technology for automatic target recognition. IEEE Trans. Neural Networks 1, 28-43.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Waibel, A. 1989. Modular construction of time-delay neural networks for speech recognition. Neural Comp. 1, 39-46.
Received 8 June 90; accepted 17 August 90.
Communicated by Richard Lippmann
Use of an Artificial Neural Network for Data Analysis in Clinical Decision-Making: The Diagnosis of Acute Coronary Occlusion
William G. Baxt
Department of Medicine, University of California, San Diego Medical Center, San Diego, CA 92103 USA

Neural Computation 2, 480-489 (1990) © 1990 Massachusetts Institute of Technology

A nonlinear artificial neural network trained by backpropagation was applied to the diagnosis of acute myocardial infarction (coronary occlusion) in patients presenting to the emergency department with acute anterior chest pain. Three hundred and fifty-six patients were retrospectively studied, of whom 236 did not have acute myocardial infarction and 120 did have infarction. The network was trained on a randomly chosen set of half of the patients who had not sustained acute myocardial infarction and half of the patients who had sustained infarction. It was then tested on a set consisting of the remaining patients, to which it had not been exposed. The network correctly identified 92% of the patients with acute myocardial infarction and 96% of the patients without infarction. When all patients with electrocardiographic evidence of infarction were removed from the cohort, the network correctly identified 80% of the patients with infarction. This is substantially better than the performance reported for either physicians or any other analytical approach.

1 Introduction

Decision-making under uncertainty is often fraught with great difficulty when the data on which the decision is based are imprecise and poorly linked to predicted outcome (Holloway 1979). Clinical diagnosis is an example of such a setting (Moskowitz et al. 1988) because multiple, often unrelated, disease states can present with similar or identical historical, symptomatologic, and clinical data. In addition, singular disease states do not always present with the same historical, symptomatologic, and clinical data. As a result, physician accuracy in diagnosing many of these diseases is often disappointing. A number of approaches have been developed to analyze data collected during patient evaluation to improve on diagnostic accuracy, but none of these approaches has been able to improve significantly on the performance of well-trained physicians (Reggia and Tuhrim 1985; Szolovits et al. 1988). The question still remains as to
whether there is any means by which the data available in the clinical setting can be analyzed to yield information that can be utilized to improve diagnostic accuracy.

Acute myocardial infarction is an example of a disease process that has been difficult to diagnose accurately. A considerable number of methodologies have been developed in attempts to improve on the diagnostic accuracy of physicians in identifying the presence of acute myocardial infarction (Pozen et al. 1977, 1980, 1984; Goldman et al. 1982, 1988a; Patrick et al. 1976, 1977; Lee et al. 1985, 1987a,b; Tierney et al. 1985). Stepwise discriminant analysis (Pozen et al. 1977), logistic regression (Pozen et al. 1980), recursive partition analysis (Goldman et al. 1982), and pattern recognition (Patrick et al. 1976, 1977) have been utilized. The best of these approaches has performed with the same detection rate (sensitivity) (88%) and a slightly better false alarm rate (1.0 - specificity) (26% vs. 29%) than physicians (Goldman et al. 1988a).

This paper reports on the use of artificial neural network techniques (Widrow and Hoff 1960; Rumelhart et al. 1986; McClelland and Rumelhart 1988; Weigend et al. 1990; Mulsant and Servan-Schreiber 1988; Hudson et al. 1988; Smith et al. 1988; De Roach 1989; Saito and Nakano 1988; Marconi et al. 1989) to determine whether the data collected during the routine evaluation of patients for acute myocardial infarction contain previously inapparent information that can be used to improve on the diagnostic accuracy of predicting the presence of acute myocardial infarction.

2 Methods
The nonlinear artificial neural network was a multilayer perceptron trained with backpropagation by use of the McClelland and Rumelhart simulator (McClelland and Rumelhart 1988). Figure 1 depicts the topology of the network utilized. The network was trained by dividing the available data into a training set and a test set. Training took place by choosing input patterns from the training set and allowing activation to flow from the inputs through the hidden units to the output unit. The value of the output unit activation was then compared to the documented diagnosis for each pattern. The difference (error) between the actual activation of the output unit and the correct value was then utilized by the backpropagation algorithm (Rumelhart et al. 1986; McClelland and Rumelhart 1988) to modify all weights of the network so that future outputs approximate the correct diagnosis.

Because most patients presenting to the emergency department with anterior chest pain are not suffering from acute myocardial infarction (Goldman et al. 1988a), a subset of patients with a much greater probability of having sustained infarction was chosen for this study. To this end, only patients admitted to the coronary care unit were studied. In this way, the network was presented with the potentially most challenging pattern sets to differentiate.

Figure 1: 20 x 10 x 10 x 1 nonlinear artificial neural network. The network has 20 input units, two layers of 10 hidden units each, and one output unit. Only the connections from one input and one hidden unit from each layer are shown. The network simulator program was run on an 80386 microcomputer with an 80387 math coprocessor running at 20 MHz. Epsilon was set at 0.05. Alpha was set at 0.9. Initial weights were random. Training times ranged between 8 and 48 hr.

A retrospective chart review was performed on 356 patients who were admitted through the emergency department to the coronary care unit to rule out the presence of infarction. Forty-one variables reported to be predictive of the presence of acute myocardial infarction (Pozen et al. 1977, 1980, 1984; Goldman et al. 1982, 1988a; Patrick et al. 1976, 1977; Lee et al. 1985, 1987a,b; Tierney et al. 1985) (listed in Table 1) were collected on all patients from the emergency department record. The manner in which the presence or absence of infarction was determined was also documented from the inpatient record. The presence of infarction was confirmed as reported elsewhere (Goldman et al. 1988a).

The input patterns were generated by a specially written program that coded most of the clinical input variables in a binary manner such that 1 equalled the presence of a finding and 0 the absence of a finding. Patient age, blood pressure, pulse, and pain intensity were coded as analog values between 0.0 and 1.0. The target value for the output was coded as 0 for the subsequently confirmed absence of acute myocardial infarction and 1 for the confirmed presence of infarction.
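The coding scheme just described can be sketched in Python as follows. This is a minimal illustration, not the specially written program mentioned in the text: the variable names, the `ANALOG_RANGES` values, and the use of min-max scaling are all assumptions, since the text does not say how the analog values were mapped into the range 0.0 to 1.0.

```python
# Hypothetical measurement ranges, chosen only for illustration.
ANALOG_RANGES = {
    "age": (0.0, 100.0),
    "systolic_bp": (0.0, 300.0),
}


def scale_to_unit(value, low, high):
    """Min-max scale a measurement into [0.0, 1.0], clipping out-of-range values."""
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)


def encode_pattern(binary_findings, analog_findings, binary_order, analog_order):
    """Build one input pattern: binary findings first, then scaled analog values."""
    pattern = [1.0 if binary_findings.get(name, False) else 0.0
               for name in binary_order]
    for name in analog_order:
        low, high = ANALOG_RANGES[name]
        pattern.append(scale_to_unit(analog_findings[name], low, high))
    return pattern


def encode_target(infarction_confirmed):
    """Target is 1 for confirmed infarction and 0 for confirmed absence."""
    return 1.0 if infarction_confirmed else 0.0


# A made-up patient record, not data from the study.
example_pattern = encode_pattern(
    {"past_ami": True},                    # findings not listed count as absent
    {"age": 50.0, "systolic_bp": 150.0},
    binary_order=["past_ami", "diaphoresis"],
    analog_order=["age", "systolic_bp"],
)
```

Fixing the order of the variables in `binary_order` and `analog_order` keeps every pattern aligned with the same input units, which any such encoder would need regardless of the scaling actually used.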
History: Age*, Sex*, Location of pain*, Pain pleuritic, Similar to past AMI, Intensity of pain, Duration of pain, Radiation of pain, Response to pressure, Response to nitroglycerin*, Nausea and vomiting*, Diaphoresis*, Syncope*, Shortness of breath*, Palpitations*

Past history: Past AMI*, Angina*, Congestive heart failure, Diabetes*, Hypertension*, Family history AMI, High cholesterol, Coffee, Cigarettes

Examination: Systolic BP, Diastolic BP, Pulse, Jugular venous distension*, Rales*, Third heart sound, Fourth heart sound, Edema

Electrocardiogram findings: 2 mm ST elevation*, 1 mm ST elevation*, ST depression*, T wave inversion*, Peaked T wave, Premature ventricular contractions, Heart block, Intraventricular conduction defect, Significant ischemic change*

Table 1: Input Variables. Variables marked with "*" were utilized in the final pattern sets.
To find a predictive set of input variables, different input pattern formats utilizing different numbers and combinations of the input variables were tested on networks that had as few as 5 to as many as 41 input units. To find a more optimal network architecture, different numbers of hidden units arranged in different numbers of layers were tested. A network with 20 inputs and 2 layers of 10 hidden units each, as depicted in Figure 1, utilizing the 20 clinical input variables noted in Table 1, was chosen on the basis of this analysis. Learning was monitored by following the total sum of squares (TSS) error over the pattern set (Rumelhart et al. 1986). Input patterns were presented to the network and learning epochs run
until the TSS ceased decreasing. The final weights derived from a training session were then saved for use in testing. Testing of a network was accomplished by using the weights derived from training and presenting the network with patterns to which it had not been exposed. Performance was scored as correct if the activation of the output unit was equal to or greater than 0.8 when the target was 1, or equal to or less than 0.2 when the target was 0. The output unit activation in this study was always either between 0 and 0.2 or between 0.8 and 1.0. Detection rate (sensitivity) was defined as the number of patients in a test population correctly diagnosed as having a disease divided by the total number in the test set with the disease. Specificity was defined as the number of patients in a test population correctly diagnosed as not having a disease divided by the total number in the test set without the disease; the false alarm rate is 1.0 - specificity.

3 Results
The network was trained utilizing a randomly chosen subset of patterns derived from the initial group of 356 patients. Half of the patients who had not sustained acute myocardial infarctions and half of the patients who had sustained infarctions were selected. The subset consisted of 118 patients who were diagnosed as not having sustained an infarction and 60 patients who were diagnosed as having sustained an infarction. The final TSS reached was 0.044 error per pattern. The network was then tested on the remaining 178 patients (118 noninfarction, 60 infarction) to which it had not been exposed. The network correctly diagnosed 55 of the 60 patients with infarction and 113 of the 118 patients without infarction. This process was repeated utilizing the second pattern set for network training and the first set for testing. The initial TSS achieved during training was 0.02 error per pattern. The network correctly diagnosed 56 of the 60 patients with acute myocardial infarction and 113 of the 118 patients without infarction on the test set. The summed results of the two test sets are given in Table 2. The network performed with a detection rate of 92% and a specificity of 96% (a false alarm rate of 4%).

            Noninfarction   Infarction
Correct     226             111
Incorrect   10              9

Table 2: All Patients. Detection rate (sensitivity), 92%; false alarm rate (1.0 - specificity), 4%.
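The rates in Table 2 follow directly from the four counts. The short Python sketch below reproduces that arithmetic and the 0.8/0.2 scoring criterion given in the Methods; the function and parameter names are illustrative and are not taken from the paper.

```python
def is_correct(activation, target, high=0.8, low=0.2):
    """Scoring rule from the Methods: at least 0.8 for target 1, at most 0.2 for target 0."""
    if target == 1:
        return activation >= high
    return activation <= low


def rates(correct_infarction, incorrect_infarction,
          correct_noninfarction, incorrect_noninfarction):
    """Return (sensitivity, specificity, false alarm rate) from a 2 x 2 table."""
    sensitivity = correct_infarction / (correct_infarction + incorrect_infarction)
    specificity = correct_noninfarction / (
        correct_noninfarction + incorrect_noninfarction)
    return sensitivity, specificity, 1.0 - specificity


# Counts from Table 2 (both test sets combined).
sensitivity, specificity, false_alarm_rate = rates(111, 9, 226, 10)
print(f"sensitivity {sensitivity:.3f}, specificity {specificity:.3f}, "
      f"false alarm rate {false_alarm_rate:.3f}")
```

With these counts the sensitivity is 111/120 and the specificity is 226/236, which correspond to the 92% and 96% quoted in the text and to a false alarm rate of about 4%.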
            Noninfarction   Infarction
Correct     47              44
Incorrect   4               7

Table 3: Infarction Patients without ST Elevation. Detection rate (sensitivity), 86%; false alarm rate (1.0 - specificity), 8%.
A significant number of patients who present to the emergency department who have sustained acute myocardial infarction have clear-cut electrocardiographic evidence of infarction. The real diagnostic challenge arises in those patients who have sustained infarction but do not have clear-cut evidence of infarction on their initial electrocardiogram. When patients with clear-cut evidence were omitted from one study that attempted to improve on physician diagnostic performance, detection fell significantly (Goldman et al. 1988b). To determine if this approach could effectively identify new information under the most challenging circumstances, the network was further trained and tested on those patients without clear-cut electrocardiographic evidence of acute myocardial infarction.

Fifty-two percent of the patients with a documented infarction had acute ST segment elevation on their initial electrocardiogram and none of the patients without infarction had this finding. To study the effect of eliminating such patients, the network was trained on a pattern set derived from half of the patients who sustained infarctions who did not have ST elevation on their initial electrocardiogram along with an equal number of randomly selected patients who had not sustained infarctions. The network was then tested on the second half of the patients who had sustained infarctions who did not have ST elevation on their initial electrocardiogram along with a randomly chosen equal number of patients from the group that had not sustained infarctions. As above, the process was then reversed utilizing the second pattern set for training and the first set for testing. The results are summarized in Table 3. The network performed with a detection rate of 86% and a specificity of 92% (a false alarm rate of 8%), indicating that network performance was not dependent on the presence of ST elevation on the initial electrocardiogram.
Eighty-three percent of the patients with a documented acute myocardial infarction had either acute ST elevation or new ischemic change on their initial electrocardiogram and none of the patients without infarction had this finding. To further study the effect of clear-cut electrocardiographic markers, a set of infarction patients whose initial electrocardiogram showed neither ST elevation nor new ischemic change was identified.
There were 20 such patients. These patients were combined with 20 randomly chosen patients who had not sustained infarction. Because of the small sample size, a leave-one-out strategy was used to test network performance. Input patterns for training were derived from 19 of the 20 patients who had sustained infarction who had neither ST elevation nor significant ischemic change on their initial electrocardiogram along with 20 randomly chosen patients who had not sustained infarction. Twenty such sets of training data were constructed by removing a different infarction patient in each set. The network was trained on each of these pattern sets and tested on the one infarction patient that had been removed. The network correctly identified 16 of the 20 patients with infarction (detection rate 80%), further indicating that network performance was not predominantly dependent on electrocardiographic markers of infarction.

4 Discussion
These data reveal that the artificial neural network had a detection rate of 92% and a false alarm rate of 4%, whereas the best previously reported performance had a detection rate of 88% and a false alarm rate of 26%. Although these results are encouraging, future studies will need to address some of the questions that were not fully answered in this study. The proof that the nonlinear artificial neural network identified and utilized new information rests on the improvement of diagnostic accuracy derived from comparisons to studies reported in the literature. The results reported here are, thus, compared to studies based on a different set of data. Valid comparisons between methodologies must use the same data sets. In addition, the physician performance on the data set studied herein was not determined and may have been better than that described in the literature. Absolute conclusions about comparative performance will need to be derived from the prospective study of this question. Further, the studies to which these data were compared evaluated all patients presenting to the emergency department with nontraumatic chest pain. This study analyzed data collected only from patients who were admitted to the coronary care unit. The consequence of this will require study.

The good performance afforded by the network deserves comment. Previously utilized statistical strategies have been based on one of three approaches: (1) tree structure rule-based interrelationships, (2) linear pattern matching, or (3) statistical probability calculations (Szolovits et al. 1988). All of these methods are heavily dependent on the consistency of input data for proper performance. One of the striking aspects about the presentation of most disease states is the lack of consistency in their presentation. This emanates from both vague and imprecise clinical histories as well as marked variations in the symptom clusters
and clinical findings with which identical disease processes can present. Decision modalities that are highly dependent on consistency of input to arrive at correct diagnostic closure will perform poorly in this setting. All of the other approaches to clinical decision-making alluded to above are based on a highly structured set of rules or statistical probability predictions that are dependent on the accuracy of input data. One possible reason for the good performance of the artificial neural network is that the nonlinear statistical analysis of the data performed tolerates a considerable amount of imprecise and incomplete input data. These networks appear to be able to cope with the subtle variations in the way disease processes present without making categorical decisions solely driven by these variations. The networks appear to be able to discover implicit higher order conditional dependencies in patterns that are not apparent at face value and to utilize these dependencies to derive generalized rules that are resistant to most minor input perturbations. Specifically, the network can shift from one set of input variables to another and still make accurate predictions based on the actual data at hand. It is this resistance to the perversion of accurate generalizations that enables the network to function more accurately in the clinical environment.

The network used input data that are routinely available to and utilized by physicians screening patients for the presence of acute myocardial infarction. The network simply discovered relationships in these data that are evidently not immediately apparent to physicians and was able to use these to come to a more accurate diagnostic closure. Because these relationships can be made explicit by studying the network weighting of input data, physicians could potentially utilize this information to make a more accurate diagnosis.
The actual possibility of this will depend on the complexity of the interrelationships defined by the network. It has been demonstrated that when networks with more than one hidden layer are required to achieve optimal training, the solutions are often distributed over multiple units and are difficult to identify (Weigend et al. 1990). However, if these relationships can be elucidated, the use of the network may be unnecessary. These observations must be validated by extending this study to a larger number of patients, followed by the prospective testing of the relationships identified by the network. If these results hold up to such scrutiny, the improvement in predictive accuracy could have a substantial impact on the reduction of health care costs. Furthermore, these techniques may be able to be extended to other clinical settings.
Acknowledgments
I thank Dr. David Zipser for his help with this study and Kathleen James for her help in the preparation of this manuscript.
References

De Roach, J. N. 1989. Neural networks - An artificial intelligence approach to the analysis of clinical data. Austral. Phys. Engineer. Sci. Med. 12, 100-106.
Goldman, L., Weinberg, M., Weisberg, M., Olshen, R., Cook, E. F., Sargent, R. K., Lamas, G. A., Dennis, C., Wilson, C., Deckelbaum, L., Fineberg, H., Stiratelli, R., and the Medical House Staffs at Yale-New Haven Hospital and Brigham and Women's Hospital. 1982. A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain. N. Engl. J. Med. 307, 588-596.
Goldman, L., Cook, E. F., Brand, D. A., Lee, T. H., Rouan, G. W., Weisberg, M. C., Acampora, D., Stasiulewicz, C., Walshon, J., Terranova, G., Gottlieb, L., Kobernick, M., Goldstein-Wayne, B., Copen, D., Daley, K., Brandt, A. A., Jones, D., Mellors, J., and Jakubowski, R. 1988a. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N. Engl. J. Med. 318, 797-803.
Goldman, L., Cook, E. F., Brand, D. A., Lee, T. H., and Rouan, G. W. 1988b. Letter to the editor. N. Engl. J. Med. 319, 792.
Holloway, C. A. 1979. Behavioral assumptions and limitations of decision analysis. In Decision Making Under Uncertainty: Models and Choices, C. A. Holloway, ed., pp. 436-455. Prentice-Hall, Englewood Cliffs, NJ.
Hudson, D. L., Cohen, M. E., and Anderson, M. F. 1988. Determination of testing efficacy in carcinoma of the lung using a neural network model. Symp. Comput. Applic. Med. Care Proc. 12, 251-255.
Lee, T. H., Cook, E. F., Weisberg, M., Sargent, R. K., Wilson, C., and Goldman, L. 1985. Acute chest pain in the emergency ward: Identification and examination of low-risk patients. Arch. Intern. Med. 145, 65-69.
Lee, T. H., Rouan, G. W., Weisberg, M. C., Brand, D. A., Cook, F., Acampora, D., Goldman, L., and the Chest Pain Study Group; Boston, MA; New Haven, Danbury, and Milford, CT; and Cincinnati, OH. 1987a. Sensitivity of routine clinical criteria for diagnosing myocardial infarction within 24 hours of hospitalization. Ann. Intern. Med. 106, 181-186.
Lee, T. H., Rouan, G. W., Weisberg, M. C., Brand, D. A., Acampora, D., Stasiulewicz, C., Walshon, J., Terranova, G., Gottlieb, L., Goldstein-Wayne, B., Copen, D., Daley, K., Brandt, A. A., Mellors, J., Jakubowski, R., Cook, E. F., and Goldman, L. 1987b. Clinical characteristics and natural history of patients with acute myocardial infarction sent home from the emergency room. Am. J. Cardiol. 60, 219-224.
Marconi, L., Scalia, F., Ridella, S., Arrigo, P., Mansi, C., and Mela, G. S. 1989. An application of back propagation to medical diagnosis. Symp. Comput. Applic. Med. Care Proc., in press.
McClelland, J. L., and Rumelhart, D. E., eds. 1988. Training hidden units. In Explorations in Parallel Distributed Processing, pp. 121-160. MIT Press, Cambridge, MA.
Moskowitz, A. J., Kuipers, B. J., and Kassirer, J. P. 1988. Dealing with uncertainty, risks, and trade-offs in clinical decisions. A cognitive science approach. Ann. Intern. Med. 108, 435-449.
Mulsant, G. H., and Servan-Schreiber, E. 1988. A connectionist approach to the diagnosis of dementia. Symp. Comput. Applic. Med. Care Proc. 12, 245-250.
Patrick, E. A., Margolin, G., Sanghvi, V., and Uthurusamy, R. 1976. Pattern recognition applied to early diagnosis of heart attacks. In Proceedings of the IEEE 1976 Systems, Man, and Cybernetics Conference, Washington, D.C., November 1-3, pp. 403-406.
Patrick, E. A., Margolin, G., Sanghvi, V., and Uthurusamy, R. 1977. Pattern recognition applied to early diagnosis of heart attacks. In Proceedings of the 1977 International Medical Information Processing Conference (MEDINFO), Toronto, August 9-12, pp. 203-207.
Pozen, M. W., Stechmiller, J. K., and Voigt, G. C. 1977. Prognostic efficacy of early clinical categorization of myocardial infarction patients. Circulation 56, 816-819.
Pozen, M. W., D'Agostino, R. B., Mitchell, J. B., Rosenfeld, D. M., Guglielmino, J. M., Schwartz, M. L., Teebagy, N., Valentine, J. M., and Hood, W. B. 1980. The usefulness of a predictive instrument to reduce inappropriate admissions to the coronary care unit. Ann. Intern. Med. 92, 238-242.
Pozen, M. W., D'Agostino, R. B., Selker, H. P., Sytkowski, P. A., and Hood, W. B., Jr. 1984. A predictive instrument to improve coronary-care-unit admission practices in acute ischemic heart disease: A prospective multicenter clinical trial. N. Engl. J. Med. 310, 1273-1278.
Reggia, J. A., and Tuhrim, S., eds. 1985. Computer Assisted Medical Decision Making. Computers in Medicine Series, Vol. 2. Springer-Verlag, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-364. MIT Press, Cambridge, MA.
Saito, K., and Nakano, R. 1988. Medical diagnostic expert system based on PDP model. In Proceedings of the International Joint Conference on Neural Networks, I, 255-262.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. 1988. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Symp. Comput. Applic. Med. Care Proc. 12, 261-265.
Szolovits, P., Patil, R. S., and Schwartz, W. B. 1988. Artificial intelligence in medical diagnosis. Ann. Intern. Med. 108, 80-87.
Tierney, M. W., Roth, B. I., Psaty, B., McHenry, R., Fitzgerald, J., Stump, D. L., Anderson, F. K., Ryder, K. W., McDonald, C. J., and Smith, D. M. 1985. Predictors of myocardial infarction in emergency room patients. Crit. Care Med. 13, 526-531.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. PDP Research Group Technical Report. Int. J. Neural Syst., submitted.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. Institute of Radio Engineers Western Electronic Show and Convention, Convention Record, Part 4, pp. 96-104.

Received 29 August 90; accepted 26 September 90.
Communicated by Fernando Pineda
An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories
Ronald J. Williams
Jing Peng
College of Computer Science, Northeastern University, Boston, MA 02115 USA

Neural Computation 2, 490-501 (1990) © 1990 Massachusetts Institute of Technology

A novel variant of the familiar backpropagation-through-time approach to training recurrent networks is described. This algorithm is intended to be used on arbitrary recurrent networks that run continually without ever being reset to an initial state, and it is specifically designed for computationally efficient computer implementation. This algorithm can be viewed as a cross between epochwise backpropagation through time, which is not appropriate for continually running networks, and the widely used on-line gradient approximation technique of truncated backpropagation through time.

1 Introduction

Artificial neural networks having feedback connections can implement a wide variety of dynamic systems. The problem of training such a network is the problem of finding a particular dynamic system from among a parameterized family of such systems that best fits the desired specification. This paper proposes a specific learning algorithm for temporal supervised learning tasks, in which the specification of desired behavior is in the form of specific examples of input and desired output trajectories. One example of such a task is sequence classification, where the input is the sequence to be classified and the desired output is the correct classification, which is to be produced at the end of the sequence. Another example is sequence production, in which the input is a constant pattern and the corresponding desired output is a time-varying sequence. More generally, both the input and desired output may be time varying.

A number of recurrent network learning algorithms have been investigated, many based on computation of the gradient of an error measure in weight space. Some of these use methods for computing this gradient exactly and some involve approximations designed to simplify the computation. Two specific approaches that use exact gradient computation methods are backpropagation through time (BPTT), to be described in somewhat more detail below, and real-time recurrent learning (RTRL). Both of these methods have been independently rediscovered by a number of investigators in the neural network field, and their origins can, in
Algorithm for Training Recurrent Networks
491
fact, be traced to a much earlier literature on optimal control of nonlinear dynamic systems. One reference for BPTT is Rumelhart et al. (1986), and a reference for RTRL is Williams and Zipser (1989a). A related algorithm is the recurrent backpropagation algorithm, developed independently by Almeida (1987) and by Pineda (1987), which can be viewed essentially as a computationally attractive special case of BPTT appropriate for situations when both the actual and desired trajectories consist of settling to a constant state. An extensive discussion of a number of gradient-based learning algorithms, including BPTT, RTRL, recurrent backpropagation, and other related algorithms can be found in Williams and Zipser (1990). Among the algorithms described in detail there is the particular one to be highlighted here. The algorithm to be described here has five important properties. First, it is an on-line algorithm, designed to be used to train a network while it runs; no "manual" state resets or segmentation of the training stream into epochs is required. Second, it is a general-purpose algorithm, intended to be used with networks having arbitrary recurrent connectivity; no special architectures are assumed. Third, it is designed to train networks to perform arbitrary time-varying behaviors; it is not restricted to settling or periodic behaviors. Fourth, it is designed for time-efficient implementation in the sense that it minimizes the amount of computation required per time tick as the network runs. Finally, it has been experimentally verified to work at least as well as other comparable algorithms in solving a number of recurrent network learning problems; that is, it finds such solutions at least as often as these other algorithms and it generally requires no more time steps of network operation to do so. It is important to point out, however, that one property not claimed for this algorithm is biological plausibility. 
The real aim of the work reported here is to provide to those wishing to experiment with adaptive recurrent networks an algorithm combining some of the most attractive features of algorithms currently in use. In particular, this new algorithm is designed to enjoy the computational efficiency of BPTT while retaining the on-line character of RTRL. As will be seen below, it is not a radically new algorithm at all but is really just a more efficient variant of a method already being used successfully in some recurrent network research circles.¹
¹In fact, since submitting this for publication, we have discovered that others have independently made use of the particular modification we describe here as well, although we know of no previously published accounts calling attention to this improved algorithm. Among those who have used it in some of their work are Mike Jordan (personal communication, 1990) and Mike Mozer (personal communication, 1990).
2 Formal Assumptions and Definitions
2.1 Network Architecture and Dynamics. For concreteness, we assume a network of semilinear units; it is straightforward to derive corresponding algorithms for a wide variety of alternative unit transfer functions. Also, we restrict attention here to a discrete-time formulation of the network dynamics. Let the network have $n$ units, with $m$ external input lines. Let $y(t)$ denote the $n$-tuple of outputs of the units in the network at time $t$, and let $x^{\mathrm{net}}(t)$ denote the $m$-tuple of external input signals to the network at time $t$. We also define $x(t)$ to be the $(m+n)$-tuple obtained by concatenating $x^{\mathrm{net}}(t)$ and $y(t)$ in some convenient fashion. To distinguish the components of $x$ representing unit outputs from those representing external input values where necessary, let $U$ denote the set of indices $k$ such that $x_k$, the $k$th component of $x$, is the output of a unit in the network, and let $I$ denote the set of indices $k$ for which $x_k$ is an external input. Furthermore, we assume that the indices on $y$ and $x^{\mathrm{net}}$ are chosen to correspond to those of $x$, so that
$$x_k(t) = \begin{cases} x_k^{\mathrm{net}}(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases} \qquad (2.1)$$
For example, in a computer implementation using zero-based array indexing, it is convenient to index units and input lines by integers in the range $[0, m+n)$, with indices in $[0, m)$ corresponding to input lines and indices in $[m, m+n)$ corresponding to units in the network. Note that one consequence of this notational convention is that $x_k(t)$ and $y_k(t)$ are two different names for the same quantity when $k \in U$. The general philosophy behind this use of notation is that variables symbolized by $x$ represent input and variables symbolized by $y$ represent output. Since the output of a unit may also serve as input to itself and other units, we will consistently use $x_k$ when its role as input is being emphasized and $y_k$ when its role as output is being emphasized. Furthermore, this naming convention is intended to apply both at the level of individual units and at the level of the entire network. Thus, from the point of view of the network, its input is denoted $x^{\mathrm{net}}$ and, had it been necessary for this exposition, we would have denoted its output by $y^{\mathrm{net}}$ and chosen its indexing to be consistent with that of $y$ and $x$.
Let $W$ denote the weight matrix for the network, with a unique weight between every pair of units and also from each input line to each unit. By adopting the indexing convention just described, we can incorporate all the weights into this single $n \times (m+n)$ matrix. The element $w_{ij}$ represents the weight on the connection to the $i$th unit from either the $j$th unit, if $j \in U$, or the $j$th input line, if $j \in I$. Furthermore, note that to accommodate a bias for each unit we simply include among the $m$ input lines one input whose value is always 1; the corresponding column of the weight matrix contains as its $i$th element the bias for unit $i$. In general, our naming convention dictates that we regard the weight $w_{ij}$ as having $x_j$ as its "presynaptic" signal and $y_i$ as its "postsynaptic" signal.
For the semilinear units used here it is convenient to also introduce for each $k$ the intermediate variable $s_k(t)$, which represents the net input to the $k$th unit at time $t$. Its value at time $t+1$ is computed in terms of both the state of and input to the network at time $t$ by
$$s_k(t+1) = \sum_{l \in U} w_{kl}\, y_l(t) + \sum_{l \in I} w_{kl}\, x_l^{\mathrm{net}}(t) = \sum_{l \in U \cup I} w_{kl}\, x_l(t). \qquad (2.2)$$
The output of such a unit at time $t+1$ is then expressed in terms of the net input by
$$y_k(t+1) = f_k[s_k(t+1)] \qquad (2.3)$$
where $f_k$ is the unit's squashing function. It is convenient in the algorithm descriptions given below to allow complete generality in the choice of squashing functions used (except for requiring them to be differentiable), but it is typical to let them all be the logistic function
$$f_k(s) = \frac{1}{1 + e^{-s}}.$$
In this case, the algorithms to be discussed below will make use of the fact that
$$f_k'[s_k(t)] = y_k(t)\,[1 - y_k(t)].$$
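As an illustration of the dynamics just defined, the following is a minimal sketch of one forward tick (equations 2.2 and 2.3 with logistic units), assuming the zero-based layout described above. The names `forward_step` and `logistic_prime_from_output`, and the small weight matrix, are hypothetical and chosen only for this example; row `r` of `W` stands for the unit whose index in $x$ is `m + r`.

```python
import math

def logistic(s):
    """Logistic squashing function f(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def logistic_prime_from_output(y):
    """Derivative of the logistic function written in terms of its output."""
    return y * (1.0 - y)

def forward_step(W, x_net, y):
    """Advance the network by one tick.

    W     : n rows of length m + n; columns 0..m-1 are input lines and
            columns m..m+n-1 are unit outputs (assumed layout)
    x_net : the m external inputs at time t
    y     : the n unit outputs at time t
    Returns (s_next, y_next): net inputs and outputs at time t + 1.
    """
    x = list(x_net) + list(y)                      # the (m + n)-tuple x(t)
    s_next = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    y_next = [logistic(s) for s in s_next]
    return s_next, y_next

# Hypothetical example: m = 2 input lines (line 0 is a constant bias of 1), n = 2 units.
W = [[0.0, 1.0, 0.0, 0.0],     # first unit: driven only by input line 1
     [0.5, 0.0, 2.0, 0.0]]     # second unit: bias 0.5 plus 2 * output of first unit
s1, y1 = forward_step(W, x_net=[1.0, 0.0], y=[0.5, 0.5])
```

Because the inputs and outputs at time $t$ only affect outputs at time $t+1$, every connection in this sketch carries the one-time-step delay noted in the text.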
The system of equations 2.2 and 2.3, where $k$ ranges over $U$, constitutes the entire discrete-time dynamics of the network, where the $x_k$ values are defined by equation 2.1. Note that the external input at time $t$ does not influence the output of any unit until time $t+1$. We are thus treating every connection as having a one-time-step delay.
2.2 Network Performance Measure. Assume that the task to be performed by the network is a sequential supervised learning task, meaning that certain of the units' output values are to match specified target values at specified times. Let $T(t)$ denote the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ that the output of the $k$th unit should match at time $t$. Then define a time-varying $n$-tuple $e$ by
$$e_k(t) = \begin{cases} d_k(t) - y_k(t) & \text{if } k \in T(t) \\ 0 & \text{otherwise.} \end{cases}$$
Note that this formulation allows for the possibility that target values are specified for different units at different times. Now let
$$E(t) = \frac{1}{2} \sum_{k \in U} [e_k(t)]^2$$
denote the overall network error at time $t$. A natural objective of learning is to minimize the total error
$$E^{\mathrm{total}}(t', t) = \sum_{\tau = t'+1}^{t} E(\tau)$$
over some appropriate time period $(t', t]$. The gradient of this quantity in weight space is, of course,
$$\nabla_W E^{\mathrm{total}}(t', t) = \sum_{\tau = t'+1}^{t} \nabla_W E(\tau).$$
The point of computing this error gradient is to use it to adjust the weights. One natural way to make these weight changes is along a constant negative multiple of this gradient, so that
$$\Delta w_{ij} = -\alpha\, \frac{\partial E^{\mathrm{total}}(t', t)}{\partial w_{ij}},$$
where $\alpha$ is a positive learning rate parameter. We limit our discussion here to algorithms having this particular form, although other ways of incorporating this error gradient information are also compatible with the computational strategies we describe.
3 Some Related Approaches
The algorithm to be described can be viewed as a cross between two familiar algorithms based on the backpropagation-through-time approach to computing the error gradient in weight space. In fact, this algorithm reduces to these two algorithms in two extreme cases. Before presenting the new algorithm, we first review these more familiar algorithms.
3.1 Epochwise Backpropagation Through Time. The backpropagation-through-time approach can be derived by unfolding the temporal operation of a network into a multilayer feedforward network that grows by one layer on each time step. If the training stream is segmented into epochs, then one can derive the specific version that we will call epochwise backpropagation through time. This algorithm is organized as follows. With $t_0$ denoting the start time of the epoch and $t_1$ denoting its end time, the objective is to compute the gradient of $E^{\mathrm{total}}(t_0, t_1)$. This is done by first letting the network run through the interval $[t_0, t_1]$ and saving the entire history of inputs to the network, network state, and target vectors over this interval. Then a single backward pass over this history buffer is performed to compute the set of values $\delta_k(\tau) = -\partial E^{\mathrm{total}}(t_0, t_1)/\partial s_k(\tau)$, for all $k \in U$ and $\tau \in (t_0, t_1]$, by means of the equations
$$\delta_k(\tau) = \begin{cases} f_k'[s_k(\tau)]\, e_k(\tau) & \text{if } \tau = t_1 \\ f_k'[s_k(\tau)] \left[ e_k(\tau) + \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) \right] & \text{if } t_0 < \tau < t_1. \end{cases}$$
This can be viewed as representing the familiar backpropagation computation applied to a feedforward network in which target values are specified for units in many layers, not just the last. The process begins at the last time step and proceeds to earlier time steps through the repeated use of these equations. When describing this algorithm it is helpful to speak of injecting error at time $\tau$ to mean the computational step of adding $e_k(\tau)$ for each $k$ to the appropriate sum when computing the bracketed expression in the equation for $\delta_k(\tau)$. We also consider the computation of $\delta_k(t_1)$ to involve a corresponding injection of error; in this case, it can be viewed as adding $e_k(t_1)$ to 0. Once the backpropagation computation has been performed back to time $t_0 + 1$, the weight changes may be made along the negative gradient of overall error by means of the equations
$$\Delta w_{ij} = \alpha \sum_{\tau = t_0 + 1}^{t_1} \delta_i(\tau)\, x_j(\tau - 1).$$
When a network having $n$ units and $O(n^2)$ weights is run over a single epoch of $h$ time steps, it is easy to see that this algorithm requires the storage of $O(nh)$ real numbers and performs $O(n^2 h)$ arithmetic operations during the backward pass, together with another $O(n^2 h)$ operations to compute the weight updates.
3.2 Truncated Backpropagation Through Time. If the training stream is not segmented into independent epochs, one can still consider using BPTT. The idea is to compute the negative gradient of $E(t)$ and make the appropriate weight changes at each time $t$ while the network continues to run. To do this requires saving the entire history of network input and network state since the starting time $t_0$. Then, for each fixed $t$, a set of values $\delta_k(\tau) = -\partial E(t)/\partial s_k(\tau)$ for all $k \in U$ and $\tau \in (t_0, t]$ is computed by means of the equations
$$\delta_k(\tau) = \begin{cases} f_k'[s_k(\tau)]\, e_k(\tau) & \text{if } \tau = t \\ f_k'[s_k(\tau)] \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) & \text{if } t_0 < \tau < t. \end{cases} \qquad (3.1)$$
The process begins at the most recent time step by injecting error there, but, unlike epochwise BPTT, error is not injected for any earlier time steps. This is why earlier target values need not be saved for this algorithm. Once the backpropagation computation has been performed back to time $t_0 + 1$, the weight changes may be made along the negative gradient of $E(t)$ by means of the equations
$$\Delta w_{ij} = \alpha \sum_{\tau = t_0 + 1}^{t} \delta_i(\tau)\, x_j(\tau - 1). \qquad (3.2)$$
This algorithm, which can be called real-time backpropagation through time, is just a particular way of organizing the computation of the gradient of $E(t)$ in weight space. Since this algorithm involves computation time and storage that grow linearly with time as the network runs, no
one would seriously consider using this particular algorithm for on-line learning. An alternative strategy for computing this same quantity that requires strictly bounded memory and computation time at every time step is given by RTRL (Williams and Zipser 1989a). In a network having $n$ units and $O(n^2)$ weights, RTRL requires the storage of $O(n^3)$ real numbers and the performance of $O(n^4)$ arithmetic operations per time step. Although RTRL is clearly more attractive than real-time BPTT for on-line training of recurrent networks, its computational complexity limits the size of networks to which it can be comfortably applied unless some form of heuristic simplification, such as that proposed by Zipser (1989), is also employed.
A different strategy for keeping the computational requirements bounded on every time step is to use a bounded-history approximation to real-time BPTT in which relevant information is saved for a fixed number $h$ of time steps and any information older than that is forgotten. In general, this should be regarded as a heuristic technique for simplifying the computation, although, as discussed below, it can sometimes serve as an adequate approximation to the true gradient and may also be more appropriate in those situations where weights are adjusted as the network runs. Let us call this algorithm truncated backpropagation through time. With $h$ representing the number of prior time steps saved, this algorithm will be denoted BPTT(h). For BPTT(h), one computes, at each time $t$, the values $\delta_k(\tau)$ for all $k \in U$, but only for $\tau \in (t - h, t]$, using the same equations 3.1 as before. After these have been computed, weight changes are made using
$$\Delta w_{ij} = \alpha \sum_{\tau = t - h + 1}^{t} \delta_i(\tau)\, x_j(\tau - 1),$$
which is, of course, just equation 3.2 with all terms for which $\tau \le t - h$ taken to be 0. In a network of $n$ units and $O(n^2)$ weights, this algorithm clearly requires $O(nh)$ storage and, at each time step, requires $O(n^2 h)$ arithmetic operations for the backward pass through the history buffer and another $O(n^2 h)$ operations to compute the weight updates.² When compared with RTRL, BPTT(h) with any reasonably small $h$ is clearly a much more efficient on-line algorithm in terms of the computational effort required per time step.
An extreme example of this truncation strategy is found in the algorithm explored by Cleeremans et al. (1989), which combines the use of BPTT(1) in the recurrent portion of a particular network architecture with feedforward backpropagation in the remainder of the network.
²When the weights are adjusted on every time step, it may be appropriate to store all the past weights as well, using these on the backward pass rather than the current weights. This variant obviously requires $O(n^2 h)$ storage. In practice, we have not found much difference in performance between this version and the simpler one described here.
4 The Improved Algorithm
Now consider a problem in which either BPTT(h) or epochwise BPTT might be used, say a problem involving training over a single epoch which is much longer than $h$. Epochwise BPTT will, of course, require more storage, but it is clear that it will still only require an average of $O(n^2)$ arithmetic operations per time step, compared to $O(n^2 h)$ for BPTT(h). Thus, in terms of execution time, BPTT(h) is slower than epochwise BPTT by essentially a factor of $h$. Since $h$ must generally be chosen sufficiently large to permit a particular task to be learned, it is not an option to reduce $h$ to speed up the execution time.³ One may thus ask whether there are any effective general-purpose on-line algorithms having the same execution time as epochwise BPTT.
As it turns out, there is a class of on-line algorithms whose average computational effort expended per time step asymptotically approaches that of epochwise BPTT. This class of algorithms is obtained by combining aspects of epochwise BPTT with the truncated BPTT approach. Note that in BPTT(h) a backward pass through the most recent $h$ time steps is performed anew each time the network is run through an additional time step. To generalize this, one may consider letting the network run through $h'$ additional time steps before performing the next BPTT computation, where $h' \le h$. In this case, if $t$ represents a time at which BPTT is to be performed, the algorithm computes an approximation to $\nabla_W E^{\mathrm{total}}(t - h', t)$ by taking into account only that part of the history over the interval $[t - h, t]$. Specifically, this involves performing a backward pass at time $t$ to compute the set of values $\delta_k(\tau)$ for all $k \in U$ and $\tau \in (t - h, t]$, just as with BPTT(h), but this time using the equations
$$\delta_k(\tau) = \begin{cases} f_k'[s_k(\tau)]\, e_k(\tau) & \text{if } \tau = t \\ f_k'[s_k(\tau)] \left[ e_k(\tau) + \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) \right] & \text{if } t - h' < \tau < t \\ f_k'[s_k(\tau)] \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) & \text{if } t - h < \tau \le t - h'. \end{cases}$$
After this backward pass has been performed, weight updates can be computed using
$$\Delta w_{ij} = \alpha \sum_{\tau = t - h + 1}^{t} \delta_i(\tau)\, x_j(\tau - 1),$$
just as with BPTT(h).
³In particular, $h$ must generally be chosen to be roughly as large as typically encountered delays between input and corresponding output in the training stream.
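The backward pass of BPTT(h;h') can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: units are logistic, the weights are held fixed over the buffer, the history buffer is a list of `h` entries (oldest first) of the form `(x(tau - 1), y(tau), e(tau))`, and the names `step_and_record` and `bptt_backward` are hypothetical. Error is injected only for the `h_prime` most recent steps, so `h_prime = 1` gives BPTT(h) and `h_prime = h` gives an epochwise pass over the buffer.

```python
import math

def step_and_record(W, x_net, y, target):
    """Run one tick; return (y_next, entry), where entry is what the history
    buffer keeps for time tau = t + 1: (x(tau - 1), y(tau), e(tau)).
    target[k] is None when unit k has no target at that time (e_k = 0)."""
    x_prev = list(x_net) + list(y)
    y_next = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x_prev))))
              for row in W]
    e = [0.0 if d is None else d - out for d, out in zip(target, y_next)]
    return y_next, (x_prev, y_next, e)

def bptt_backward(W, m, buffer, h_prime, alpha):
    """Weight changes for one BPTT(h; h') backward pass, with h = len(buffer).

    W is n x (m + n); column m + k carries the output of unit k.
    """
    n = len(W)
    dW = [[0.0] * len(W[0]) for _ in range(n)]
    delta_next = [0.0] * n                 # delta(tau + 1); zero beyond time t
    for age, (x_prev, y, e) in enumerate(reversed(buffer)):
        inject = age < h_prime             # age 0 is tau = t, age 1 is t - 1, ...
        delta = []
        for k in range(n):
            back = sum(W[l][m + k] * delta_next[l] for l in range(n))
            err = e[k] if inject else 0.0
            delta.append(y[k] * (1.0 - y[k]) * (err + back))   # logistic f'
        for i in range(n):
            for j in range(len(x_prev)):
                dW[i][j] += alpha * delta[i] * x_prev[j]
        delta_next = delta
    return dW

# Hypothetical usage: m = 1 (constant bias input), n = 2, buffer of h = 3 steps.
W0 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
y, buf = [0.3, 0.7], []
for d in ([0.2, 0.8], [0.9, None], [0.5, 0.5]):
    y, entry = step_and_record(W0, [1.0], y, d)
    buf.append(entry)
dW_example = bptt_backward(W0, m=1, buffer=buf, h_prime=2, alpha=0.1)
```

The sketch uses the current weights on the whole backward pass, which corresponds to the simpler of the two variants mentioned in the footnote on storing past weights.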
The key feature of this algorithm is that the next backward pass is not performed until time step $t + h'$; in the intervening time the history of network input, network state, and target values is saved in the history buffer, but no processing is performed on these data. Let us denote this algorithm BPTT(h;h'). Clearly BPTT(h) is the same as BPTT(h;1), and BPTT(h;h) is the epochwise BPTT algorithm. The storage requirements of this algorithm are essentially the same as those of BPTT(h), except that it requires the additional storage of $h' - 1$ sets of prior target values. However, because it computes the cumulative error gradient by means of BPTT only once every $h'$ time steps, its average time complexity per time step is reduced by a factor of $h'$. It thus requires $O(nh)$ space and requires performing an average of $O(n^2 h/h')$ operations per time step in a network having $n$ units and $O(n^2)$ weights. Furthermore, it is clear that making $h/h'$ small makes the algorithm more efficient. At the same time, to obtain a reasonably close approximation to the true gradient (or at least to what would be obtained through the use of truncated BPTT), it is only necessary to make the difference $h - h'$ sufficiently large. This is because no error is injected for the earliest $h - h'$ time steps in the buffer. Thus a practical and highly efficient on-line algorithm for recurrent networks is obtained by choosing $h$ and $h'$ so that $h - h'$ is large enough that a reasonable approximation to the true gradient is obtained and so that $h/h'$ is reasonably close to 1.
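The scheduling that produces this amortized cost can be illustrated with a small counting sketch. It assumes a bounded buffer implemented with `collections.deque` and a backward pass triggered every `h_prime` ticks; the function name `schedule` and the use of the tick number as a stand-in for the saved data are hypothetical. It only counts how many buffered time steps each backward pass would traverse, which is the quantity that scales as $h/h'$ per tick.

```python
from collections import deque

def schedule(total_ticks, h, h_prime):
    """Return (ticks at which a backward pass fires, total backward steps)."""
    assert 1 <= h_prime <= h
    buffer = deque(maxlen=h)          # keeps only the h most recent time steps
    fired, backward_steps = [], 0
    for t in range(1, total_ticks + 1):
        buffer.append(t)              # stand-in for (input, state, targets) at t
        if t % h_prime == 0:          # one backward pass every h' ticks
            fired.append(t)
            backward_steps += len(buffer)
    return fired, backward_steps

fired, steps = schedule(total_ticks=64, h=16, h_prime=8)        # BPTT(16;8)
_, steps_every_tick = schedule(total_ticks=64, h=16, h_prime=1)  # BPTT(16)
```

Over a long run the ratio `steps / total_ticks` approaches `h / h_prime`, whereas with `h_prime = 1` it approaches `h`; the early passes traverse fewer steps only because the buffer has not yet filled.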
5 Experimental Performance
An important question to be addressed in studies of recurrent network learning algorithms, whatever other constraints to which they must conform, is how much total computational effort must be expended to achieve the desired performance. Although BPTT(h;h') has been shown to require less average computational effort per time step than comparable general-purpose on-line recurrent network learning algorithms, it is equally important to determine how it compares with these other algorithms in terms of the number of time steps required and success rate obtained when training particular networks to perform particular tasks. Any speed gain from performing a simplified computation on each time step is of little interest unless it allows successful training without inordinately prolonging the training time.
A number of experiments have been performed to compare the performance of the improved version of truncated BPTT with its unmodified counterpart. In particular, studies were performed in which BPTT(2h;h) was compared with BPTT(h) on specific tasks. The results of these experiments have been that the success rate of BPTT(2h;h) is essentially identical to that of BPTT(h), with the number of time ticks required to find a solution comparable for both algorithms. Thus, in these experiments, the actual running time was significantly reduced when BPTT(2h;h) was
used, with the amount of speedup essentially equal to that predicted by an analysis of the computational requirements per time step. Among the tasks studied were several described in Williams and Zipser (1989a) and elaborated on in Williams and Zipser (1989b). One noteworthy example is the "Turing machine" task, involving networks having 12-15 units. In earlier experiments, reported in Williams and Zipser (1990), it had been found that BPTT(9) gave a factor of 28 speedup in running time over RTRL on this same task, with success rate at least as high. Using BPTT(16;8) gave an additional factor of 2 speedup,⁴ making BPTT(16;8) well over 50 times faster than RTRL on this task.
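The arithmetic behind these figures can be checked directly. The sketch below multiplies the two reported speedups and, separately, compares backward-pass costs using only the $h/h'$ scaling from Section 4. The helper name `backward_cost_per_tick` is hypothetical, and the ratios ignore the fixed forward-pass overhead, so they are rough upper estimates rather than predicted running-time gains.

```python
def backward_cost_per_tick(h, h_prime):
    """Average backward-pass steps per tick for BPTT(h; h'), up to the O(n^2) factor."""
    return h / h_prime

# Reported factors from the text.
bptt9_vs_rtrl = 28            # BPTT(9) over RTRL
bptt16_8_vs_bptt9 = 2         # additional gain from BPTT(16;8)
total_vs_rtrl = bptt9_vs_rtrl * bptt16_8_vs_bptt9

# Backward-pass-only ratios implied by h / h' (forward overhead ignored).
gain_over_bptt_9_1 = backward_cost_per_tick(9, 1) / backward_cost_per_tick(16, 8)
gain_over_bptt_9_2 = backward_cost_per_tick(9, 2) / backward_cost_per_tick(16, 8)
```

The second ratio corresponds to the situation described in the accompanying footnote, where targets on every other time step make BPTT(9) behave like BPTT(9;2) and roughly halve the expected gain.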
6 Discussion
Note that as the algorithm has been described here, it actually involves a very uneven rate of use of computational resources per time step. On many time steps, no significant computation is performed, while at certain times much higher peak computation than the average is required. For simulation studies, this is of no consequence; the only noticeable effect of this will be that simulated time runs unevenly. However, one might also consider applying recurrent network learning algorithms to real-time signal processing or control problems, in which it is important to spread the computation evenly over the individual time steps. In fact, this algorithm can be implemented in such a fashion. The idea is to interleave the backpropagation computation with the forward computation. For example, if $h = 2h'$, one should backpropagate through 2 time steps during each single (external) time step. In general, this requires a buffer that can hold the appropriate data for a total of $h + h'$ time steps.
It is also appropriate to make some additional observations concerning the degree of approximation involved when the backpropagation computation is truncated to $h$ prior time steps [whether using BPTT(h) or BPTT(h;h')]. If weights were held constant over the history of operation of the network, the true gradient would be that computed by backpropagating all the way back to the time of network initialization. Even in this case, however, it may well be that this computation undergoes exponential decay over (backward) time, in which case the difference between truncating or not can be negligible.⁵ In situations when there is no such exponential decay, though, truncation may give a very poor approximation. However, if weights are adjusted as the network operates, as they necessarily must be for any on-line algorithm, use of a gradient computation method based on the assumption that the weights are fixed over all past time involves a different kind of approximation that can actually be mitigated by ignoring dependencies into the distant past, as occurs when using truncated BPTT. Such information involving the distant past is also present in RTRL, albeit implicitly, and Gherrity (1989) has specifically addressed this issue by incorporating into his continuous-time version of RTRL an exponential decay on the contributions from past times. Unlike the truncation strategy, however, this does not reduce the computational complexity of the algorithm. Another potential benefit of the truncation strategy is that it can help provide a useful inductive bias by forcing the learning system to consider only reasonably short-term correlations between input and desired output; of course, this is appropriate only when these short-term correlations are actually the important ones, as is often the case.
We believe that some combination of these effects is responsible for our experimentally observed result that not only is it detrimental to have too short a history buffer length, it can also be detrimental to have too long a history buffer. Studies reported in Williams and Zipser (1990), in which RTRL was compared with BPTT(h), are also consistent with this result. These other studies found that BPTT(9) not only ran faster than RTRL on the "Turing machine" task, as described above, but it also had a higher success rate. Since RTRL implicitly makes use of information over the entire history of operation of the network, this can be taken as evidence of the benefit of ignoring all but the recent past.
⁴Careful analysis of the computational requirements of BPTT(9) and of BPTT(16;8), taking into account the fixed overhead of running the network in the forward direction that must be borne by any algorithm, would suggest that one should expect about a factor of 4 speedup when using BPTT(16;8). Because this particular task has targets only on every other time step, the use of BPTT(9) here really amounts to using BPTT(9;2), which therefore reduces the speed gain by essentially one factor of 2.
⁵In general, whether such exponential decay occurs depends on the nature of the forward dynamics of the network; for example, one would expect it to occur when the network is operating in a basin of attraction of a stable fixed point.
Finally, we should point out that although we have focused on the use of discrete time here, the algorithm described by Pearlmutter (1989), which can be viewed as a continuous-time version of epochwise BPTT, can serve as the basis for a correspondingly efficient continuous-time version of the on-line algorithm presented here. In this regard, it is useful to consider the effect of using BPTT(h;h') on discrete-time networks arising from Euler discretization of continuous-time networks. As the step size in the Euler discretization is made smaller, it makes sense for the buffer size $h$ to be made proportionately larger to span the same continuous-time interval, and, of course, this means incurring greater storage overhead. However, also making $h'$ proportionately larger causes the weight updates to occur at the same continuous-time rate, and, more significantly, means that the total computation per discrete time step is not increased. Of course this does lead to more computation per interval of continuous time, just as in the computation of the forward propagation of activation in the network. The point is that this cost is not incurred doubly, as it would be if BPTT(h) were to be used.
Acknowledgments
This research was partially supported by Grant IN-8921275 from the National Science Foundation.
References
Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. Proc. IEEE First Int. Conf. Neural Networks, II, 609-618.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. 1989. Finite-state automata and simple recurrent networks. Neural Comp. 1, 372-381.
Gherrity, M. 1989. A learning algorithm for analog, fully recurrent neural networks. Proc. Int. Joint Conference Neural Networks, I, 643-644.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. J. 1987. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds. MIT Press/Bradford Books, Cambridge, MA.
Williams, R. J., and Zipser, D. 1989a. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Williams, R. J., and Zipser, D. 1989b. Experimental analysis of the real-time recurrent learning algorithm. Connection Sci. 1, 87-111.
Williams, R. J., and Zipser, D. 1990. Gradient-based learning algorithms for recurrent connectionist networks. Tech. Rep. NU-CCS-90-9. Northeastern University, College of Computer Science, Boston.
Zipser, D. 1989. A subgrouping strategy that reduces complexity and speeds up learning in recurrent networks. Neural Comp. 1, 552-558.
Received 11 June 90; accepted 14 August 90.
Communicated by Gail Carpenter
Convergence Properties of Learning in ART1 Michael Georgiopoulos Department of Electrical Engineering, University of Central Florida, Orlando, FL 32816 USA Gregory L. Heileman Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87132 USA Juxin Huang Department of Electrical Engineering, University of Central Florida, Orlando, FL 32816 USA
We consider the ART1 neural network architecture. It is shown that in the fast learning case, an ART1 network that is repeatedly presented with an arbitrary list of binary input patterns self-stabilizes the recognition code of every size-$l$ pattern in at most $l$ list presentations.
1 Introduction
A neural network architecture for the learning of recognition categories was derived by Carpenter and Grossberg (1987). This architecture was termed ART1 in reference to the adaptive resonance theory introduced by Grossberg (1976). It was shown in Carpenter and Grossberg (1987) that ART1 self-organizes and self-stabilizes its recognition codes in response to arbitrary orderings of arbitrarily many and arbitrarily complex binary input patterns. In this paper, only the fast learning case is considered. We show that if ART1 is repeatedly presented with a list of binary input patterns it self-stabilizes the recognition code of every size-$l$ pattern in at most $l$ list presentations. (A size-$l$ input pattern is a binary vector containing $l$ components with value one and the remaining components with value zero.) An immediate consequence of this result is that if the input patterns in the input list can be represented by binary vectors of dimensionality $M$, with the size-0 and size-$M$ vectors excluded from the list (one of our modeling assumptions in Section 2), then ART1 learns and recognizes the list in at most $M - 1$ list presentations. This result is valid independent of the ordering with which the input patterns are presented within the list.
Neural Computation 2, 502-509 (1990) © 1990 Massachusetts Institute of Technology
In short, this paper provides useful upper bounds on the number of list presentations required to learn a list of input patterns presented repeatedly to ART1. The modeling assumptions are presented in Section 2. In the same section the tightness of the upper bounds is examined through two extreme examples. In Section 3, the results are stated and proven. Concluding remarks are contained in Section 4.
2 Model - Preliminaries
A complete description of ART1 and the theorems that give insight into its operation are provided in Carpenter and Grossberg (1987). An ART1 network consists of two layers of neurons (nodes), called the $F_1$ and $F_2$ layers. Input patterns are presented at the $F_1$ layer. Every node in the $F_1$ layer is connected via bottom-up traces to all of the nodes in the $F_2$ layer. Every node in the $F_2$ layer is likewise connected to all of the nodes in the $F_1$ layer via top-down traces. The results of this paper are proven under the following assumptions:
A1: All hypotheses of Section 18 in Carpenter and Grossberg (1987) hold (one of these hypotheses is that fast learning occurs)
A2: $L - 1 \le |I|^{-1}$
A3: $1 \le |I| \le M - 1$
A4: $F_2$ has enough nodes to code all the patterns at every presentation of the input list
where $|I|$ is the size of an arbitrary pattern $I$ in the input list, $M$ is the number of nodes in the $F_1$ layer, and $L$ is a parameter associated with the adaptation of bottom-up and top-down traces in the ART1 neural network architecture. The top-down traces that emanate from a node in the $F_2$ layer are called templates. Assume that a pattern $I$ which belongs to a list of binary input patterns is presented to ART1. Furthermore, assume that at the $n$th presentation of the list, pattern $I$ activates some node $k$ in the $F_2$ layer and furthermore $k$ codes $I$. We denote by $V_{k,n}^{I}$ the template that corresponds to node $k$ after $k$ has learned $I$. We say that $I$ is coded by $V_{k,n}^{I}$ or that $V_{k,n}^{I}$ has coded the pattern $I$; $V_{k,n}^{I}$ is referred to as a learned template. To prove our results, the templates of ART1 need to be considered either prior to a pattern's presentation, or after a template has coded a pattern. For the purposes of the results discussed in this paper, the ART1 templates can always be thought of as binary vectors. Actually, when the top-down trace of a template is taken as either zero or one it means that the trace is either small enough or large enough to satisfy the 2/3 Rule of the ART1 network (for more details see Carpenter and Grossberg 1987). Consider a pattern $I$ in the list and a template $V$ corresponding to an $F_2$ node. There is a one-to-one correspondence between the components
of the binary vectors $I$ and $V$. A component of $I$ corresponds to a component of $V$ if both of them activate the same F1 node. We define, as in Carpenter and Grossberg (1987), three types of learned templates with respect to an input pattern $I$: subset templates, superset templates, and mixed templates. The components of a subset template $V$ satisfy $V \subset I$: they are one only at a subset of the corresponding $I$ components that are one. The components of a superset template $V$ satisfy $V \supset I$: they are one at all the corresponding components of $I$ that are one, as well as at some components of $I$ that are zero. The components of a mixed template $V$ are one at some, but not all, of the corresponding $I$ components that are one, as well as at some of the components of $I$ that are zero. In this case, the set of the $V$ components that are one is neither a subset nor a superset of the set of the $I$ components that are one. Sometimes it is convenient to refer to a pattern $I$ as being a subset, superset, or mixed pattern with respect to a template $V$ if $I \subset V$, $I \supset V$, or $V$ is a mixed template with respect to $I$. Besides the learned templates described above, we also define a template $V$ to be an uncommitted template if it corresponds to a node that has not coded any pattern yet. We assume that the components of an uncommitted template are all ones. Since an input pattern $I$ is a binary vector and a template $V$ can be thought of as a binary vector, we denote by $|I|$ and $|V|$ the sizes of the binary vectors $I$ and $V$, respectively, that is, the number of their components that are one. We also define a template $V$ to be a stable template if and only if, after its creation, it cannot be destroyed by future pattern presentations. We say, as in Carpenter and Grossberg (1987), that a pattern $I$ has direct access to template $V$ if presentation of $I$ leads at once to activation of the F2 node with corresponding template $V$, and this template codes $I$ on that trial.
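The template types defined above can be made concrete with a short sketch. This is only an illustration, not code from the paper: the function names `size`, `intersect`, and `template_type` are hypothetical, binary vectors are assumed to be tuples of 0/1 of equal length, and the case $V = I$ is reported separately as "equal" (a choice made here for readability).

```python
def size(v):
    """Size of a binary vector: the number of components equal to one."""
    return sum(v)


def intersect(i, v):
    """Binary vector with ones only where both i and v are one."""
    return tuple(a & b for a, b in zip(i, v))


def template_type(i, v):
    """Classify template v with respect to pattern i (illustrative helper)."""
    if len(i) != len(v):
        raise ValueError("pattern and template must have the same length")
    if v == i:
        return "equal"
    common = size(intersect(i, v))
    if common == size(v):
        return "subset"      # every one of v is also a one of i
    if common == size(i):
        return "superset"    # every one of i is also a one of v
    return "mixed"           # neither a subset nor a superset


pattern = (1, 1, 0, 0)
print(template_type(pattern, (1, 0, 0, 0)))  # subset
print(template_type(pattern, (1, 1, 1, 0)))  # superset
print(template_type(pattern, (1, 0, 1, 0)))  # mixed
print(template_type(pattern, (1, 1, 1, 1)))  # superset (an uncommitted, all-ones template)
```

Under this convention an uncommitted template behaves as a superset of every pattern of size at most $M - 1$.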
Finally, if $I$ is a pattern of the input list and $V$ is a template of ART1, we define $I \cap V$ as the binary vector with ones only at components where both the $I$ and $V$ components are one, and zeros at all other components. Let us now present two examples that are extreme cases and demonstrate clearly the tightness of the bounds mentioned in Section 1. To follow these examples the reader needs to be aware of Theorems 1 and 7 in Carpenter and Grossberg (1987). In the first example, ART1, with a vigilance parameter $\rho = 1$, is repeatedly presented with a nested list of input patterns in order of decreasing size. In particular, the input list $\{I_1, I_2, \ldots, I_{M-1}\}$ is such that $I_1 \subset I_2 \subset \cdots \subset I_{M-1}$ with $|I_k| = k$, and it is presented in order $I_{M-1}, I_{M-2}, \ldots, I_1, I_{M-1}, I_{M-2}, \ldots, I_1$, etc. Then, in the first list presentation only template $V_1 = I_1$ is created (see Theorem 7 of Carpenter and Grossberg 1987). Template $V_1$ cannot be destroyed thereafter because all patterns are supersets of or equal to template $V_1$. Hence, template $V_1$ is a stable template. In list presentations $\ge 2$ pattern $I_1$ will have direct access to template $V_1$ (see Theorem 1 in Carpenter and Grossberg 1987). As a result, the recognition code of pattern $I_1$ (i.e., $V_1$) self-stabilizes in exactly one list presentation. In the second list presentation only template
$V_2 = I_2$ is created. Template $V_2$ cannot be destroyed thereafter because all patterns other than pattern $I_1$ are supersets of or equal to $V_2$, and pattern $I_1$ is coded by the stable template $V_1$. In list presentations $\ge 3$ pattern $I_2$ will have direct access to template $V_2$. Hence, the recognition code of pattern $I_2$ (i.e., $V_2$) self-stabilizes in exactly two list presentations. Working similarly for the rest of the input patterns we can prove that ART1 self-stabilizes the code of a size-$l$ ($3 \le l \le M - 1$) pattern in exactly $l$ list presentations. This example corresponds to the extreme case where the upper bound on the number of list presentations required by ART1 to self-stabilize the recognition codes of size-$l$ patterns is attained. In the second example, ART1, with a vigilance parameter $\rho = 1$, is presented with a nested list of input patterns in order of increasing size. The input list $\{I_1, I_2, \ldots, I_{M-1}\}$ is such that $I_1 \subset I_2 \subset \cdots \subset I_{M-1}$ with $|I_k| = k$, and it is presented in order $I_1, I_2, \ldots, I_{M-1}, I_1, I_2, \ldots, I_{M-1}$, etc. Then, in the first list presentation the templates $V_l = I_l$, $1 \le l \le M - 1$, are created. In the second list presentation pattern $I_l$, $1 \le l \le M - 1$, will have direct access to template $V_l$. As a result, ART1 self-stabilizes the code of every pattern in the input list in exactly one list presentation. This example is another extreme case, where the number of list presentations required by ART1 to self-stabilize the recognition codes of size-$l$ patterns attains its lowest possible value (i.e., one list presentation). Carpenter and Grossberg (1987) made the following conjecture: Under their hypotheses of Section 18, if F2 has at least $N$ nodes, then each member of a list of $N$ input patterns that is presented cyclically to ART1 will have direct access to an F2 node after at most $N$ list presentations. In this paper, under assumptions A1 through A4, we prove a much stronger result, at least for most cases of interest.
The result states that the size of the pattern determines the upper bound on the number of pattern presentations required by ART1 to learn the pattern. In particular, a size-$l$ ($1 \le l \le M - 1$) pattern requires at most $l$ list presentations. One of the cases where the conjecture is stronger corresponds to the situation where the input list contains $N$ patterns with $N < M - 1$. Considering, though, that $N$ is an integer between 2 and $2^M - 2$ (patterns of size 0 or size $M$ are excluded), the result of this paper is stronger than the conjecture for most cases of interest.
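The two nested-list examples of this section can be reproduced with a small simulation. The sketch below is a simplified fast-learning model written for illustration and is not the full ART1 dynamics of Carpenter and Grossberg (1987). Its assumptions: patterns and templates are sets of active F1 indices; committed nodes are searched in decreasing order of the choice value $L\,|I \cap V| / (L - 1 + |V|)$; a searched node codes $I$ when $|I \cap V| \ge \rho |I|$ and then learns $V \leftarrow I \cap V$; and an uncommitted node is used only after every committed node has been reset. The names `present_list` and `run` and the value $L = 5/4$ are hypothetical.

```python
from fractions import Fraction


def present_list(patterns, templates, rho, L):
    """Present each pattern once; `templates` (a list of frozensets) is updated in place."""
    for I in patterns:
        # Search committed nodes in order of decreasing bottom-up choice value.
        def choice(j):
            V = templates[j]
            return L * len(I & V) / (L - 1 + len(V))

        for j in sorted(range(len(templates)), key=choice, reverse=True):
            match = I & templates[j]
            if len(match) >= rho * len(I):   # vigilance test passed
                templates[j] = match          # fast learning: V <- I ∩ V
                break
        else:
            templates.append(I)               # all committed nodes reset: commit a new node
    return templates


def run(order, passes, rho=Fraction(1), L=Fraction(5, 4)):
    """Return the sorted template sizes after each list presentation."""
    templates, history = [], []
    for _ in range(passes):
        present_list(order, templates, rho, L)
        history.append(sorted(len(V) for V in templates))
    return history


M = 4
nested = [frozenset(range(k)) for k in range(1, M)]   # |I_k| = k, I_1 ⊂ I_2 ⊂ I_3

print(run(nested[::-1], 3))  # decreasing size: [[1], [1, 2], [1, 2, 3]]
print(run(nested, 2))        # increasing size: [[1, 2, 3], [1, 2, 3]]
```

With patterns presented in order of decreasing size, one new template is completed per list presentation, so the size-$l$ pattern needs $l$ presentations; in order of increasing size all templates exist after the first presentation.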
3 Results
We first state two lemmas that will be useful for the proof of our results. Lemma 1 is valid under assumptions A1-A3. Lemma 1. Suppose that $I$ is an arbitrary pattern from the input list. Learned subset templates with respect to $I$ are searched first, in order of decreasing size (i.e., the closest learned subset template to $I$ is searched first, and if it is reset, the next closest subset template to $I$ is searched, and so on). If all learned
subset templates are reset, then superset and mixed learned templates, as well as uncommitted templates, are searched, not necessarily in that order. Lemma 1 is a shortened restatement of Carpenter and Grossberg's (1987) Theorem 7 and its proof can be found there. Let us now assume that an input pattern $I$ is presented at F1. The activity at F1 changes from 0 to $I$. Let us also assume that a node in F2 with template $V_i$ is searched first. The activity at F1 changes to $I \cap V_i$. If $|I \cap V_i|\,|I|^{-1} \ge \rho$, then template $V_i$ codes pattern $I$. If $|I \cap V_i|\,|I|^{-1} < \rho$, then the node with template $V_i$ is reset and another node in F2 is searched. The parameter $\rho$, called vigilance, determines whether the top-down template of an F2 node is a good match of the input pattern $I$. It is obvious from the description of this reset mechanism that if a template $V_i$ is searched first and reset (i.e., $|I \cap V_i|\,|I|^{-1} < \rho$), then any other template $V_j$ that is searched later will be reset if $|I \cap V_j|\,|I|^{-1} \le |I \cap V_i|\,|I|^{-1}$. Lemma 2 is an immediate consequence of Lemma 1 and the above discussion. Lemma 2 is valid under assumptions A1-A3. Lemma 2. Suppose that $I$ is an arbitrary pattern from the input list, $V_i$ is a learned subset template (with respect to $I$), and $V_j$ is an arbitrary mixed learned template (with respect to $I$), prior to $I$'s presentation. Then, if $V_i$ is reset and $V_j$ is searched, $V_j$ will be reset if $|I \cap V_j| \le |I \cap V_i| = |V_i|$.
Our results are now presented in the form of a theorem. The theorem is valid under assumptions A1-A4. Theorem 1. Consider an arbitrary list of binary input patterns that is repeatedly presented to ART1. Then, in list presentations $\ge x$, where $x \ge 2$:
T1: A pattern $I$ of size $\ge x$ cannot be coded by a mixed template $V$ such that $|I \cap V| \le x - 1$.
T2: A pattern $I$ of size $\le x - 1$ will have direct access to a stable template that has been created in list presentations $\le x - 1$.
T1 is obviously true for $x \ge M$, because according to the assumptions of the theorem there are no patterns of size $\ge M$ in the input list. Hence, if we prove T1 for $2 \le x \le M - 1$ we have proven T1 for all $x$. Furthermore, it is easy to see that if we prove T2 for $2 \le x \le M$ we have proven T2 for all $x$. We will prove T1 for $2 \le x \le M - 1$ and T2 for $2 \le x \le M$ in two steps. In step 1, we prove that T1 and T2 are valid for $x = 2$. In step 2, we show that for every $n$, $3 \le n \le M$, the assumption that T1 and T2 are valid for all $2 \le x \le n - 1$ implies their validity for $x = n$. This inductive procedure guarantees the validity of T1 and T2 for all $x$ such that $2 \le x \le M$, and consequently the validity of the theorem for all $x \ge 2$.
Step 1. Prove that T1 and T2 are valid for $x = 2$. Consider a pattern $I$ of size $\ge 2$. At all times, prior to $I$'s appearance in list presentations $\ge 2$, there exists a learned subset template $V$ of $I$. Hence, according to Lemmas 1 and 2, $I$ cannot be coded by a mixed template of size $\le 1$. This proves T1 at $x = 2$. Now assume that a pattern $I$ of size 1 has been coded by the template $V^I_{k,1}$ in the first list presentation. $V^I_{k,1}$ cannot be destroyed thereafter; hence, $V^I_{k,1}$ is a stable template. Furthermore, after the creation of $V^I_{k,1}$ no other template equal to $V^I_{k,1}$ can be created (see Lemmas 1 and 2). As a result, in list presentations $\ge 2$, the size-1 pattern $I$ will have direct access to its equal $V^I_{k,1}$ template (see Lemma 1). The stable template $V^I_{k,1}$ was created in the first list presentation. This proves T2 at $x = 2$.
Step 2. Pick an $n$ such that $3 \le n \le M$ and assume that T1 and T2 are valid for every $x$ such that $2 \le x \le n - 1$. It will now be shown that T1 and T2 are true for $x = n$. Proof of T1 at $x = n$. Consider a pattern $I$ of size $|I| \ge n$. Assume that $I$ was coded by $V^I_{k,n-1}$ in list presentation $n - 1$ and $|V^I_{k,n-1}| = l$. Two cases are distinguished: (a) $l \le n - 1$. The template $V^I_{k,n-1}$ can be destroyed by either (1) a size-$k$ ($k < l$) pattern, or by (2) a mixed pattern $\hat{I}$ that is coded by $V^I_{k,n-1}$, where $|\hat{I} \cap V^I_{k,n-1}| = k$ ($k < l$). All size-$k$ ($k < l \le n - 1$) patterns have direct access to stable templates that have been created by the end of list presentation $n - 2$. This is due to the validity of T2 for all $x$ such that $2 \le x \le n - 1$; hence, (1) cannot happen. Furthermore, in list presentations $\ge n - 1$, (2) cannot happen either, due to the validity of T1 for all $x$ such that $2 \le x \le n - 1$. As a result, the $V^I_{k,n-1}$ template of size $l \le n - 1$ is stable, and pattern $I$ will be coded in list presentations $\ge n$ by the subset template $V^I_{k,n-1}$, or by some other subset template of size larger than or equal to the size of $V^I_{k,n-1}$ (see Lemma 1). In short, $I$ cannot be coded by any mixed template. (b) $l \ge n$. The template $V^I_{k,n-1}$ can be refined to a size-$k$ template, $k \ge n - 1$, prior to $I$'s appearance in future list presentations; $k$ cannot be smaller than $n - 1$ due to the validity of T1 and T2 at all $x$ such that $2 \le x \le n - 1$. So, in list presentations $\ge n$, $I$ will always have access to a subset template of size at least $n - 1$. Hence, in list presentations $\ge n$, the pattern $I$ cannot be coded by a mixed template $V$ such that $|I \cap V| \le n - 1$ (see Lemma 2). The above arguments prove T1 at $x = n$. Proof of T2 at $x = n$. The result is obvious for a pattern $I$ of size $< n - 1$ due to the validity of T2 for all $x$ such that $2 \le x \le n - 1$. Let us now take a pattern $I$ of size $|I| = n - 1$. Suppose, once more, that $I$ was coded by $V^I_{k,n-1}$ in list presentation $n - 1$ and $|V^I_{k,n-1}| = l$.
We distinguish two cases: (a) $l = n - 1$. Due to the discussion in the proof of T1 for $x = n$, case (a), we conclude that the template $V^I_{k,n-1}$ is stable. Furthermore, after
the creation of the template $V^I_{k,n-1} = I$, no other template equal to $V^I_{k,n-1}$ can be created (see Lemmas 1 and 2). As a result, in list presentations $\ge n$ the size-$(n-1)$ pattern $I$ has direct access to its equal $V^I_{k,n-1}$ template (see Lemma 1). The stable $V^I_{k,n-1}$ template was created in a list presentation $\le n - 1$. (b) $l < n - 1$. Note that in this case, $V^I_{k,n-1}$ can code $I$. The template $V^I_{k,n-1}$ is stable, and new size-$s$ templates ($1 \le s \le n - 2$) cannot be created in list presentations $\ge n - 1$, due to the validity of T1 and T2 for all $x$ such that $2 \le x \le n - 1$. A template equal to $I$ can be created prior to the end of list presentation $n - 1$. After the end of list presentation $n - 1$, knowing that $I$ can be coded by the stable subset template $V^I_{k,n-1}$, a template equal to $I$ can be created only if a pattern $\hat{I}$ is coded by a mixed template $V$ such that $\hat{I} \cap V = I$; but this is impossible due to the validity of T1 at $x = n$ as proved above. If a template equal to $I$ is created prior to the end of list presentation $n - 1$, then this template is stable [see the discussion in the proof of T1 at $x = n$, case (a)]. No other template equal to $I$ will be created thereafter (see Lemmas 1 and 2). Hence, in list presentations $\ge n$ either the stable template $I$ or the stable template $V^I_{k,n-1}$ will code pattern $I$. In both cases, the stable template that codes $I$ is created in list presentations $\le n - 1$. The proof of T2 at $x = n$ is now complete. Consequently, the theorem is true. □ Note: As mentioned before, the proof of T1 at $x = M$ is obvious because the assumptions of the theorem exclude patterns $I$ of size greater than or equal to $M$. As a result, for the proof of T1 at $x = M$, it is not necessary to go through the arguments presented in the proof of T1 for $x < M$. In the following, the conclusions of the theorem, as well as certain important byproducts of its proof, are presented as properties of learning in the ART1 network.
Once more, it is assumed that an arbitrary list of binary input patterns is repeatedly presented to ART1. In list presentations $\ge x$, where $x \ge 2$, learning in ART1 has the properties:
P1: A pattern $I$ of size $\ge x$ cannot be coded by a mixed template $V$ such that $|I \cap V| \le x - 1$.
P2: A pattern $I$ of size $\le x - 1$ will have direct access to a stable template that was created in list presentations $\le x - 1$.
P3: Size-$(x - 1)$ templates cannot be created.
P4: Size-$x$ templates cannot be destroyed.
The basic result of this work is that if an ART1 network is presented repeatedly with an arbitrary list of binary input patterns, it self-stabilizes the recognition codes (templates) of size-$l$ patterns in at most $l$ list presentations. This basic result is an immediate consequence of property P2. It is worth observing that properties P1-P4 are valid independent of the order
in which the input patterns are presented within the list. In addition, the ordering of the patterns within the list can change from one list presentation to the next without affecting the validity of these properties. Finally, the basic result implies that if the input patterns can be represented by $M$ input nodes, ART1 learns and recognizes the list after at most $M - 1$ list presentations (size-0 and size-$M$ patterns have been excluded from the input list).
4 Conclusion
An important self-organizing neural network, ART1, introduced and analyzed by Carpenter and Grossberg (1987), was considered. The convergence properties of any neural network model are an issue of fundamental importance. Carpenter and Grossberg have proven a multitude of ART1 properties, including certain of its convergence characteristics. In this work, we concentrated only on the convergence properties of ART1 in the fast learning case. In particular, under the modeling assumptions of Section 2, a size-$l$ pattern from a list of binary input patterns presented repeatedly to ART1 has direct access to a stable code in at most $l$ list presentations. Hence, each member of a list of binary input patterns presented repeatedly to ART1 will have direct access to a stable code after at most $M - 1$ list presentations (size-0 and size-$M$ patterns are excluded from the input list). Other useful properties associated with learning in the ART1 network were also shown.
Acknowledgments
This research was supported by a grant from the Florida High Technology and Industry Council with matching support from Martin Marietta Electronic Systems Division.
References
Carpenter, G. A., and Grossberg, S. 1987. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision, Graphics, Image Process. 37, 54-115.
Grossberg, S. 1976. Adaptive pattern recognition and universal recoding. II: Feedback, expectation, olfaction, and illusions. Biol. Cybernet. 23, 187-202.
Received 18 April 90; accepted 6 August 90.
Communicated by Les Valiant
A Polynomial Time Algorithm That Learns Two Hidden Unit Nets
Eric B. Baum
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA
Let $N$ be the class of functions realizable by feedforward linear threshold nets with $n$ input units, two hidden units each of zero threshold, and an output unit. This class is also essentially equivalent to the class of intersections of two open half spaces that are bounded by planes through the origin. We give an algorithm that probably almost correctly (PAC) learns this class from examples and membership queries. The algorithm runs in time polynomial in $n$, $\epsilon^{-1}$, and $\delta^{-1}$, where $\epsilon$ is the accuracy parameter and $\delta$ is the confidence parameter. If only examples are allowed, but not membership queries, we give an algorithm that learns $N$ in polynomial time provided that the probability distribution $D$ from which examples are chosen satisfies $D(x) = D(-x)$ for all $x$. The algorithm yields a hypothesis net with two hidden units, one linear threshold and the other quadratic threshold.
1 Introduction
The perceptron algorithm (Rosenblatt 1962) is the launching point for much of modern neural network research. This algorithm provably finds a classifier for any set of linearly separable examples, and thus has applications, for example, to pattern recognition tasks. Unfortunately, as was stressed by Minsky and Papert (1969), the perceptron algorithm does not work in polynomial time. However, Khachian (1979) and also Karmarkar (1984) have provided algorithms for classifying linearly separable examples that are polynomial time. Blumer et al. (1987) subsequently showed how such polynomial time classification algorithms could be employed to produce polynomial time PAC learning algorithms. Minsky and Papert (1969) also stressed that linear separability is too strong a condition to expect, and thus learning algorithms must deal with more complex target functions to be practical. Thus attention is now focused on networks with hidden units. The main learning heuristic is backpropagation (Le Cun 1986; Rumelhart et al. 1986), and a major question is whether there is any algorithm for training networks with hidden units that scales well, that is, will work rapidly for large networks.
Neural Computation 2, 510-522 (1990) © 1990 Massachusetts Institute of Technology
Figure 1: A feedforward linear threshold net with two hidden units. All nets discussed in this paper have one hidden layer of units. Input units are connected to hidden units, which are connected to the output unit. No connections directly from input to output unit are allowed.
The simplest such networks are the sort we will study in this paper, which have two hidden units each of threshold zero (see Fig. 1). The learning problem for such networks is equivalent (as we will remark) to learning functions described by intersections of half spaces. There has been intensive work on this problem (see, e.g., Ridgeway 1962; Blumer et al. 1989; Baum 1990a). A negative partial answer was provided by Blum and Rivest (1988) who showed that there is no polynomial time algorithm that can solve the loading problem for networks with two hidden units. This obstruction can be evaded if we consider more flexible algorithms (see Baum 1989 for a review). Their result means it is unlikely we will find an algorithm that trains neural networks to their full potential. However, even if we could train a network with two hidden units to its potential, it could only learn target functions that can be represented by such a network. We might just as well train a larger network to realize such target functions. In this paper we give an algorithm that provably learns such functions, and produces as its output a feedforward net with two hidden units; however, one of these two units is a quadratic threshold unit, rather than linear threshold.
We work first within the PAC learning model (Valiant 1984). In this model a class $C$ of boolean functions is called learnable if there is an algorithm $A$ and a polynomial $p$ such that for every $n$, for every probability distribution $D$ on $\mathbb{R}^n$, for every $c \in C$, for every $0 < \epsilon, \delta < 1$, $A$ calls examples and with probability at least $1 - \delta$ supplies in time bounded by $p(n, s, \epsilon^{-1}, \delta^{-1})$ a function $g$ such that
$$\mathrm{Prob}_{x \in D}[g(x) \ne c(x)] < \epsilon.$$
Here $s$ is the "size" of $c$, that is, the number of bits necessary to encode $c$ in some "reasonable" encoding. This model thus allows the algorithm $A$ to see classified examples drawn from some natural distribution and requires that $A$ output a hypothesis function which with high confidence ($1 - \delta$) will make no more than a fraction $\epsilon$ of errors on test examples drawn from the same natural distribution. This model thus corresponds reasonably well to what is often desired in practice, for example, in applications of backpropagation. Let $N$ denote the class of boolean functions defined by feedforward nets of linear threshold units with two hidden units, both of threshold zero. We will supply for this class an algorithm fulfilling the above conditions provided that one restricts to probability distributions with an inversion symmetry: $D(x) = D(-x)$. We will say a class $C$ is i-learnable in this case (i.e., if we supply an $A$ that satisfies the above conditions when we restrict to inversion symmetric distributions). The inversion symmetric condition is fulfilled for some natural distributions, for example, for the uniform distribution on the unit sphere, $S^n$. It is evidently not likely to hold exactly in practice. Although our proof will fail if this condition is violated, it is plausible (and an interesting open problem) that related methods might be robust to small violations of this symmetry. We will return to this point in Section 4. A natural way to increase the power of the learner is to allow, in addition to examples, membership queries (Valiant 1984). Here the algorithm $A$ supplies an $x$, and is told the classification $c(x)$. This protocol is thus appropriate when we have a teacher who can answer questions, for example, a human expert, or when we can experiment.
We believe learning with membership queries is far closer to the way people learn than simply through examples, and that it is quite a reasonable extension in practical applications, for example, in optical character recognition or speech recognition, where a human expert could classify examples posed by a learning algorithm. This protocol has also been studied (see, e.g., Angluin 1988), but it has been largely overlooked in the neural net literature. We will give here an algorithm that in polynomial time learns $N$ for arbitrary distributions when membership queries are allowed. A separate publication (Baum 1990b) will describe more powerful algorithms that are able to learn from examples and queries, in polynomial time, substantially wider classes than $N$.
2 Preparatory Remarks on PAC Learning
We will make use of the following theorem. Theorem 1. (Blumer et al. 1988). Let $C$ be a well-behaved concept class of VC dimension¹ $d$. Then if we call
$$M_0(\epsilon, \delta, d) = \max\left(\frac{4}{\epsilon}\log\frac{2}{\delta},\ \frac{8d}{\epsilon}\log\frac{13}{\epsilon}\right)$$
examples $(x, t)$ from any distribution $D'$ on $\mathbb{R}^n \times \{0, 1\}$ and find a $c \in C$ consistent with these examples [i.e., $c(x) = t(x)$ for all these examples], then we have confidence $1 - \delta$ that $\mathrm{Prob}_{(x,t) \in D'}[c(x) \ne t(x)] < \epsilon$. We must next introduce a specification of the size of our target functions, which will depend on the number of bits of precision $a$ with which we work. The reason why we need such a notion is the following. Say we wished to learn target functions specified by a single half plane. If an adversary were allowed to cluster the probability distribution $D$ arbitrarily close to the plane, then we might need an arbitrary amount of time to find a plane separating positive from negative examples, since, for example, Khachian's or Karmarkar's algorithms take time depending polynomially on the number of bits of precision with which one works. One could avoid this problem by imposing a continuity condition on the probability distribution $D$, or by restricting the accuracy with which the target function is defined. Instead we will adopt the simplest solution for our case, which is to assume that the examples lie on a lattice of spacing $2^{-a}$, that is, for any example $x$, $x \in 2^{-a}\mathbb{Z}^n$, and $D(x) = 0$ for $\|x\| > 1$. With this assumption, Karmarkar's algorithm will be able to find linear separators in time polynomial in $a$, $n$, and the number of examples being separated.
3 Learning from Examples Only
We first remark that the problem of learning such feedforward nets is trivially equivalent to learning an intersection of half spaces. The "neurons" in our nets are linear threshold functions: these take value 1 in a half space and zero in the complementary half space. The planes corresponding to the two hidden units thus divide the $n$-dimensional input space into four regions (see Fig. 2), and the value of $f(x)$ depends only on which of the four regions $x$ lies in. The position of these two planes, and thus of the four regions, depends on the weights to the hidden units. Depending on the weights to the output unit, our net may have as positive regions any one of these regions, for example, {a}; any two contiguous regions, for example, {a, b}, but not {a, c}; any three regions, for example, {a, b, c}; or all four regions (or none). It is evident that each of these cases can be viewed as the intersection of two half spaces; or, in the case of three regions, the complement of the intersection, which becomes an intersection problem if we simply reverse our definition of positive and negative examples. Thus it suffices to consider learning an intersection of two half spaces.
Figure 2: The hyperplanes corresponding to the two hidden units divide the input space into four regions (labeled a, b, c, and d).
¹The Vapnik-Chervonenkis (VC) dimension is defined as follows. Let $C$ be a class of boolean functions and $S$ a set of points. We say $C$ shatters $S$ if $C$ induces all boolean functions on $S$. The VC dimension of $C$ is the size of the largest set $C$ shatters. For example, it can be shown that the VC dimension of half spaces bounded by planes in $\mathbb{R}^n$ is $n + 1$, and the VC dimension of half spaces bounded by surfaces of degree $k$ is $O(n^k)$ (Wenocur and Dudley 1981).
Let $F$ be the class of functions described as the intersection of two open half spaces on $\mathbb{R}^n$. Thus $f \in F$ may be described by giving two
vectors $w_1, w_2 \in \mathbb{R}^n$, and $f(x) = 1$ if $w_1 \cdot x > 0$ and $w_2 \cdot x > 0$, else $f(x) = 0$.² Let $G$ be the class of functions we call the XOR of two half spaces, defined as follows: $g \in G$ is described by giving two vectors $w_1, w_2 \in \mathbb{R}^n$, and $g(x) = 1$ if $(w_1 \cdot x)(w_2 \cdot x) > 0$, else $g(x) = 0$. Theorem 2. (Blum 1989; Valiant and Warmuth 1989). $G$ is learnable. Proof. Given a set $S$ of examples $[x^\mu, g(x^\mu)]$ we may find a consistent classifier by the following trick. Find a $w \in \mathbb{R}^{n^2}$ such that $\sum_{ij} w_{ij} x_i x_j > 0$ for any positive example $x \in S$ and $\sum_{ij} w_{ij} x_i x_j < 0$ for any negative example in $S$. Such a $w$ exists by the definition of $G$, since $\sum_{ij} w_{1i} w_{2j} x_i x_j$ is greater than or less than zero according to whether $x$ is a positive or negative example. Finding such a $w$ is a simple linear programming problem and may be accomplished using Karmarkar's algorithm (Karmarkar 1984). Now we have found a consistent classifier in the set $H$ of half spaces on $\mathbb{R}^{n^2}$, which has VC dimension $n^2 + 1$. By Theorem 1 this solves our problem if we use $M_0(\epsilon, \delta, n^2 + 1)$ examples. Q.E.D.
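The lifting trick in the proof of Theorem 2 can be illustrated with a small sketch. It maps each example $x$ to the $n^2$ products $x_i x_j$ and looks for a linear separator in that lifted space. Two things here are assumptions made for illustration rather than the paper's method: a plain perceptron is substituted for Karmarkar's linear programming algorithm (it needs no external solver), and sampled points too close to either target plane are rejected so that the perceptron is guaranteed a margin and stops quickly. The target vectors, sample size, and function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lift(x):
    """Map x in R^n to the n^2 products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


def xor_label(w1, w2, x):
    """Target g(x) = 1 if (w1 . x)(w2 . x) > 0, else 0."""
    return 1 if dot(w1, x) * dot(w2, x) > 0 else 0


def fit_lifted_separator(examples, max_epochs=5000):
    """Perceptron on lifted features; returns w consistent with all examples."""
    w = [0.0] * len(lift(examples[0][0]))
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            phi = lift(x)
            sign = 1 if label == 1 else -1
            if sign * dot(w, phi) <= 0:          # misclassified: perceptron update
                w = [wi + sign * p for wi, p in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no consistent separator found")


def predict(w, x):
    return 1 if dot(w, lift(x)) > 0 else 0


random.seed(0)
w1, w2 = (1.0, 0.0), (0.6, 0.8)                  # hypothetical target vectors
examples = []
while len(examples) < 40:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    if abs(dot(w1, x) * dot(w2, x)) >= 0.05:     # keep a margin from both planes
        examples.append((x, xor_label(w1, w2, x)))

w = fit_lifted_separator(examples)
print(all(predict(w, x) == label for x, label in examples))
```

The learned $w$ has $n^2 = 4$ components here; the hypothesis is a single quadratic threshold unit, which is why the half spaces involved live in $\mathbb{R}^{n^2}$ and have VC dimension $n^2 + 1$.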
Now we will solve the i-learning problem for an intersection of two half spaces by reducing to the problem of learning XORs. We are trying to find a region $r$ defined by $w_1 \cdot x > 0$, $w_2 \cdot x > 0$ (see Fig. 3). Let $\hat{r}$ denote the region $w_1 \cdot x < 0$, $w_2 \cdot x < 0$. The idea is the following. We will find a closed half space $h$ containing very little measure for positive examples [i.e., $D(r \cap h) \ll \epsilon$] and almost all measure in $\hat{r}$ [i.e., $D(\hat{r} \cap \bar{h}) \ll \epsilon$]. ($\bar{h}$ denotes the open half space which is the complement of $h$.) We may easily find such an $h$ by collecting a sufficiently large number $M_1$ of positive examples and finding a plane through the origin having all of these on one side. Theorem 1 then guarantees us that almost all positive measure is on this side. Because of the symmetry we are requiring on the distribution, if almost all measure in $r$ is on one side of this plane, almost all measure in $\hat{r}$ is on the other side. Now we may safely hypothesize that any point in $h$ is a negative example, and we have thus reduced the problem to classifying points in $\bar{h}$. We next collect a set $S$ of examples in $\bar{h}$. Since, with very high probability, none of the points in $S$ lies in $\hat{r}$, the set $S$ will be consistent with the XOR of two planes. We will thus use the XOR algorithm. This will produce a $w \in \mathbb{R}^{n^2}$. Our hypothesis function will then be: $x$ is positive if and only if $x \in \bar{h}$ and $\sum_{ij} w_{ij} x_i x_j > 0$. The algorithm is the following.³
1. Let $M_1 = M_0(\epsilon\delta/[16 M_0(\epsilon, \delta/4, n^2 + 1)], \delta/8, n + 1) = \tilde{O}(n^3/\epsilon^2\delta)$. Call examples until either (a) we have $M_1$ positive examples or (b) we have called $M = \max(2M_1/\epsilon, (8/\epsilon)\ln\delta^{-1})$ examples (whichever comes first). In case (b), halt and output hypothesis $g(x) = 0$ (i.e., classify all examples as negative).
2. Find a half space $h$ bounded by a hyperplane through the origin such that all the positive examples are in $\bar{h}$. This can be done by Karmarkar's algorithm.
3. Let $S = \{[x^\mu, f(x^\mu)]\}$ be the set of the first $M_0(\epsilon, \delta/4, n^2 + 1)$ examples we found which are in $\bar{h}$. Find a $w \in \mathbb{R}^{n^2}$ such that $\sum_{ij} w_{ij} x_i x_j > 0$ for any positive example $x \in S$ and $\sum_{ij} w_{ij} x_i x_j < 0$ for any negative example in $S$. $w$ can be found, for example, by Karmarkar's algorithm.
4. Output hypothesis $g$: $x$ is positive if and only if $x \in \bar{h}$ and $\sum_{ij} w_{ij} x_i x_j > 0$.
5. Halt.
²We restrict to hyperplanes through the origin.
³For ease of notation, define $\tilde{O}(x)$ to mean $O(x) \times$ (terms of logarithmic size).
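To get a feel for the sample sizes used in step 1 and step 3, the following sketch evaluates them numerically. It rests on assumptions that should be read as a reconstruction, not as the paper's exact constants: the bound of Theorem 1 is taken as $M_0(\epsilon, \delta, d) = \max[(4/\epsilon)\log_2(2/\delta), (8d/\epsilon)\log_2(13/\epsilon)]$ with base-2 logarithms, and the cap on the number of calls is taken as $M = \max[2M_1/\epsilon, (8/\epsilon)\ln\delta^{-1}]$. The function names `m0` and `sample_sizes` are hypothetical.

```python
import math


def m0(eps, delta, d):
    """Sample bound of Theorem 1 (assumed form, base-2 logarithms)."""
    return math.ceil(max(4 / eps * math.log2(2 / delta),
                         8 * d / eps * math.log2(13 / eps)))


def sample_sizes(n, eps, delta):
    """Return (|S|, M1, M) for the example-only algorithm (assumed constants)."""
    m_xor = m0(eps, delta / 4, n * n + 1)                  # examples used in step 3
    m1 = m0(eps * delta / (16 * m_xor), delta / 8, n + 1)  # positives needed in step 1
    m = math.ceil(max(2 * m1 / eps, 8 / eps * math.log(1 / delta)))  # cap on calls
    return m_xor, m1, m


print(m0(0.5, 0.5, 1))             # 76
print(sample_sizes(2, 0.1, 0.1))   # M1 and M are many orders of magnitude larger than |S|
```

Even for $n = 2$ the value of $M_1$ is in the billions, which illustrates why the number of examples required by the provable algorithm is regarded as large.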
Figure 3: The input space is divided by target hyperplanes $w_1$ and $w_2$ and the hyperplane $\partial h$, which bounds half space $h$, that we draw in step 2 of our algorithm. Region $r = \{x : w_1 \cdot x \ge 0, w_2 \cdot x \ge 0\}$ is labeled, as is $\hat{r} = \{x : -x \in r\}$.
Notice that the hypothesis function $g$ output by this algorithm is equivalent to a feedforward threshold net with two hidden units. One of these is a linear threshold with output $\theta(\sum_i w_{hi} x_i)$, where $w_h$ is the normal to the boundary hyperplane of $h$. The other is a quadratic threshold with output $\theta(\sum_{ij} w_{ij} x_i x_j)$. Here $\theta$ is the Heaviside threshold function and $x_i$ is the $i$th component of $x$. We now prove this algorithm is correct, provided $D$ is inversion symmetric, that is, for all $x$, $D(x) = D(-x)$. Theorem 3. The class $F$ of intersections of two half spaces is i-learnable.
Proof. It is easy to see⁴ that with confidence $1 - \delta$, if step (1b) occurs, then at least a fraction $1 - \epsilon$ of all examples are negative, and the hypothesis $g(x) = 0$ is $\epsilon$-accurate.⁵ Likewise, if step (1b) does not occur, we have confidence $1 - \delta/4$ that at least a fraction $\epsilon/4$ of examples are positive.⁶

If step (1b) does not occur, we find an open half space $h$ containing all the positive examples. By Theorem 1, with confidence $1 - \delta/8$, we conclude that the measure $D(\bar h \cap r) < \epsilon\delta/[16 M_0(\epsilon, \delta/4, n^2+1)]$ (here $\bar h$ denotes the complement of $h$). Now if $x \in \hat r$, then $-x \in r$, and vice versa; and if $x \in h$, then $-x \in \bar h$. Thus if $x \in h \cap \hat r$ then $-x \in \bar h \cap r$. But $D(x) = D(-x)$ by hypothesis. Thus the measure $D(h \cap \hat r) \le D(\bar h \cap r)$.

Now we bound the conditional probability $\mathrm{Prob}(x \in \hat r \mid x \in h)$ that a random example in $h$ is also in $\hat r$:

$$\mathrm{Prob}(x \in \hat r \mid x \in h) = D(h \cap \hat r)/D(h) \le D(\bar h \cap r)/D(h)$$

Recall we saw above that with confidence $1 - 3\delta/8$, $D(h) > \epsilon/4$ and $D(\bar h \cap r) < \epsilon\delta/[16 M_0(\epsilon, \delta/4, n^2+1)]$. Thus we have

$$\mathrm{Prob}(x \in \hat r \mid x \in h) < \delta/[4 M_0(\epsilon, \delta/4, n^2+1)]$$

Now we take $M_0(\epsilon, \delta/4, n^2+1)$ random examples in $h$. Since each of these $M_0$ examples has probability less than $\delta/4M_0$ of being in $\hat r$, we have confidence $1 - \delta/4$ that none of these examples is in $\hat r$. Thus we can in fact find a set of $w_{ij}$ as in the proof of Theorem 2, and by Theorem 1 (with confidence $1 - \delta/4$), this $w$ correctly classifies a fraction $1 - \epsilon$ of examples in $h$. Now our hypothesis function is: $x$ is positive if and only if $x \in h$ and $\sum_{ij} w_{ij} x_i x_j > 0$. With confidence $1 - \delta/8$ this correctly classifies all but a fraction much less than $\epsilon$ of points in $\bar h$, and with confidence $1 - 7\delta/8$ it correctly classifies all but a fraction $\epsilon$ of points in $h$. Q.E.D.

⁴ To see this define $LE(p, m, r)$ as the probability of having at most $r$ successes in $m$ independent Bernoulli trials with probability of success $p$ and recall (Angluin and Valiant 1979), for $0 \le \beta \le 1$, that $LE[p, m, (1-\beta)mp] \le e^{-\beta^2 mp/2}$. Applying this formula with $m = M_{cut}$, $p = \epsilon$, and $\beta = 1 - M_0/(M_{cut}\epsilon)$ yields the claim.

⁵ Thus if (1b) occurs we have confidence $1 - \delta$ we are $\epsilon$-accurate. The rest of the proof shows that if (1b) does not occur, we have confidence $1 - \delta$ we are $\epsilon$-accurate. Thus we have confidence overall $1 - \delta$ our algorithm is $\epsilon$-accurate.

⁶ To see this define $GE(p, m, r)$ as the probability of having at least $r$ successes in $m$ Bernoulli trials each with probability of success $p$ and recall (Angluin and Valiant 1979), for $0 \le \beta \le 1$, that $GE[p, m, (1+\beta)mp] \le e^{-\beta^2 mp/3}$. Applying this formula with $m = M_{cut}$, $p = \epsilon/4$, and $\beta = 1$ yields the claim.
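The two tail bounds of Angluin and Valiant (1979) invoked in the footnotes can be checked numerically against exact binomial tails. The sketch below is only an illustration: the function names and the sample values of `p`, `m` and `beta` are assumptions chosen for the check and do not come from the algorithm.

```python
from math import ceil, comb, exp, floor


def binomial_pmf(p, m, k):
    """Probability of exactly k successes in m Bernoulli(p) trials."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


def le(p, m, r):
    """LE(p, m, r): probability of at most r successes."""
    return sum(binomial_pmf(p, m, k) for k in range(0, floor(r) + 1))


def ge(p, m, r):
    """GE(p, m, r): probability of at least r successes."""
    return sum(binomial_pmf(p, m, k) for k in range(ceil(r), m + 1))


def le_bound(p, m, beta):
    """Bound on LE[p, m, (1 - beta) m p] for 0 <= beta <= 1."""
    return exp(-(beta**2) * m * p / 2)


def ge_bound(p, m, beta):
    """Bound on GE[p, m, (1 + beta) m p] for 0 <= beta <= 1."""
    return exp(-(beta**2) * m * p / 3)


# Illustrative values only (assumed, not taken from the paper).
p, m = 0.1, 200
for beta in (0.25, 0.5, 1.0):
    lower = le(p, m, (1 - beta) * m * p)
    upper = ge(p, m, (1 + beta) * m * p)
    print(beta, lower <= le_bound(p, m, beta), upper <= ge_bound(p, m, beta))
```

The exact tails are typically far smaller than the bounds, which is why the sample sizes derived from them are conservative.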
4 Some Remarks on Robustness

The algorithm we gave in Section 2 required, for proof of convergence, that the distribution be inversion symmetric. It also used a large number $\tilde O(n^3/\epsilon^2\delta)$ of examples, in spite of the fact that only $O(n/\epsilon)$ examples are known to be necessary for learning the class $F$ and only $O(n^2/\epsilon)$ examples are necessary for training hypothesis nets with one hidden linear threshold and one hidden quadratic threshold unit. Both of these restrictions arose so that we could be sure of obtaining a large set of examples that are not in $\hat r$, and that are therefore known to be quadratically separable.

We proposed using Karmarkar's algorithm for the linear programming steps. This allowed us to guarantee convergence in polynomial time. However, if we had an algorithm that was able to find a near optimal linear separator for a set of examples that is only approximately linearly separable, this might be much more effective in practice.⁷ For example, if in step 3 the set of examples we had was not exactly linearly separable, but instead contained by mistake either a few examples from $\hat r$ or a few noisy examples, we might still find an $\epsilon$-accurate classifier by using a robust method for searching for a near optimal linear separator. This would, for example, allow us to use fewer examples or to tolerate some variation from inversion symmetry.

Note that our method will (provably) work for a somewhat broader range of cases than strictly inversion symmetric distributions. For example, if we can write $D = D_1 + D_2$, where $D_1(x) = D_1(-x) > 0$ everywhere and $D_2(x) \ge D_2(-x)$ for all $x \in r$, our proof goes through essentially without modification. ($D_2$ need not be positive definite.)
⁷ We will report elsewhere on a new, apparently very effective heuristic for finding near optimal linear separators.

5 Learning with Membership Queries

The only use we made of the inversion symmetry of $D$ in the proof of Theorem 3 was to obtain a large set of examples that we had confidence was not in $\hat r$. The problem of isolating a set of examples not in $\hat r$ becomes trivial if we allow membership queries, since an example $y$ is in $\hat r$ if and only if $-y$ is a positive example. We may thus easily modify the algorithm using queries by finding a set $S_{\hat r}$ containing examples in $\hat r$ and a set $\bar S_{\hat r}$ containing examples not in $\hat r$ (where membership in one or the other of these is easily established by query). We then find a half space $h$ whose complement $\bar h$ contains all examples in $S_{\hat r}$ and correctly classify all examples in $\bar S_{\hat r}$ by the method of Theorem 2. A straightforward analysis establishes (1) how many examples we need for sufficient confidence, relying on Theorem 1, and (2) that we can with confidence acquire these examples rapidly. The algorithm is as follows.

1. Call examples, and for all negative examples $y$, query whether $-y$ is negative or positive. Accumulate in a set $S_{\hat r}$ all examples found such that $y$ is negative but $-y$ is positive. Accumulate in a set $\bar S_{\hat r}$ all examples not in $S_{\hat r}$. Continue until either (a) we have $M_{\hat r} = M_0(\epsilon/2, \delta/2, n+1)$ examples in $S_{\hat r}$ and we have $\bar M_{\hat r} = M_0(\epsilon/2, \delta/2, n^2+1)$ examples not in $S_{\hat r}$; or (b) one of the following happens. (b1) If in our first $M_{cut} = \max[4M_{\hat r}/\epsilon, 16\epsilon^{-1}\ln(2\delta^{-1}), 2\bar M_{\hat r}/\epsilon]$ examples we have not obtained $M_{\hat r}$ examples in $S_{\hat r}$, proceed to step 5. (b2) If instead we do not find $\bar M_{\hat r}$ examples in $\bar S_{\hat r}$ in these first $M_{cut}$ calls, proceed to step 6.
2. Find a half space $h$ bounded by a hyperplane through the origin such that all examples in $S_{\hat r}$ are in $\bar h$ and all positive examples are in $h$. $h$ can be found, for example, using Karmarkar's algorithm.
3. Find a $w \in \mathbb{R}^{n^2}$ s.t. $\sum_{ij} w_{ij} x_i x_j > 0$ for any positive example $x \in \bar S_{\hat r}$ and $\sum_{ij} w_{ij} x_i x_j < 0$ for any negative example in $\bar S_{\hat r}$. $w$ can be found, for example, by Karmarkar's algorithm.
4. Output hypothesis $g$: $x$ is positive if and only if $x \in h$ and $\sum_{ij} w_{ij} x_i x_j > 0$; and halt.
5. We conclude with confidence $1 - \delta/2$ that fewer than a fraction $\epsilon/2$ of examples lie in $\hat r$. Thus we simply follow step 3; output hypothesis $g$: $x$ is positive if and only if $\sum_{ij} w_{ij} x_i x_j > 0$; and halt.
6. We conclude with confidence $1 - \delta/2$ that fewer than a fraction $\epsilon/2$ of examples are positive and thus simply output hypothesis: $g(x) = 0$ (i.e., classify all examples as negative) and halt.
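The sorting of examples by membership query in step 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the oracle built by `make_oracle` stands in for the unknown target (here a hypothetical intersection of two half spaces through the origin), and the names `make_oracle` and `split_by_query` are illustrative only. The stopping rules (a), (b1) and (b2) are omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def make_oracle(w1, w2):
    """Hypothetical target: x is positive iff w1.x >= 0 and w2.x >= 0."""
    def is_positive(x):
        return dot(w1, x) >= 0 and dot(w2, x) >= 0
    return is_positive


def split_by_query(examples, is_positive):
    """Sort examples into those in r_hat and those not in r_hat.

    An example y is placed in r_hat when y is negative and the membership
    query on -y comes back positive; the query is only issued for negative y.
    """
    in_r_hat, not_in_r_hat = [], []
    for y in examples:
        minus_y = [-v for v in y]
        if not is_positive(y) and is_positive(minus_y):
            in_r_hat.append(y)
        else:
            not_in_r_hat.append(y)
    return in_r_hat, not_in_r_hat


oracle = make_oracle([1, 0], [0, 1])  # assumed target: the positive quadrant
sample = [[1, 1], [-1, -1], [-1, 1], [1, -1]]
in_r_hat, not_in_r_hat = split_by_query(sample, oracle)
print(in_r_hat, not_in_r_hat)
```

Every example left in the second list lies outside the reflected region, which is what makes that list quadratically separable by construction rather than only with high probability.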
Theorem 4. The class $F$ of intersections of two half spaces is learnable from examples and membership queries.

Proof. It is easy to see with confidence $1 - \delta/2$, as in the proof of Theorem 3, that if 1(b1) occurs, then the probability of finding an example in $\hat r$ is less than $\epsilon/2$. Thus we are allowed to neglect all examples in $\hat r$, provided that we correctly classify a fraction $1 - \epsilon/2$ of all other examples. We then go to step 5, and find a quadratic classifier for the examples we have accumulated that are not in $S_{\hat r}$, which number more than the $M_0(\epsilon/2, \delta/2, n^2+1)$ required. By Theorem 1, as in the proof of Theorem 2, this gives us with confidence $1 - \delta/2$ a classifier that is $\epsilon/2$-accurate. Thus overall if step (b1) is realized, we have confidence at least $1 - \delta$ of producing an $\epsilon$-accurate hypothesis. Likewise, if (b2) occurs, then with confidence $1 - \delta/2$ at most a fraction $\epsilon/2$ of examples are not in $\hat r$, and thus at most a fraction $\epsilon/2$ of examples are positive, and we are justified in hypothesizing all examples are negative, which we do in step 6.
Assume now we reach step 2. The complement $\bar h$ of the half space $h$ found in step 2 contains, with confidence $1 - \delta/2$, all but a possible $\epsilon/2$ of the probability for examples in $\hat r$, by Theorem 1. The quadratic separator found in step 3 correctly classifies, by Theorem 1, with confidence $1 - \delta/2$, all but a fraction $\epsilon/2$ of the measure not in $\hat r$. Note that there is no possibility that the set of examples not in $S_{\hat r}$ will fail to be quadratically separable since it has no members in $\hat r$. This is in contrast to the situation in the proof of Theorem 3, where we had only statistical assurance of quadratic separability. Putting this all together, we correctly classify with confidence $1 - \delta/2$ all but a fraction $\epsilon/2$ of the measure in $\hat r$, and with equal confidence all but the same fraction of measure not in $\hat r$. It is easy to see that, if we use some polynomial time algorithm such as Karmarkar's for the linear programming steps, this algorithm converges in polynomial time. Q.E.D.
6 Concluding Remarks
The task of learning an intersection of two half spaces, or of learning functions described by feedforward nets with two hidden units, is hard and interesting because a Credit Assignment problem apparently must be solved. We appear to have avoided this problem. We have solved the task provided the boundary planes go through the origin and provided either membership queries are allowed or one restricts to inversion symmetric distributions. This result suggests limited optimism that fast algorithms can be found for learning interesting classes of functions such as feedforward nets with a layer of hidden units. Hopefully the methods used here can be extended to interesting open questions such as: Can one learn an intersection of half spaces in the full distribution independent model, without membership queries?⁸ Can one find robust learning algorithms for larger nets, if one is willing to assume some restrictions on the distribution?⁹ We report elsewhere (Baum 1990b) on different algorithms that use queries to learn some larger nets as well as intersections of $k$ half spaces, in time polynomial in $n$ and $k$.

It is perhaps interesting that our algorithm produces as output function a net with two different types of hidden units: one linear threshold and the other quadratic threshold. It has frequently been remarked that a shortcoming of neural net models is that they typically involve only one type of neuron, whereas biological brains use many different types of neurons. In the current case we observe a synergy in which the linear neuron is used to pick out only one of the two parabolic lobes that are naturally associated with a quadratic neuron. For this reason we are able readily to approximate a region defined as the intersection of two half spaces. It seems reasonable to hope that neural nets using mixtures of linear threshold and more complex neurons can be useful in other contexts.

⁸ Extension of our results to half spaces bounded by inhomogeneous planes is a subcase, since the inversion symmetry we use is defined relative to a point of intersection of the planes.

⁹ Extension to intersections of more than two half spaces is an interesting subcase.
Acknowledgments

An earlier version of this paper appeared in the ACM sponsored "Proceedings of the Third Annual Workshop on Computational Learning Theory," August 6-8, 1990, Rochester, NY, Association for Computing Machinery, Inc.
References

Angluin, D. 1988. Queries and concept learning. Machine Learn. 2, 319-342.
Angluin, D., and Valiant, L. G. 1979. Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comput. Syst. Sci. 18, 155-193.
Baum, E. B. 1989. A proposal for more powerful learning algorithms. Neural Comp. 1, 201-207.
Baum, E. B. 1990a. On learning a union of half spaces. J. Complex. 6, 67-101.
Baum, E. B. 1990b. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, in press.
Blum, A. 1989. On the computational complexity of training simple neural networks. MIT Tech. Rep. MIT/LCS/TR-445.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 494-501. Morgan Kaufmann, San Mateo, CA.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1987. Learnability and the Vapnik-Chervonenkis dimension. U.C.S.C. Tech. Rep. UCSC-CRL-87-20, and J. ACM 36(4), 1989, pp. 929-965.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Khachian, L. G. 1979. A polynomial time algorithm for linear programming. Dokl. Akad. Nauk. USSR 244(5), 1093-1096. Translated in Soviet Math. Dokl. 20, 191-194.
Le Cun, Y. 1985. A learning procedure for asymmetric threshold networks. Proc. Cognitiva 85, 599-604.
Minsky, M., and Papert, S. 1969. Perceptrons, an Introduction to Computational Geometry. MIT Press, Cambridge, MA.
Ridgeway, W. C. III 1962. An adaptive logic system with generalizing properties. Tech. Rep. 1556-1, Solid State Electronics Lab, Stanford University.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Valiant, L. G., and Warmuth, M. 1989. Unpublished manuscript.
Wenocur, R. S., and Dudley, R. M. 1981. Some special Vapnik-Chervonenkis classes. Discrete Math. 33, 313-318.
Received 16 April 90; accepted 3 July 90.
Communicated by Scott Kirkpatrick and Bernardo Huberman
Phase Transitions in Connectionist Models Having Rapidly Varying Connection Strengths

James A. Reggia
Mark Edwards
Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA
A phase transition in a connectionist model refers to a qualitative change in the model's behavior as parameters determining the spread of activation (gain, decay rate, etc.) pass through certain critical values. As connectionist methods have been increasingly adopted to model various problems in neuroscience, artificial intelligence, and cognitive science, there has been an increased need to understand and predict these phase transitions to assure meaningful model behavior. This paper extends previous results on phase transitions to encompass a class of connectionist models having rapidly varying connection strengths ("fast weights"). Phase transitions are predicted theoretically and then verified through a series of computer simulations. These results broaden the range of connectionist models for which phase transitions are identified and lay the foundation for future studies comparing models with rapidly varying and slowly varying connection strengths.
Neural Computation 2, 523-535 (1990). © 1990 Massachusetts Institute of Technology

1 Introduction

It has recently been demonstrated that connectionist models (neural models) with large networks can exhibit "phase transitions" analogous to those occurring in physical systems (e.g., Chover 1988; Huberman and Hogg 1987; Kryukov and Kirillov 1989; Kurten 1987; Shrager et al. 1987). In other words, as parameters governing the spread of activation (decay rate, gain, etc.) pass through certain predictable critical values, the behavior of a connectionist model can change qualitatively. These results are important because they focus attention on this phenomenon in interpreting model results and because they provide specific guidelines for selecting meaningful parameters in designing connectionist models in applications. For example, it is often desirable for network activation to remain bounded in amplitude within a bounded portion of the network and to exert an influence for a finite amount of time. The issue thus arises in large models of how to characterize qualitatively the spatiotemporal patterns of network activation as a function of network parameters.

Previous results on phase transitions in connectionist models are limited in their applicability to networks whose weights are assumed to be essentially fixed during information processing ("static weights"). Although these previous results are often applicable to models where weights change very slowly ("slow weights," e.g., during learning), they cannot be applied directly in situations where connection strengths vary at a rate comparable to the rate at which node activation levels vary. This latter situation arises, for example, with competitive activation methods, a class of activation methods in which nodes actively compete for sources of network activation (Reggia 1985).¹ Competitive activation mechanisms have a number of useful properties, including the fact that they allow one to minimize the number of inhibitory (negatively weighted) links in connectionist networks. They have been used successfully in several recent applications of interest in AI and cognitive science, such as print-to-sound conversion (Reggia et al. 1988), abductive diagnostic hypothesis formation (Peng and Reggia 1989; Wald et al. 1989), and various constraint optimization tasks (Bourret et al. 1989; Peng et al. 1990; Whitfield et al. 1989). They currently are under investigation for potential neurobiological applications. This paper analyzes the phase transitions that occur in a class of connectionist models with rapidly varying connection strengths ("fast weights"). This class of models includes but is far more general than competitive activation mechanisms.
Phase transitions are identified and shown not only to be a function of the balance of excitation and inhibition/decay in a model, but also to depend in a predictable way on the specific form of the activation rule used (differential vs. difference equation). Computer simulations are described that demonstrate the theoretical predictions about phase transitions as well as other results.

2 General Formulation

A connectionist model consists of a set of nodes $N$ representing various potentially active states with weighted links between them. Let $a_i(t)$ be the activation level of node $i \in N$ at time $t$, and let the connection strength to node $i$ from node $j$ be given by $c_{ij}(t)$. The "output" to node $i$ from node $j$ at any time $t$ is given by $c_{ij}(t) \cdot a_j(t)$. This can be contrasted with the output signal $w_{ij} \cdot a_j(t)$ that has generally been used in previous connectionist models, where $w_{ij}$ is essentially a fixed weight (except for slow changes occurring during learning). Unlike previous connectionist models, the connection strength values $c_{ij}(t)$ used here may be very rapidly varying functions of time. In general, $c_{ij}(t) \ne c_{ji}(t)$. We assume that node fanout is substantial, that is, that each node connects to several other nodes.

¹ Competitive activation mechanisms usually have "resting weights" on connections that are fixed or slowly varying, but the actual effective connection weights/strengths of relevance are rapidly changing in response to node activation levels.

In many connectionist models, it is convenient to distinguish different types of interactions that can occur between nodes, and we do so here. Consider an arbitrary node $j \in N$. Then the set of all nodes $N$ can be divided into four disjoint subsets with respect to the output connections of node $j$:

1. $S_j = \{j\}$, that is, node $j$ receives a "self-connection." The connection strength $c_{jj}(t)$ will be restricted to being a constant $c_{jj}(t) = c_s$, which may be either positive or negative.

2. $P_j = \{k \in N \mid c_{kj}(t) > 0 \text{ for all } t\}$, that is, nodes receiving positive or excitatory connections from node $j$. The strengths of excitatory output connections from $j$ to nodes in $P_j$ may vary rapidly but are subject to the constraint

$$\sum_{k \in P_j} c_{kj}(t) = c_p \qquad (2.1)$$

where $c_p > 0$ is a constant.

3. $N_j = \{k \in N \mid c_{kj}(t) < 0 \text{ for all } t\}$, that is, nodes receiving negative or inhibitory connections from node $j$. The strengths of inhibitory output connections from $j$ to nodes in $N_j$ may vary but are subject to the constraint

$$\sum_{k \in N_j} c_{kj}(t) = c_n \qquad (2.2)$$

where $c_n < 0$ is a constant.

4. $O_j = \{k \in N \mid c_{kj}(t) = 0 \text{ for all } t\}$, that is, all other nodes in $N$ that do not receive connections from node $j$.

The parameters $c_s$, $c_p$, and $c_n$ are network-wide constants that represent the gain on self, excitatory, and inhibitory connections, respectively. Let $e_i(t)$ represent the external input to node $i$ at time $t$, let $r_i$ be a constant bias at node $i$, and let the dynamic behavior of the network be characterized by the discrete-time recursion relation for activation of node $i$

$$a_i(t + \delta) = a_i(t) + \delta \Big[ e_i(t) + r_i + \sum_{j \in N} c_{ij}(t)\, a_j(t) \Big] \qquad (2.3)$$

where $0 < \delta \le 1$ represents the fineness of time quantization. For sufficiently small values of $\delta$, equation 2.3 numerically approximates the first-order differential equation

$$\frac{d a_i(t)}{dt} = e_i(t) + r_i + \sum_{j \in N} c_{ij}(t)\, a_j(t) \qquad (2.4)$$

for reasonably well-behaved $c_{ij}$ functions (Euler's method). On the other hand, if $\delta = 1$, for example, then equation 2.3 may in general behave qualitatively differently from equation 2.4. In this special situation, if we also restrict $c_{ij}(t)$ to be nonnegative and constant for all time, then equation 2.3 becomes the linear difference equation described in Huberman and Hogg (1987).² Thus, equation 2.3 directly generalizes the formulation in Huberman and Hogg (1987) to handle rapidly varying connection strengths, to explicitly distinguish excitatory and inhibitory connection strengths, and to encompass models where activation levels are represented as first-order differential equations of the form given in equation 2.4.

² To see this let $c_s = -\gamma$, $c_p = \alpha$, $c_n = 0$, $r_i = 0$, $\delta = 1$, and $c_{ij}(t) = \alpha R_{ij}$, where $\gamma$ and $\alpha$ are the "decay" and "gain" parameters and $R_{ij}$ is a constant connection strength in Huberman and Hogg (1987).

3 Phase Transitions

The phase transitions of connectionist models based on equation 2.3 can be characterized in terms of the parameters $c_s$, $c_p$, and $c_n$. Define the total network activation at time $t$ to be $A(t) = \sum_{i \in N} a_i(t)$, and let $c = c_s + c_p + c_n$. A value $c > 0$ indicates a model in which excitatory influences dominate in the sense that total excitatory gain exceeds total inhibitory gain. A value $c < 0$ indicates that inhibitory influences dominate. As in Huberman and Hogg (1987), we consider the case of constant external input values to individual nodes so that the total external input $E(t)$ is a constant $E$. Define $R = \sum_{i \in N} r_i$.

Theorem 1. For networks governed by equation 2.3 with constant external input $E$, total network activation is given by

$$A(t + \delta) = (1 + \delta c) A(t) + \delta (E + R) \qquad (3.1)$$
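Proof. By straightforward calculation from equation 2.3,

$$\begin{aligned}
A(t + \delta) &= \sum_{i \in N} a_i(t + \delta) \\
&= A(t) + \delta \Big[ E + R + \sum_{j \in N} a_j(t) \sum_{i \in N} c_{ij}(t) \Big] \\
&= A(t) + \delta [E + R + cA(t)] \qquad \text{by equations 2.1 and 2.2} \\
&= (1 + \delta c) A(t) + \delta (E + R)
\end{aligned}$$

□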
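Theorem 1 can be checked numerically with connection strengths that change arbitrarily at every time step. The sketch below is an illustration under stated assumptions: the node update is written as activation plus $\delta$ times the sum of external input, bias and weighted inputs, the outgoing strengths of each node sum to $c_p$ on its excitatory targets and to $c_n$ on its inhibitory targets, and the network size, gains and random shares are arbitrary choices made for the check.

```python
import random


def random_strengths(n, c_s, c_p, c_n, rng):
    """Return c[i][j], the strength to node i from node j, redrawn each call.

    For every source node j the first half of the other nodes receive
    excitatory strengths summing to c_p and the rest receive inhibitory
    strengths summing to c_n; the self-connection is the constant c_s.
    """
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        half = len(others) // 2
        for members, gain in ((others[:half], c_p), (others[half:], c_n)):
            shares = [rng.random() + 1e-3 for _ in members]
            total = sum(shares)
            for i, share in zip(members, shares):
                c[i][j] = gain * share / total
        c[j][j] = c_s
    return c


def step(a, c, e, r, delta):
    """One update of every node: a_i + delta * (e_i + r_i + sum_j c_ij a_j)."""
    n = len(a)
    return [a[i] + delta * (e[i] + r[i] + sum(c[i][j] * a[j] for j in range(n)))
            for i in range(n)]


def max_discrepancy(n=6, c_s=-1.0, c_p=0.5, c_n=-0.3, delta=0.1, steps=30, seed=0):
    """Largest scaled gap between simulated and predicted total activation."""
    rng = random.Random(seed)
    e = [1.0] + [0.0] * (n - 1)   # external input to one node, so E = 1
    r = [0.1] * n                 # equal biases, so R = 0.1 * n
    a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    c_total = c_s + c_p + c_n
    worst = 0.0
    for _ in range(steps):
        c = random_strengths(n, c_s, c_p, c_n, rng)
        predicted = (1 + delta * c_total) * sum(a) + delta * (sum(e) + sum(r))
        a = step(a, c, e, r, delta)
        scale = 1.0 + sum(abs(v) for v in a)
        worst = max(worst, abs(sum(a) - predicted) / scale)
    return worst


print(max_discrepancy())
```

The printed gap should be at the level of floating point rounding, reflecting that only the column sums of the strengths enter the total, not how each node splits its output.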
Corollary 1. Connectionist models based on equation 2.3 have total network activation given by

$$A(t) = \frac{E + R}{c} \left\{ \left[ 1 + \frac{c}{E + R} A(0) \right] (1 + \delta c)^{t/\delta} - 1 \right\} \qquad (3.2)$$

where $A(0)$ is the initial total network activation. For very small values of $\delta$ such that spread of network activation is effectively governed by the differential equation 2.4, we have

$$A(t) = \frac{E + R}{c} \left\{ \left[ 1 + \frac{c}{E + R} A(0) \right] e^{ct} - 1 \right\} \qquad (3.3)$$

in the limit as $\delta \to 0$.
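The closed form for total activation can be compared with direct iteration of the recursion $A(t+\delta) = (1+\delta c)A(t) + \delta(E+R)$ of Theorem 1, and with its exponential limit. This is a minimal sketch: the function names and the sample parameter values are assumptions chosen for illustration, and the closed forms are written for $c \ne 0$ and $E + R \ne 0$.

```python
from math import exp


def iterate_total(a0, c, delta, e_plus_r, steps):
    """Iterate A <- (1 + delta*c) * A + delta * (E + R) for a number of steps."""
    a = a0
    for _ in range(steps):
        a = (1 + delta * c) * a + delta * e_plus_r
    return a


def closed_form(a0, c, delta, e_plus_r, t):
    """Discrete-time closed form for A(t), with t a multiple of delta."""
    growth = (1 + delta * c) ** round(t / delta)
    return (e_plus_r / c) * ((1 + c * a0 / e_plus_r) * growth - 1)


def continuous_limit(a0, c, e_plus_r, t):
    """Limit of the closed form as delta tends to zero."""
    return (e_plus_r / c) * ((1 + c * a0 / e_plus_r) * exp(c * t) - 1)


# Illustrative values (assumed): A(0) = 0.5, c = -0.8, E + R = 1, t = 5.
print(iterate_total(0.5, -0.8, 0.1, 1.0, 50))
print(closed_form(0.5, -0.8, 0.1, 1.0, 5.0))
print(closed_form(0.5, -0.8, 1e-4, 1.0, 5.0), continuous_limit(0.5, -0.8, 1.0, 5.0))
```

The first two printed values should agree to rounding error, and the last two should be close to each other because $(1+\delta c)^{t/\delta}$ approaches $e^{ct}$ as $\delta$ shrinks.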
Corollary 2. Connectionist models based on equation 2.3 have total network activation that asymptotically approaches a fixed point $A^* = -(E + R)/c$ whenever $-2/\delta < c < 0$.

Proof. By equation 3.2, as $t$ increases the behavior of $A(t)$ is determined by $(1 + \delta c)^{t/\delta}$, an infinite sequence that converges as $t \to \infty$ when $|1 + \delta c| < 1$, or $-2/\delta < c < 0$. Letting $A(t + \delta) = A(t) = A^*$ in equation 3.1 gives the fixed point. □

This last result indicates that when connectionist models using equation 2.3 converge, their total network activation approaches a fixed point whose value is independent of $\delta$, the fineness of time quantization. This does not, of course, imply that individual node values reach equilibrium nor that they even remain finite. For example, in networks with both excitatory and inhibitory connections it is possible that two nodes' activation levels could grow without bound, one toward $+\infty$ and one toward $-\infty$, in perfect balance so that the total network activation remains balanced.

The second corollary above indicates that systems based on equation 2.3 have two phase transitions given by $c = 0$ and $c = -2/\delta$. The first phase transition is independent of $\delta$ and indicates parameter values $c = 0$ where total excitatory and inhibitory gains are equal. This corresponds to a phase transition described in the fixed weight model of Huberman and Hogg (1987).³ However, the interpretation here is somewhat different because of the separation of excitatory ($c_p$) and inhibitory ($c_n$) influences, and the designation of $c_s$ as gain on a self-connection rather than decay. The corollary implies that inhibition must dominate ($c < 0$) for a connectionist model based on equation 2.3 to converge. If $c > 0$, that is, if excitation dominates, then the model will not converge on a fixed point: the "event horizon" (Huberman and Hogg 1987) will in general grow indefinitely in time and space.

The second phase transition, $c = -2/\delta$, indicates a limit on how much inhibition can dominate and still result in a convergent model. This phase transition was not encountered in Huberman and Hogg (1987) because of a priori assumptions concerning legal ranges of values for parameters (e.g., a "decay rate" $c_s$ where $-1 < c_s < 0$). These assumptions are not made here nor in many connectionist models [e.g., $c_s > 0$, i.e., self-excitation of nodes (Kohonen 1984), or $c_s < -1$, which makes no sense if $|c_s|$ is considered to be a "decay rate" but which makes perfect sense as gain on a self-connection (Reggia et al. 1988)].
This second phase transition recedes in importance as $\delta$ decreases in size, so that in the limit where equation 2.3 represents the first-order differential equation 2.4 this second phase transition no longer exists. Finally, we assumed that the parameters $c_s$, $c_p$, and $c_n$ are identical for each node in the above derivation as a matter of convenience. This is not essential: the same results are obtained with the relaxed condition that $c$, the sum of these three parameters, is the same for all nodes.

4 Simulations
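As a quick numerical illustration of Corollary 2 ahead of the network simulations, the recursion $A(t+\delta) = (1+\delta c)A(t) + \delta(E+R)$ for total activation can be iterated on its own for values of $c$ on either side of the two predicted transitions. This is a sketch under stated assumptions: the function name `classify`, the starting value, and the divergence threshold are illustrative choices, and the convergence test (successive totals differing by less than 0.001) is borrowed from the simulations described in this section.

```python
def classify(c, delta, e_plus_r=1.0, a0=0.0, tol=1e-3, blowup=1e6, max_steps=100000):
    """Iterate the total-activation recursion and report how it ends.

    Returns (status, steps, value) where status is "converges" once successive
    totals differ by less than tol, "diverges" once |A| exceeds blowup, and
    "undecided" if neither happens within max_steps.
    """
    a = a0
    for n in range(1, max_steps + 1):
        a_next = (1 + delta * c) * a + delta * e_plus_r
        if abs(a_next - a) < tol:
            return "converges", n, a_next
        if abs(a_next) > blowup:
            return "diverges", n, a_next
        a = a_next
    return "undecided", max_steps, a


# Upper transition at c = 0 (delta = 0.1).
for c in (-0.8, -0.01, 0.01, 0.5):
    print(c, classify(c, 0.1)[0])

# Lower transition at c = -2/delta = -4 (delta = 0.5).
for c in (-3.5, -4.5):
    print(c, classify(c, 0.5)[0])
```

With this stopping rule the reported total lies slightly short of the fixed point $-(E+R)/c$, since iteration stops as soon as the step size falls below the tolerance rather than at exact equilibrium.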
Two sets of simulations verify the predicted phase transitions and fixed points and give information about the activation patterns that occur. Networks in these simulations always have $|N| = 100$ nodes. In each simulation all nodes have the same value $r_i = r$ and a constant external input of 1.0 is applied to an arbitrary node ($E = 1.0$). Simulations terminate either when $|A(t) - A(t - \delta)| < 0.001$ [in which case the simulation is said to converge and $A(t)$ at convergence is taken as the numerical approximation of the predicted value $A^*$] or when for any arbitrary node $i$, $|a_i(t)| > 1000$ (in which case the simulation is said to diverge). All simulations were implemented using MIRRORS/II (D'Autrechy et al. 1988), a general purpose connectionist modeling system, using double precision arithmetic on a VAX 8600.

³ With $c_s = -\gamma$, $c_p = \alpha$, and $c_n = 0$ we have $\alpha/\gamma = 1$. We are essentially ignoring the connectivity parameter $\mu$ in Huberman and Hogg (1987) as their results concerning $\mu$ carry over unchanged. Most connectionist models are concerned with values $\mu > 1$ and we have assumed earlier that this holds here.

The first set of 24 simulations verify the $c = 0$ phase transition in networks with no internode inhibitory connections ($c_n = 0$). This result is of special interest to us because of its relevance to other ongoing research involving competitive activation mechanisms (Reggia et al. 1990). Each node is randomly connected to 10 other nodes. A resting weight $w_{ij}$ between 0.0 and 1.0 is randomly assigned to each connection according to a uniform distribution (resting weight $w_{ij}$ should not be confused with connection strength $c_{ij}$). The dynamically varying connection strength $c_{ij}$ to node $i$ from node $j$ is determined by

$$c_{ij}(t) = c_p \, \frac{w_{ij}\, a_i(t)}{\sum_{k \in P_j} w_{kj}\, a_k(t)} \qquad (4.1)$$

which can be seen to satisfy equation 2.1. Equation 4.1 implements a competitive activation mechanism because each node $j$ divides its total output among its immediate neighbors in proportion to neighbor node activation levels (see Reggia 1985; Reggia et al. 1988, 1990 for further explanation). Simulations have $\delta = 0.1$, $r = 0.0$, and initial node activations $1 \times 10^{-9}$ (to avoid divide-by-zero fault in equation 4.1). The value $c$ is incrementally varied between simulations from $-1.0$ to $-0.01$ and from 0.01 to 1.0 by changing $c_s$ and $c_p$. Figure 1 shows the total network activation at time of simulation termination as a function of $c$. The 12 simulations with $c < 0$ all converged on a fixed point with the numerically generated value $A(t)$ within 1% of the theoretically predicted $A^*$ value $-(E + R)/c$. For example, with $c = -0.8$ the predicted $A^* = 1.25$ was approximated by $A(t) = 1.24$ at convergence ($|\Delta A(t)| < 0.001$). The 12 simulations with $c > 0$ all diverged, that is, total network activation level grew without apparent bounds. Figure 2 shows the number of iterations $n$ until convergence ($c < 0$) or determination of divergence ($c > 0$) plotted against $c$. The phase transition at $c = 0$ is again evident. The set of 24 simulations described above was run three times with the same results; each time different random connections and randomly generated weights on those connections were used.

Figure 1: Total network activation $A(t)$ at time of termination of simulations as a function of $c$. Below the predicted phase transition at $c = 0$ all simulations converge on a value of $A(t)$ close to the predicted $A^*$ value (these nonzero values are too small to be seen precisely on the vertical scale used). Above the predicted phase transition all simulations diverge. The phase transition is quite crisp: at $c = -0.01$ convergence to 99.0 is seen ($A^* = 100.0$ predicted), at $c = 0.01$, divergence is seen. The curve shown represents 24 simulations. Repeating all of these simulations two more times starting with different random connections and resting weights produces virtually identical results.

Figure 2: Phase transition at $c = 0$ in a network with rapidly varying connection strengths ($c_n = 0$, $\delta = 0.1$, $r = 0$). Plotted are number of iterations $n$ until either convergence ($c < 0$) or until divergence ($c > 0$) as a function of $c$. The curve shown represents 24 simulations.

The second set of simulations introduces negative time-varying connection strengths and also examines the lower phase transition $c = -2/\delta$ as $\delta$ varies. Each node in the network is connected to all other nodes, and resting weights are again randomly assigned according to a uniform distribution. Now, however, resting weights lie between $-1.0$ and 1.0 so, on average, half of the internode connections are negative/inhibitory and half are positive/excitatory. The dynamically varying connection strength on excitatory connections is again determined by equation 4.1; that on inhibitory connections is determined by (4.2), which satisfies equation 2.2 (note that $c_n < 0$). A value $r = 0.1$ is used and each node's initial activation level is set at $a_i(0) = -r/c$.

Three variations of the second set of simulations are done, where the only difference between the three models tested is the value of $\delta$. The three values of $\delta$ used are 0.400, 0.500, and 0.666. For each value of $\delta$, the phase transition at $c = 0$ is again verified in a fashion similar to that described above for the first set of simulations. In addition, the lower phase transition $c = -2/\delta$ is verified in an analogous fashion (see Fig. 3). The lower phase transition is observed to move as a function of $\delta$ exactly as predicted, with phase transitions at $c = -5$, $-4$, and $-3$ when $\delta = 0.400$, 0.500, and 0.666, respectively.

In the second set of simulations, with both positive and negative connection strengths, the activation levels of individual nodes exhibit quite interesting behavior. As $c$ gradually decreases between each simulation from zero towards the lower phase transition at $-2/\delta$, at first not only $A(t)$ but also the maximum and minimum individual node activation levels steadily approach a fixed point. However, as $c$ gets closer to the lower transition point, with some simulations total network activation begins to oscillate. $A(t)$ still spirals in on the predicted $A^*$ value, but the maximum and minimum activation levels of individual nodes grow arbitrarily large. Thus, although the predicted phase transitions and fixed points for $A(t)$ are always confirmed, individual node activations might still diverge in the region near $-2/\delta$.
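A scaled-down version of the first set of simulations can be sketched in plain Python. Everything specific here is an assumption made for illustration: the competitive rule is taken to be that each node $j$ splits a total gain $c_p$ among its targets in proportion to resting weight times target activation, the update is activation plus $\delta$ times net input, and the names `build_network` and `simulate`, the seed, and the choice $c_p = 1$ are hypothetical. It is not the MIRRORS/II implementation.

```python
import random


def build_network(n_nodes=100, fanout=10, seed=0):
    """Connect each node to `fanout` random other nodes with weights in [0, 1]."""
    rng = random.Random(seed)
    targets, weights = [], []
    for j in range(n_nodes):
        others = [i for i in range(n_nodes) if i != j]
        chosen = rng.sample(others, fanout)
        targets.append(chosen)
        weights.append([rng.uniform(0.0, 1.0) for _ in chosen])
    return targets, weights


def simulate(c, c_p=1.0, delta=0.1, r=0.0, n_nodes=100, max_steps=20000, seed=0):
    """Run one simulation with no inhibitory links (c_n = 0, so c_s = c - c_p)."""
    c_s = c - c_p
    targets, weights = build_network(n_nodes, seed=seed)
    a = [1e-9] * n_nodes            # small start avoids a zero denominator
    e = [0.0] * n_nodes
    e[0] = 1.0                      # constant external input to one node
    total = sum(a)
    for step in range(1, max_steps + 1):
        net = [e[i] + r + c_s * a[i] for i in range(n_nodes)]
        for j in range(n_nodes):
            denom = sum(w * a[i] for i, w in zip(targets[j], weights[j]))
            if denom <= 0.0:
                continue            # guard only; not expected with positive activations
            for i, w in zip(targets[j], weights[j]):
                c_ij = c_p * w * a[i] / denom   # assumed competitive rule
                net[i] += c_ij * a[j]
        a = [a[i] + delta * net[i] for i in range(n_nodes)]
        new_total = sum(a)
        if abs(new_total - total) < 1e-3:
            return "converged", step, new_total
        if max(abs(v) for v in a) > 1000.0:
            return "diverged", step, new_total
        total = new_total
    return "undecided", max_steps, total


print(simulate(-0.8))
print(simulate(0.5)[0])
```

For $c = -0.8$ the total should settle close to the predicted $-(E+R)/c = 1.25$, and for $c = 0.5$ some node eventually exceeds the divergence threshold; which nodes end up holding the activation depends on the random wiring.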
Figure 3: Variation of lower phase transition $c = -2/\delta$ as $\delta$ varies. Shown here is total number of iterations $n$ until convergence/divergence as a function of $c$. Values of $\delta$ used are 0.400 (solid line; predicted phase transition at $c = -5.0$), 0.500 (dotted line; predicted phase transition at $c = -4.0$), and 0.666 (mixed dots and dashes; predicted phase transition at $c = -3.0$).

5 Discussion
With the growing use of large connectionist models, awareness of phase transitions is becoming increasingly important. Accordingly, this paper has characterized through both analysis and computer simulations the phase transitions in connectionist models using a class of activation rules described by equation 2.3. In the special case where $\delta$ = 1 and weights are appropriate nonnegative constants, equation 2.3 reduces to the model considered in Huberman and Hogg (1987). In this special case the results obtained here are consistent with those in Huberman and Hogg (1987) and Shrager et al. (1987), although an additional phase transition, not reported previously, exists when a priori assumptions about parameter values are relaxed. In addition, equation 2.3 encompasses a wide range of additional models with constant connection strengths in which spread of activation is determined by differential equations (equation 2.4), and which have a wider range of parameters over which the model produces useful behavior (i.e., as $\delta$ decreases the phase transition at c = $-2/\delta$ recedes in importance).
Phase Transitions in Connectionist Models
By allowing rapidly varying connection strengths, equation 2.3 is also directly applicable to connectionist models where inhibitory effects are produced not by static inhibitory connections but by dynamic competition for activation. Many associative networks in AI and cognitive science theories do not have inhibitory connections, so implementing these theories as connectionist models raises the difficult issue of where to put inhibitory connections and what weights to assign to them (see Reggia 1985; Reggia et al. 1988, 1990 for a discussion). A competitive activation mechanism resolves this issue by eliminating inhibitory connections between nodes (c, = 0). According to the results in this paper, such models should require strong self-inhibitory connections to function effectively ($c_s < -c_p$). This conclusion is interesting because in some previous applications of competitive activation mechanisms it was observed that strong self-inhibition ($c_s \le -2$ when $c_p = 1$) was necessary to produce maximally circumscribed network activation (Reggia et al. 1988; Peng et al. 1990). This empirical observation was not understood at the time. Although these connectionist models used activation rules comparable to but somewhat more complex than equation 2.3, it seems reasonable to explain the more diffuse activation patterns seen with $c_s \ge -1$ as occurring because the model was operating near a phase transition (c = 0).
As noted earlier, it is important to recognize that the convergence of A(t) on a fixed point when $-2/\delta$ < c < 0 does not guarantee that each individual node's activation approaches a fixed point nor even that it is bounded. In this context it is interesting to note that as c approached the lower phase transition $-2/\delta$ in networks with inhibitory connections (second set of simulations), individual node activation levels sometimes grew without apparent bounds.
The value A(t) still approximated the predicted A* value because positive and negative node activations balanced one another. This raises the question of whether an additional phase transition between $-2/\delta$ and 0 exists, below which individual node activations might not be bounded. Such a phase transition might be derived, for example, if one could determine the convergence range for $\sum_i a_i^2$ rather than for $A = \sum_i a_i$ as was done in this paper. Such a determination is a difficult and open task. As a practical consideration, the problem of balanced divergent node activations can be minimized by using a value of T > 0, as in the second set of simulations. This has the effect of shifting node activation levels in a positive direction, thereby avoiding the balancing of positive and negative node activation levels. It should also be noted that the problem of unbounded growth of individual node activation in networks with fixed weights has been addressed elsewhere (Hogg and Huberman 1987, pp. 296-297). Introducing nonlinearities (e.g., hard bounds on node activation levels) can lead to saturation of node activation levels and moving wavefronts that propagate throughout a network. Finally, it should be noted that a variety of recent work with biologically oriented random/stochastic neural network models has also
identified phase transitions (e.g., Chover 1988; Kryukov and Kirillov 1989; Kurten 1987). This related work has focused on parameters not considered here, such as the probability of neuron firing or neuron threshold. This related work, as well as that of Huberman and Hogg (1987) and Shrager et al. (1987) and the results presented in this paper, is collectively providing a better understanding of phase transitions in connectionist models across a broad spectrum of models and applications.
Acknowledgments
Supported in part by NSF award IRI-8451430 and in part by NIH award NS29414.
References
Bourret, P., Goodall, S., and Samuelides, M. 1989. Optimal scheduling by competitive activation: Application to the satellite antennas scheduling problem. Proc. Int. Joint Conf. Neural Networks, IEEE I, 565-572.
Chover, J. 1988. Phase transitions in neural networks. In Neural Information Processing Systems, D. Anderson, ed., pp. 192-200. American Institute of Physics, New York, NY.
D'Autrechy, C. L., Reggia, J., Sutton, G., and Goodall, S. 1988. A general purpose simulation environment for developing connectionist models. Simulation 51, 5-19.
Hogg, T., and Huberman, B. 1987. Artificial intelligence and large scale computation: A physics perspective. Phys. Rep. 156(5), 227-310.
Huberman, B., and Hogg, T. 1987. Phase transitions in artificial intelligence systems. Artificial Intelligence 33, 155-171.
Kohonen, T. 1984. Self-Organization and Associative Memory, Ch. 5. Springer-Verlag, Berlin.
Kryukov, V., and Kirillov, A. 1989. Phase transitions and metastability in neural nets. Proc. Int. Joint Conf. Neural Networks IEEE I, 761-766.
Kurten, K. 1987. Phase transitions in quasirandom neural networks. Proc. Int. Joint Conf. Neural Networks IEEE II, 197-204.
Peng, Y., and Reggia, J. 1989. A connectionist model for diagnostic problem solving. IEEE Trans. Syst. Man Cybernet. 19, 285-298.
Peng, Y., Reggia, J., and Li, T. 1990. A connectionist solution for vertex cover problems. Submitted.
Reggia, J. 1985. Virtual lateral inhibition in parallel activation models of associative memory. Proc. Ninth Int. Joint Conf. Artificial Intelligence, Los Angeles, CA, 244-248.
Reggia, J., Marsland, P., and Berndt, R. 1988. Competitive dynamics in a dual-route connectionist model of print-to-sound transformation. Complex Syst. 2, 509-547.
Reggia, J., Peng, Y., and Bourret, P. 1990. Recent applications of competitive activation mechanisms. In Neural Networks: Advances and Applications, E. Gelenbe, ed. North-Holland, Amsterdam, in press.
Shrager, J., Hogg, T., and Huberman, B. 1987. Observation of phase transitions in spreading activation networks. Science 236, 1092-1094.
Wald, J., Farach, M., Tagamets, M., and Reggia, J. 1989. Generating plausible diagnostic hypotheses with self-processing causal networks. J. Exp. Theor. Artificial Intelligence 1, 91-112.
Whitfield, K., Goodall, S., and Reggia, J. 1989. A connectionist model for dynamic control. Telematics Informatics 6, 375-390.
Received 22 May 90; accepted 10 August 90.
Communicated by Gunther Palm
An M-ary Neural Network Model R. Bijjani P. Das Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180-3590 USA
An M-ary neural network model is described, and is shown to have a higher error correction capability than the bipolar Hopfield neural net. The M-ary model is then applied to an image recognition problem.
Neural Computation 2, 536-551 (1990) © 1990 Massachusetts Institute of Technology
1 Introduction
Living organisms are very efficient in performing certain tasks, such as pattern recognition and adaptive learning. The increase in demand for the development of a machine capable of performing similar tasks caused an acceleration in the pace of brain research. The last four decades have yielded many theories and mathematical models loosely describing the operation of the neurons, the basic computational elements and building blocks of the brain. These models came to be collectively known as the neural network models. All neural nets share the common characteristic of processing information in a largely collective manner, as opposed to the predominantly serial manner of conventional computers. This collective nature arises from the complex structure of massively interconnected nerve cells or neurons. The brain stores information by modifying the strengths of the interconnections, or synaptic weights, between the neurons. The neurons themselves preserve no information; their role is limited to acting as simple nonlinear decision-making elements. Memory or recognition is the information retrieval process in which the brain categorizes incoming information by matching it up with its stored data. For successful recognition, the incoming information must present a cue or association to the desired stored image, hence the term associative memory. Neural nets can be classified according to their desired functions (associative memory, classifiers, etc.), the characteristics of their processing elements, the network layout or topology, and the learning rules (Athale 1987; Lippman 1987). This paper describes an M-ary extension of the Hopfield model (Hopfield 1984). The statistical properties of the new model are calculated from the point of view of error probability and error correction capability. The M-ary model's performance is then compared to that of the bipolar Hopfield model and is found to possess a higher error correction capability. Yet the Hopfield model is determined to possess a faster convergence rate for the case of moderately (less than about 40%) corrupted signals. Lastly, an application in image processing using the M-ary model is presented. The experimental outcomes are found to be in agreement with the theory presented.
2 The M-ary Model
The M-ary neural network is an associative memory system capable of distinguishing a highly distorted version of a stored vector from other retained information. The network consists of N neurons, where each neuron can exist in one of M distinct states. The fundamental algorithm for an M-ary system is as follows: a set of p M-ary vectors, each of length N, are to be stored for future recollection. The vector components are represented by $\xi_i^{\mu}$, which signifies the ith component of the $\mu$th vector. The network is assumed to be globally interconnected, that is, all neurons are pairwise connected by synaptic weights. The vectors are stored collectively by modifying the values of the synaptic matrix J, whose elements $J_{ij}$ represent the weights of the interconnections between the elements or nodes i and j. $J_{ij}$ is computed as the outer product of the stored vectors as follows:

$$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^{\mu} \xi_j^{\mu} \tag{2.1}$$

where i, j = 1, 2, ..., N. For satisfactory performance, we require that the components of the stored vectors be independent and identically distributed, that is, the vector components are permitted to take any of the equiprobable values $a_k$ drawn from an alphabet of size M. In other words:

$$P\{\xi_i^{\mu} = a_k\} = \frac{1}{M} \tag{2.2}$$

where i = 1, 2, ..., N; $\mu$ = 1, 2, ..., p; and k = 1, 2, ..., M. A content addressable memory system recovers a stored vector $\xi^{\nu}$, which it most closely associates with the input vector. We shall examine two likely occurrences. The first is when the system is initially presented at time $t = t_0$ with an input vector $S(t_0)$ contaminated with additive white gaussian noise, where the jth element is

$$S_j(t_0) = \xi_j^{\nu} + n_j(t_0)$$

the noise component $n_j(t_0)$ being a zero-mean gaussian random variable with a power spectral density $N_0/2$. The second occurrence is when the input vector $S(t_0)$ differs from the desired vector $\xi^{\nu}$ in d elements and is identical to $\xi^{\nu}$ in the remaining (N - d) elements. Let the output after some time $t \ge t_0$ be S(t + 1), with the index (t + 1) representing the time at one neural cycle time (iteration) following the discrete time t. The retrieval process does not require any synchronization. The individual neurons observe the following routine in a rather random and asynchronous manner:

$$S_i(t+1) = f[h_i(t+1)] \tag{2.3}$$

where

$$h_i(t+1) = \sum_{j \neq i} J_{ij} S_j(t) \tag{2.4}$$

and where the activation function f(h) shown in Figure 1 represents a multilevel thresholding operation defined as

$$f(h) = a_k \quad \text{if and only if} \quad \frac{\theta_{k-1} + \theta_k}{2} < h \le \frac{\theta_k + \theta_{k+1}}{2} \tag{2.5}$$

Figure 1: Multilevel threshold operation.
where the values $\theta_k$ for all k = 1, 2, ..., M are appropriately defined threshold values:

$$\theta_k = a_k \sigma_a^2$$

with $\theta_0 = -\infty$ and $\theta_{M+1} = +\infty$, and where $\sigma_a^2 = \frac{1}{M} \sum_{k=1}^{M} a_k^2$. This procedure is iteratively repeated until S(t + 1) = S(t). The choice of a specific set $\{a_k\}$ of possible neuron outputs is essential for the quantitative analysis of the system's statistics. Here, and without any loss of generality, an alphabet is selected where the elements are evenly separated by a distance of $\eta$ and have a mathematical mean of zero. In other words,

$$a_k = \frac{2k - M - 1}{2}\,\eta, \qquad k = 1, 2, \ldots, M$$

so that the letters run from $a_1 = -\frac{M-1}{2}\eta$ to $a_M = \frac{M-1}{2}\eta$. The separation $\eta$ can be expressed in terms of $\sigma_a$ to be

$$\eta = 2 \sigma_a \sqrt{\frac{3}{M^2 - 1}}$$
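The storage rule, the multilevel threshold, and the asynchronous retrieval loop described above can be illustrated with a short simulation. The sketch below is a minimal pure-Python reading of the model under stated assumptions, not the authors' implementation: it assumes the 1/N normalization of the outer-product weights, no self-coupling term, an alphabet listed in increasing order, and decision boundaries at the midpoints of the letters scaled by the alphabet variance. All function names are hypothetical.

```python
import random


def alphabet(M, eta=2.0):
    """Evenly spaced, zero-mean alphabet in increasing order (assumed ordering)."""
    return [(2 * k - M - 1) / 2 * eta for k in range(1, M + 1)]


def variance(levels):
    """sigma_a^2 = (1/M) * sum_k a_k^2 for a zero-mean alphabet."""
    return sum(a * a for a in levels) / len(levels)


def store(vectors):
    """Outer-product weights J[i][j] = (1/N) * sum_mu xi_i^mu * xi_j^mu."""
    N = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) / N for j in range(N)]
            for i in range(N)]


def threshold(h, levels):
    """Multilevel threshold: pick the letter a_k whose scaled value
    sigma_a^2 * a_k lies nearest to h (midpoint decision boundaries)."""
    s2 = variance(levels)
    return min(levels, key=lambda a: abs(h - s2 * a))


def hamming(u, v):
    """Number of components in which two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def recall(J, probe, levels, rng, max_sweeps=20):
    """Asynchronous retrieval: update neurons one at a time in random order
    until a full sweep changes nothing (or max_sweeps is reached)."""
    s = list(probe)
    N = len(s)
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for i in rng.sample(range(N), N):
            h = sum(J[i][j] * s[j] for j in range(N) if j != i)  # no self term
            new = threshold(h, levels)
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            return s, sweep
    return s, max_sweeps


if __name__ == "__main__":
    rng = random.Random(0)
    M, N, p = 4, 144, 6
    levels = alphabet(M)
    stored = [[rng.choice(levels) for _ in range(N)] for _ in range(p)]
    J = store(stored)
    probe = list(stored[0])
    for i in rng.sample(range(N), 58):  # distort roughly 40% of the components
        probe[i] = rng.choice([a for a in levels if a != probe[i]])
    out, sweeps = recall(J, probe, levels, rng)
    print("errors before:", hamming(probe, stored[0]),
          "after:", hamming(out, stored[0]), "sweeps:", sweeps)
```

The demonstration uses N = 144 and p = 6, the values quoted later for the numerical results; whether a given probe is recovered depends on the random draw, so no particular outcome is claimed here.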
If another alphabet is chosen where $\eta$ now represents the minimum separation between all adjacent letters, then a different relationship would exist between $\eta$ and $\sigma_a$, but the rest of the analysis would not change. Let the average probability of having any of the vector components in error at any time $t > t_0$ be denoted as P(t), where

$$P(t) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Prob}[S_i(t) \neq \xi_i^{\nu}]$$

and $\xi^{\nu}$ represents the stored vector. Note that $\mathrm{Prob}[S_i(t) \neq \xi_i^{\nu}]$ is independent of i, because the $\xi_i^{\mu}$'s are independent and identically distributed random variables as in equation 2.2. After the first iteration we compute $P(t_0 + 1)$ = -2
1 '1- 1
nr
w
where Q signifies the complementary error function. The derivation of equation 2.7 is presented in Appendix A. For the case where the noise is zero, equation 2.7 reduces to
Equation 2.8 describes the error probability due only to the internal interference among the stored vectors. If a network's size is fixed, that is, if N is a constant, then the error probability increases with increasing p. This produces an inherent limitation on the successful operation of the associative memory neural network. As the system attempts to store more vectors, its retrieving capability weakens as the error probability increases. A large value for the bit error probability signifies that the network may not converge to the desired stored state even when presented with an uncorrupted version of that same state. The maximum number of stored vectors permitted is what we shall refer to as the storage capacity. For normal operation, select N, p, and M in such a way that $P(t_0 + 1) \ll 1$. After the second iteration, we compute the following upper bound for the error probability:

$$P(t+1) < 2\,\frac{M-1}{M}\, Q\left\{ \sqrt{Z(t)}\,\bigl[1 - M P(t)\bigr] \right\} \tag{2.9}$$

where

$$Z(t) = \frac{3 (N-1)^2 / (M^2 - 1)}{\alpha + \beta P(t) + \gamma P^2(t)} \tag{2.10}$$

and
=
1)
b ( M - 2)(M - 3)
Y
=
-?(=)
M
2
+ 2(2M
-
1)(M+ 3) -
N
(2.11)
Equations 2.9-2.11 are derived in Appendix B. Notice that for the case when P(t) = 0, the situation is equivalent to the case when the noise is zero at time $t_0$, and equation 2.9 reduces to equation 2.8. Equations 2.7-2.11 convey a considerable amount of information concerning the error probability, the storage capacity, and the error correction capability of the system. The storage capacity $\Phi$ is defined here to represent the ratio of the maximum possible number of stored vectors $p_{max}$ to the vector length N, before the system converges to within a certain preset error probability. We set the error probability to a certain value and then, using equation 2.7, we compute the storage capacity $\Phi = p_{max}/N$ versus the noise spectral density $N_0$. In Figure 2a, we plot $\Phi$ vs. $N_0$ for different values of $P(t_0 + 1)$ [$P(t_0 + 1)$ = lo-', lo-', and $10^{-4}$], first for M = 2 and then for M = 4. Note that the storage capacity decreases with increasing noise and with decreasing error probability. In other words, the network is capable of storing fewer vectors if they are to be retrieved with higher precision. Note that a lower error probability after the first iteration means that the system would require fewer iterations for convergence, as is evident from equation 2.9. Consequently the storage capacity also decreases with faster convergence. As for the effect of the alphabet size: in Figure 2b, we again plot $\Phi$ vs. $N_0$, this time using M as a parameter instead of $P(t_0 + 1)$. Note that $p_{max}$ decreases with increasing M. Figure 2c shows the error probability $P(t_0 + 1)$ vs. the storage capacity for M = 2 and M = 4. Here we again see that the error increases with the storage capacity and the alphabet size. Figure 2d shows $P(t_0 + 1)$ vs. $N_0$. Here again, the error increases with increasing noise and alphabet size.

Figure 2: (a) Storage capacity vs. noise for M = 2 and M = 4.

The convergence of this model, although guaranteed, does not ensure convergence to the desired vector (Hopfield 1984). We have correct convergence iff P(t + 1) < P(t). The error correction capability is defined as the maximum number of elements that could be wrong at any one time, and still have the system accurately converge to the desired value. Define the Hamming distance d(t) between the two vectors S(t) and $\xi^{\nu}$ as the total number of elements in which these vectors differ. Utilizing equations 2.9 and B.10, and the condition $d(t + 1) \le d(t)$, we compute an upper bound for the error correction capability $d_0$ of [(M - 1)/M]N. In other words the system would be able to correctly recognize an input vector distorted by up to $d_0$ components, where

$$d(t) < d_0 = \frac{M-1}{M}\,N \tag{2.12}$$
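As a numerical illustration of the bound in equation 2.12, the sketch below evaluates $d_0$ for N = 144 with M = 2 and M = 4, and compares it with the average Hamming distance between pairs of independent, uniformly random M-ary vectors. The function names and the trial count are assumptions made for illustration only.

```python
import random


def correction_bound(N, M):
    """Upper bound d0 = (M - 1) / M * N on the error correction capability."""
    return (M - 1) * N / M


def mean_hamming(N, M, trials, rng):
    """Average Hamming distance between pairs of independent random M-ary
    vectors, with letters represented by the integer codes 0..M-1."""
    total = 0
    for _ in range(trials):
        u = [rng.randrange(M) for _ in range(N)]
        v = [rng.randrange(M) for _ in range(N)]
        total += sum(1 for a, b in zip(u, v) if a != b)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for M in (2, 4):
        d0 = correction_bound(144, M)
        estimate = mean_hamming(144, M, 200, rng)
        print(f"M = {M}: d0 = {d0:.0f}, simulated mean distance = {estimate:.1f}")
```

For N = 144 the bound evaluates to 72 components (50%) for M = 2 and 108 components (75%) for M = 4.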
Figure 2: (b) Storage capacity vs. noise for fixed error probability (panels for P = 0.1 and P = 0.001).

For any two independent vectors, the probability of having their jth components different is (M - 1)/M, and the expected value of the Hamming distance is $d_0 = [(M - 1)/M]N$. Therefore if the input vector differs from the desired stored vector by less than $d_0$, on the average, then accurate recollection is possible, since the input would more closely resemble the desired vector than all the others. Hence equation 2.12. Assume that

$$E\{d(t)\} = N P(t) \tag{2.13}$$
These assumptions are valid if the elements of S(t) can be considered independent for all time t. In general this is not the case, since the elements taking the values of $a_1$ and $a_M$ suffer only one-half the distortion expected by the other elements. An exact expression for the statistics of d(t) can be calculated taking into account the nonindependence assumption. However for proper operation we require that $P(t + 1) \le P(t)$, or that $P(t) \to 0$ as $t \to \infty$. Under this assumption of $P(t) \ll 1$, which is required for convergence, the elements of S(t) can be approximated to be independent, and d(t) can be assumed to have a binomial distribution. In Figure 3 we plot the normalized average Hamming distance E{d(t)/N} = P(t) vs. t, using equation B.10. For numerical purposes we assume that N = 144 and p = 6. Note that for $d(t + 1) < d_0$ the system correctly converges and $d(t) \to 0$. Figure 3a shows a Hopfield model operating near its theoretical capacity of 50%, while Figure 3b shows an M-ary model, with M = 4, operating near its capacity of 75%. One also observes that the number of iterations required steadily increases as the starting number of errors at time $t_0$ approaches $d_0$. When $d(t_0) = d_0$, the network oscillates between independent states or converges to an incorrect or spurious state, and on the average $d(t) = d_0$ for all time. When the network starts with more errors than $d_0$, then statistically the system does not necessarily converge to the correct state, but to some other state, a spurious state, or just oscillates between two or more states. From Figure 3c, observe that while the bipolar model fails to reconstruct signals distorted by more than 50%, the 4-ary model does so with, on the average, only two or three iterations. With an initial distortion of 40%, both the binary and the 4-ary models successfully reconstruct the signal with only a few iterations, as shown in Figure 3d. Notice that the binary model offers a slightly faster convergence than the 4-ary model. So although the M-ary model offers superior performance for signals distorted by more than 50%, the Hopfield model is more advantageous for less distorted signals. An experimental value of around 43% was determined as the threshold above which a 4-ary system outperforms the bipolar one. Therefore if a system is to operate in a low noise environment, the Hopfield model is recommended as it offers faster convergence and more storage capacity at a lower total cost.

Figure 2: (c) Error probability vs. storage capacity. (d) Error probability vs. noise.

Figure 3: Theoretical average Hamming distance vs. time.

We conducted an experiment in image recognition, using the above model for M = 2 and for M = 4. In the experiment a CCD camera is used to supply analog video signals of the images to be saved to a low cost frame-grabber (Data Translation, model DT-2851). In our experiment, the frame-grabber divides the image into a grid of 12 x 12 pixels (N = 144) and its sampling circuitry then measures the average brightness of each pixel and converts it into a 1-bit (M = 2) or a 2-bit (M = 4) binary value. The setup can convert images into 512 x 512 x 8-bit pixels (N = 256K and M = $2^8$). Figure 4 describes the normalized Hamming distance versus time for the situation where $d(t_0) = 0.7N$. The experimental results, representing the average performance of 10 experiments with different initial states, are plotted against the upper-bounded theoretical curve of equation 2.9.
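The conversion of an image into an N = 144 component M-ary vector, as described for the experiment, can be sketched as follows. This is an illustrative software stand-in for the frame-grabber step, under stated assumptions: the image is a list of rows of gray values in 0..255, its dimensions are multiples of the grid size, and brightness is quantized into M uniform bins. The function names and the uniform binning are assumptions; the source does not describe the sampling circuitry in this detail.

```python
def block_average(image, grid=12):
    """Average brightness over each cell of a grid x grid partition.
    Assumes the image height and width are multiples of `grid`."""
    rows, cols = len(image), len(image[0])
    bh, bw = rows // grid, cols // grid  # cell height and width in source pixels
    cells = []
    for r in range(grid):
        for c in range(grid):
            block = [image[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            cells.append(sum(block) / len(block))
    return cells


def quantize(cells, M, max_value=255):
    """Map each average brightness to an integer code 0..M-1 (uniform bins)."""
    return [min(M - 1, int(v * M / (max_value + 1))) for v in cells]


def codes_to_levels(codes, M, eta=2.0):
    """Map codes 0..M-1 onto a zero-mean alphabet with spacing eta."""
    return [(2 * (k + 1) - M - 1) / 2 * eta for k in codes]


if __name__ == "__main__":
    # Synthetic 24 x 24 test image: dark left half, bright right half.
    image = [[255 if x >= 12 else 0 for x in range(24)] for y in range(24)]
    cells = block_average(image)
    for M in (2, 4):
        vector = codes_to_levels(quantize(cells, M), M)
        print(f"M = {M}: N = {len(vector)}, letters used = {sorted(set(vector))}")
```

With M = 2 each component carries one bit and with M = 4 two bits, matching the 1-bit and 2-bit cases of the experiment.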
Figure 4: Experimental average Hamming distance vs. time. Legend: theoretical and experimental curves.
Figure 5: (a) Original test images. Panels: original binary image; original 4-ary image.

The results confirm that equation 2.9 presents a reasonably tight upper bound. Figure 5a represents one of the stored images, showing both the binary and the 4 gray-level original images. Figure 5b shows these same images after they were distorted by 50%. We achieved this distortion by randomly changing the value of each pixel with a 50% probability. For the 4-ary image, correct convergence was achieved after only one iteration, while for the binary image, the network converged after four iterations to an incorrect state. Again in Figure 5c, the 4-ary image was distorted by 70% and was then successfully reconstructed after four iterations. Note that for M = 2 the system recognizes a signal distorted by 50% or less, while for M = 4 a signal distorted by up to 75% was successfully retrieved, as predicted by equation 2.12.

Figure 5: (b) Binary vs. M-ary results with 50. Panels: binary image with 50% distortion; after 4 iterations, system converges to incorrect state; 4-ary image with 50% distortion; after 1 iteration, system converges to correct state.

A one-dimensional holographic implementation of the Hopfield model is presented and discussed in White and Wright (1988). Here a modified version of the model in White and Wright (1988), which can be applied to two-dimensional M-ary signals, is presented and is shown for the case of a 2 x 2 image (N = 4) in Figure 6. A light beam $x_{ij}$, proportional in amplitude to $S_{ij}$, hits the ijth holographic element and is dispersed as in Figure 6a, which shows the effect of only one element. The values for J are computed as in equation 2.1 and are used in the construction of the ijth holograms. Figure 6b shows the cumulative effect of having all the holographic elements simultaneously responding to the vector V. The output in Figure 6b, which represents the intermediate vector h, is electronically detected and operated on by a multithresholding function as in equation 2.5, and is then fed back into an LED array, which gives rise to the new input in Figure 6.
Figure 5: (c) M-ary image with 70. Panels: 4-ary image with 70% distortion; image after third iteration; after 4 iterations, system converges to correct state.

Figure 6: Holographic implementation of the M-ary model.
Summary
The M-ary model presents a novel technique for recovering heavily distorted images. The increase in error correction capability comes, however, at the expense of decreased storage capacity and a slight decrease in the convergence rate. Therefore, in a practical system, the error correction capability should be weighed against the storage capacity and the desired convergence rate before the size of the alphabet is decided. As for the particular alphabet to be used, an equally spaced alphabet is recommended and used here. Finally, a two-dimensional holographic implementation is presented.
Appendix A Derivation of Equation 2.7
As described before, we can write $P(t_0 + 1)$ as

$$P(t_0 + 1) = \mathrm{Prob}[S_i(t_0 + 1) \neq \xi_i^{\nu}]$$

Under the assumptions of independence and identical distribution, applying equation 2.5 to the above equation we get
P(to+l)
=
-c
l M k=l
(A.1)
Therefore, in order to determine $P(t_0 + 1)$, we need to determine the statistical properties of $h_i(t_0 + 1)$. With $S(t_0) = \xi^{\nu} + n$, equation 2.4 becomes
which can be expressed as
For convenience let the first term in equation A.3 be denoted as ft, the second as st, and the third term as tt, so that $h_i(t_0 + 1) = ft + st + tt$. The statistical mean of ft is computed to be zero, since the vector components of the stored vectors are independent with a zero mean. Similarly for tt, the expected value E{tt} is equal to zero. As for the second term,

$$E\{st\} = \frac{N-1}{N}\,\sigma_a^2\,\xi_i^{\nu}$$

Then,

$$E\{h_i(t_0 + 1)\} = \frac{N-1}{N}\,\sigma_a^2\,\xi_i^{\nu}$$

Next we find the variances. For the first term in equation A.3,

$$\mathrm{var}\{ft\} = \frac{(N-1)(p-1)}{N^2}\,\sigma_a^4\,(\sigma_a^2 + \sigma_n^2)$$
as for the second term,

$$\mathrm{var}\{st\} = \frac{4}{5}\,\frac{M^2 - 4}{M^2 - 1}\,\frac{N-1}{N^2}\,\sigma_a^6$$
and for the third term,
Therefore the variance of $h_i(t_0 + 1)$ is
The error probability is hence calculated to be P(to+l) = 1-
-c
l M t=l
with $\sigma_n^2 = N_0/2$, equation A.6 reduces to equation 2.7. Q.E.D. Note that in the above derivation we assume that the vector components of the stored vectors are independent and identically distributed as in equation 2.2.
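The alphabet moments that enter these variances can be checked exactly. The sketch below assumes the evenly spaced, zero-mean alphabet with unit spacing and uses exact rational arithmetic to confirm three identities: $\sigma_a^2 = \eta^2 (M^2 - 1)/12$, the fourth-moment relation $m_a^{(4)} = \frac{3}{5}\,\frac{3M^2 - 7}{M^2 - 1}\,\sigma_a^4$ quoted in Appendix B, and the resulting variance of $\xi^2$, which is where the factor $\frac{4}{5}\,\frac{M^2 - 4}{M^2 - 1}$ in var{st} would come from under this reading. The helper names are hypothetical.

```python
from fractions import Fraction


def levels(M):
    """Zero-mean alphabet with unit spacing: a_k = (2k - M - 1) / 2."""
    return [Fraction(2 * k - M - 1, 2) for k in range(1, M + 1)]


def moment(M, order):
    """Moment (1/M) * sum_k a_k**order of the equiprobable alphabet."""
    return sum(a ** order for a in levels(M)) / M


def check_identities(M):
    """Return True if the three closed forms hold exactly for alphabet size M."""
    var = moment(M, 2)  # sigma_a^2 with eta = 1
    m4 = moment(M, 4)   # fourth moment of the alphabet
    ok_var = var == Fraction(M * M - 1, 12)
    ok_m4 = m4 == Fraction(3, 5) * Fraction(3 * M * M - 7, M * M - 1) * var ** 2
    ok_sq = m4 - var ** 2 == Fraction(4, 5) * Fraction(M * M - 4, M * M - 1) * var ** 2
    return ok_var and ok_m4 and ok_sq


if __name__ == "__main__":
    for M in (2, 3, 4, 8, 16):
        print(M, check_identities(M))
```

For M = 2 the variance of $\xi^2$ vanishes, since both letters have the same magnitude, so the signal term contributes no variance in the bipolar case.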
Appendix B Derivation of Equations 2.9-2.11
After the second iteration and for $t \ge t_0 + 2$, equation A.1 can be rewritten as
l M a k - 1 fa k a k f a k + l P ( t ) = - CProb[h,(t) 4 ( 2 ' 2 IM k = l
= ak]
(B.1)
Here again the statistical properties of h(t) are needed in order to determine the error probability P(t). After performing the thresholding operation, S(t) would be composed of either correct or incorrect components with discrete values $a_k$ instead of the continuous valued $S(t_0) = \xi^{\nu} + n$. Let d(t) denote the total number of components in error at time t. d(t) is a random variable and is assumed to have a binomial distribution, where

$$E\{d(t)\} = N P(t-1) \tag{B.2}$$

and

$$\mathrm{var}\{d(t)\} = N P(t-1)\,[1 - P(t-1)] \tag{B.3}$$

$h_i(t)$ can be written as,
E{pjd(t - 1) = d } = (N- q ( 1 - L&) M - I N u: and E { T }= 0. Therefore, N-1 N
M M-1
03.5)
The variance of h i ( t )can be expressed as var{h,(t)} = (uf uT can
+ uia:)/N2
be easily computed to be
fff
=
(N- l ) ( p - l)a,"
After considerable algebra we determine u,,,
M2-4
(B.8)
where

$$m_a^{(4)} = \frac{1}{M} \sum_{k=1}^{M} a_k^4 = \frac{3}{5}\,\frac{3M^2 - 7}{M^2 - 1}\,\sigma_a^4$$
and $\beta$ and $\gamma$ are as defined in equation 2.11. Substituting the values of $\sigma_T$ and $\sigma_\rho$ from equations B.7 and B.8 into equation B.6 we get

$$\mathrm{var}\{h_i(t)\} = \sigma_h^2 = \frac{\sigma_a^6\,[\alpha + \beta P(t-1) + \gamma P^2(t-1)]}{N^2} \tag{B.9}$$

Again the values $\alpha$, $\beta$, and $\gamma$ are the same as in equation 2.11.
Proceeding as in equation A.6, we evaluate the error probability at time t:
c [Q I +Q [
+
&[1+ Al(2-
L "'-I =X+I
&[l
-
!I-1
A1
M(2-
-
1
-
1 ) P ( f- l)]]
k-1 - 1 ) P ( t - I)]]]} N -1
(8.10)
Z is defined in equation 2.10. Equation B.10 represents an exact expression for P(t). In order to reduce the amount of calculations involved, we obtain an upper bound for P(t) by noting that $Q[A(1 + \epsilon)] \le Q[A(1 - \epsilon)]$, $\forall A, \epsilon \ge 0$. Therefore,
+
2 P ( t ) 5 - {Q [&[1
nl
-
iIlP(t - I)]]
References
Athale, R. A. 1987. Neural net models and optical computing. Notes from a course presented at the OSA Annual meeting, Rochester, NY, October.
Lippman, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. April, 4-22.
Hopfield, J. J. 1984. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 81, 2554-2558.
White, H. H., and Wright, W. A. 1988. Holographic implementation of a Hopfield model with discrete weightings. Appl. Opt. 27(2), 331-338.
Received 6 December 89; accepted 10 August 90.
Index Volume 2 By Author
Adams, J. L. A Complementarity Mechanism for Enhanced Pattern Processing (Letter) 2(1):58-70
Akers, L. A. - See Rao, A.
Arndt, M. - See Eckhorn, R.
Atick, J. J. and Redlich, A. N. Towards a Theory of Early Visual Processing (Letter) 2(3):308-320
Bachrach, J. - See Mozer, M. C.
Baldi, P. and Meir, R. Computing with Arrays of Coupled Oscillators: An Application to Preattentive Texture Discrimination (Letter) 2(4):458-471
Ballard, D. H. - See Whitehead, S. T.
Baum, E. B. The Perceptron Algorithm is Fast for Nonmalicious Distributions (Letter) 2(2):248-260
Baum, E. B. A Polynomial Time Algorithm That Learns Two Hidden Unit Nets (Letter) 2(4):510-522
Baxt, W. G. Use of an Artificial Neural Network for Data Analysis in Clinical Decision-Making: The Diagnosis of Acute Coronary Occlusion (Letter) 2(4):480-489
Bijjani, R. and Das, P. An M-ary Neural Network Model (Letter) 2(4):536-551
Brady, M. J. Guaranteed Learning Algorithm for Network with Units Having Periodic Threshold Output Function (Note) 2(4):405-408
Buhmann, J. - See Wang, D.
Chernjavsky, A. and Moody, J. Spontaneous Development of Modularity in Simple Cortical Models (Letter) 2(3):334-354
Clark, L. T. - See Rao, A.
Collins, D. R. - See Katz, A. J.
D'Argenio, D. Z. - See Shadmehr, R.
Das, P. - See Bijjani, R.
Dayan, P. - See Willshaw, D.
Denker, J. S. - See Schwartz, D. B.
Devos, M. and Orban, G. A. Modeling Orientation Discrimination at Multiple Reference Orientations with a Neural Network (Letter) 2(2):152-161
Dicke, P. - See Eckhorn, R.
Douglas, R. J. and Martin, K. A. C. Control of Neuronal Output by Inhibition at the Axon Initial Segment (Letter) 2(3):283-292
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. Feature Linking via Synchronization among Distributed Assemblies: Simulations of Results from Cat Visual Cortex (Letter) 2(3):293-307
Edwards, M. - See Reggia, J. A.
Fang, Y. - See Lockery, S. R.
Fang, Y. and Sejnowski, T. J. Faster Learning for Dynamical Recurrent Backpropagation (Note) 2(3):270-273
Frean, M. The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks (Letter) 2(2):198-209
Gately, M. T. - See Katz, A. J.
Gelenbe, E. Stability of the Random Neural Network Model (Letter) 2(2):239-247
Georgiopoulos, M. - See Heileman, G. L.
Georgiopoulos, M., Heileman, G. L., and Huang, J. Convergence Properties of Learning in ART1 (Letter) 2(4):502-509
554
Grajski, K. A. and Merzenich, M. M. Hebb-Type Dynamics is Sufficient to Account for the Inverse Magnification Rule in Cortical Somatotopy (Letter)
2(1):71-84
Grondin, R. O. - See Rao, A.
Grzywacz, N. M. - See Vaina, L. M.
Harper, J. S. - See Hollis, P. W.
Hartman, E. - See Peterson, C.
Hartman, E., Keeler, J. D., and Kowalski, J. M. Layered Neural Networks with Gaussian Hidden Units as Universal Approximations (Letter)
2(2):210-215
Heeger, D. J. and Jepson, A. Visual Perception of Three-Dimensional Motion (Letter)
2(2):129-137
Heileman, G. L. - See Georgiopoulos, M.
Heileman, G. L., Papadourakis, G. M., and Georgiopoulos, M. A Neural Net Associative Memory for Real-Time Applications (Letter)
2(1):107-115
Hinton, G. E. and Nowlan, S. J. The Bootstrap Widrow-Hoff Rule as a Cluster-Formation Algorithm (Letter)
2(3):355-362
Hollis, P. W., Harper, J. S., and Paulos, J. J. The Effects of Precision Constraints in a Backpropagation Learning Network (Letter)
2(3):363-373
Huang, J. - See Georgiopoulos, M.
Hummel, R. A. - See Zucker, S. W.
Irino, T. and Kawahara, H. A Method for Designing Neural Networks Using Nonlinear Multivariate Analysis: Application to Speaker-Independent Vowel Recognition (Letter)
2(3):386-397
Iverson, L. - See Zucker, S. W.
Izui, Y. and Pentland, A. Analysis of Neural Networks with Redundancy (Letter)
2(2):226-238
Jepson, A. - See Heeger, D. J.
Ji, C., Snapp, R. R., and Psaltis, D. Generalizing Smoothness Constraints from Discrete Samples (Letter)
2(2):188-197
Katz, A. J., Gately, M. T., and Collins, D. R. Robust Classifiers without Robust Features (Letter)
2(4):472-479
Kawahara, H. - See Irino, T.
Keeler, J. D. - See Hartman, E.
Keeler, J. D. - See Peterson, C.
Kowalski, J. M. - See Hartman, E.
Kruglyak, L. How to Solve the N Bit Encoder Problem with Just Two Hidden Units (Note)
2(4):399-401
LeMay, M. - See Vaina, L. M.
Lockery, S. R., Fang, Y., and Sejnowski, T. J. A Dynamic Neural Network Model of Sensorimotor Transformations in the Leech (Letter)
2(3):274-282
MacKay, D. J. C. and Miller, K. D. Analysis of Linsker’s Simulations of Hebbian Rules (Letter)
2(2):173-187
Martin, K. A. C. - See Douglas, R. J.
Meir, R. - See Baldi, P.
Menon, V. - See Tang, D. S.
Merzenich, M. M. - See Grajski, K. A.
Miller, K. D. - See MacKay, D. J. C.
Miller, K. D. Derivation of Linear Hebbian Equations from a Nonlinear Hebbian Model of Synaptic Plasticity (Letter)
2(3):321-333
Moody, J. - See Chernjavsky, A.
Mozer, M. C. and Bachrach, J. Discovering the Structure of a Reactive Environment by Exploration (Letter)
2(4):447-457
Nowlan, S. J. - See Hinton, G. E.
Obradovic, Z. and Yan, P. Small Depth Polynomial Size Neural Networks (Note)
2(4):402-404
Orban, G. A. - See Devos, M.
Orfanidis, S. J. Gram-Schmidt Neural Nets (Letter)
2(1):116-126
Papadourakis, G. M. - See Heileman, G. L.
Paulos, J. J. - See Hollis, P. W.
Peng, J. - See Williams, R. J.
Pentland, A. - See Izui, Y.
Peterson, C. Parallel Distributed Approaches to Combinatorial Optimization: Benchmark Studies on Traveling Salesman Problem (Review)
2(3):261-269
Peterson, C., Redfield, S., Keeler, J. D., and Hartman, E. An Optoelectronic Architecture for Multilayer Learning in a Single Photorefractive Crystal (Letter)
2(1):25-34
Psaltis, D. - See Ji, C.
Rao, A., Walker, M. R., Clark, L. T., Akers, L. A., and Grondin, R. O. VLSI Implementation of Neural Classifiers (Letter)
2(1):35-43
Redfield, S. - See Peterson, C.
Redlich, A. N. - See Atick, J. J.
Reggia, J. A. and Edwards, M. Phase Transitions in Connectionist Models Having Rapidly Varying Connection Strengths (Letter)
2(4):523-535
Reitboeck, H. J. - See Eckhorn, R.
Samalam, V. K. - See Schwartz, D. B.
Saund, E. Distributed Symbolic Representation of Visual Shape (Letter)
2(2):138-151
Schwartz, D. B., Solla, S. A., Samalam, V. K., and Denker, J. S. Exhaustive Learning (Letter)
2(3):374-385
Shadmehr, R. Learning Virtual Equilibrium Trajectories for Control of a Robot Arm (Letter)
2(4):436-446
Shadmehr, R. and D'Argenio, D. A Neural Network for Nonlinear Bayesian Estimation in Drug Therapy (Letter)
2(2):216-225
Sejnowski, T. J. - See Fang, Y.
Sejnowski, T. J. - See Lockery, S. R.
Snapp, R. R. - See Ji, C.
Solla, S. A. - See Schwartz, D. B.
Tang, D. S. and Menon, V. Temporal Differentiation and Violation of Time-Reversal Invariance in Neurocomputation of Visual Information (Letter)
2(2):162-172
Vaina, L. M., Grzywacz, N. M., and LeMay, M. Structure from Motion with Impaired Local-Speed and Global Motion-Field Computations (Letter)
2(4):420-435
von der Malsburg, C. - See Wang, D.
Walker, M. R. - See Rao, A.
Wang, D., Buhmann, J., and von der Malsburg, C. Pattern Segmentation in Associative Memory (Letter)
2(1):94-106
Whitehead, S. D. and Ballard, D. H. Active Perception and Reinforcement Learning (Letter)
2(4):409-419
Williams, R. J. and Peng, J. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories (Letter)
2(4):490-501
Willshaw, D. and Dayan, P. Optimal Plasticity from Matrix Memories: What Goes Up Must Come Down (Letter)
2(1):85-93
Yan, P. - See Obradovic, Z.
Yuille, A. L. Generalized Deformable Models, Statistical Physics, and Matching Problems (Review)
2(1):1-24
Zucker, S. W., Iverson, L., and Hummel, R. A. Coherent Compound Motion: Corners and Nonrigid Configurations (Letter)
2(1):44-57