Communicated by Christopher Bishop
Flat Minima

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany
Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland

Neural Computation 9, 1–42 (1997)
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a "flat" minimum of the error function: a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to "simple" networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a "good" weight prior. Instead we have a prior over input-output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity. Automatically, it effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and "optimal brain surgeon/optimal brain damage."

1 Basic Ideas and Outline

Our algorithm tries to find a large region in weight space with the property that each weight vector from that region leads to similar small error. Such a region is called a flat minimum (Hochreiter and Schmidhuber 1995). To get an intuitive feeling for why a flat minimum is interesting, consider this: a sharp minimum (see Fig. 2) corresponds to weights that have to be specified with high precision. A flat minimum (see Fig. 1) corresponds to weights, many of which can be given with low precision. In the terminology of the theory of minimum description (message) length (MML, Wallace and Boulton 1968; MDL, Rissanen 1978), fewer bits of information are required to describe a flat minimum (corresponding to a "simple" or low-complexity network). The MDL principle suggests that low network complexity corresponds to high generalization performance. Similarly, the standard Bayesian
Figure 1: Example of a flat minimum.
Figure 2: Example of a sharp minimum.
view favors “fat” maxima of the posterior weight distribution (maxima with a lot of probability mass; see, e.g., Buntine and Weigend 1991). We will see that flat minima are fat maxima. Unlike, e.g., Hinton and van Camp’s method (1993), our algorithm does not depend on the choice of a “good” weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second-order method (Pearlmutter 1994; Møller 1993), we obtain conventional backpropagation’s order of computational complexity. Automatically, the method effectively reduces numbers of units, weights, and
input lines, as well as output sensitivity with respect to remaining weights and units. Unlike simple weight decay, our method automatically treats and prunes units and weights in different layers in different reasonable ways.

The article is organized as follows:

• Section 2 formally introduces basic concepts, such as error measures and flat minima.
• Section 3 describes the novel algorithm, flat minimum search (FMS).
• Section 4 formally derives the algorithm.
• Section 5 reports experimental generalization results with feedforward and recurrent networks. For instance, in an application to stock market prediction, flat minimum search outperforms the following widely used competitors: conventional backpropagation, weight decay, and "optimal brain surgeon/optimal brain damage."
• Section 6 mentions relations to previous work.
• Section 7 mentions limitations of the algorithm and outlines future work.
• The Appendix presents a detailed theoretical justification of our approach. Using a variant of the Gibbs algorithm, Section A.1 defines generalization, underfitting, and overfitting error in a novel way. By defining an appropriate prior over input-output functions, we postulate that the most probable network is a flat one. Section A.2 formally justifies the error function minimized by our algorithm. Section A.3 shows how to compute derivatives required by the algorithm.

2 Task, Architecture, and Boxes

2.1 Generalization Task. The task is to approximate an unknown function f ⊂ X × Y mapping a finite set of possible inputs X ⊂ R^N to a finite set of possible outputs Y ⊂ R^K. A data set D is obtained from f (see Section A.1). All training information is given by a finite set D0 ⊂ D. D0 is called the training set. The pth element of D0 is the input-target pair (x_p, y_p).

2.2 Architecture and Net Functions. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has N input units, K output units, L weights, and differentiable activation functions. It maps input vectors x ∈ R^N to output vectors o(w, x) ∈ R^K, where w is the L-dimensional weight vector, and the weight on the connection from unit j to unit i is denoted w_ij. The net function
induced by w is denoted net(w): for x ∈ R^N, net(w)(x) = o(w, x) = (o^1(w, x), o^2(w, x), ..., o^K(w, x)), where o^i(w, x) denotes the ith component of o(w, x), corresponding to output unit i.

2.3 Training Error. We use squared error

    E(net(w), D0) := Σ_{(x_p, y_p) ∈ D0} ||y_p − o(w, x_p)||²,

where || · || denotes the Euclidean norm.

2.4 Tolerable Error. To define a region in weight space with the property that each weight vector from that region leads to small error and similar output, we introduce the tolerable error Etol, a positive constant (see Section A.1 for a formal definition of Etol). "Small" error is defined as being smaller than Etol. E(net(w), D0) > Etol implies underfitting.

2.5 Boxes. Each weight w satisfying E(net(w), D0) ≤ Etol defines an acceptable minimum (compare M(D0) in Section A.1). We are interested in a large region of connected acceptable minima, where each weight w within this region leads to almost identical net functions net(w). Such a region is called a flat minimum. We will see that flat minima correspond to low expected generalization error. To simplify the algorithm for finding a large connected region (see below), we do not consider maximal connected regions but focus on so-called boxes within regions: for each acceptable minimum w, its box M_w in weight space is an L-dimensional hypercuboid with center w. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in the direction of the axis corresponding to weight w_ij is denoted by Δw_ij(X). The Δw_ij(X) are the maximal (positive) values such that for all L-dimensional vectors κ whose components κ_ij are restricted by |κ_ij| ≤ Δw_ij(X), we have E(net(w), net(w + κ), X) ≤ ε, where

    E(net(w), net(w + κ), X) = Σ_{x ∈ X} ||o(w, x) − o(w + κ, x)||²,

and ε is a small positive constant defining tolerable output changes (see also equation 3.1). Note that Δw_ij(X) depends on ε. Since our algorithm does not use ε explicitly, however, it is notationally suppressed. Δw_ij(X) gives the precision of w_ij. M_w's box volume is defined by V(Δw(X)) := 2^L Π_{i,j} Δw_ij(X), where Δw(X) denotes the vector with components Δw_ij(X). Our goal is to find large boxes within flat minima.

3 The Algorithm

Let X0 = {x_p | (x_p, y_p) ∈ D0} denote the inputs of the training set. We approximate Δw(X) by Δw(X0), where Δw(X0) is defined like Δw(X) in Section 2.5 (replacing X by X0). For simplicity, we will abbreviate Δw(X0) as Δw. Starting with a random initial weight vector, flat minimum search (FMS) tries to find a w that not only has low E(net(w), D0) but also defines a box M_w with maximal box volume V(Δw) and, consequently, minimal

    B̃(w, X0) := −log( (1/2^L) V(Δw) ) = Σ_{i,j} −log Δw_ij.

Note the relationship to
MDL: B̃ is the number of bits required to describe the weights, whereas the number of bits needed to describe the y_p, given w (with (x_p, y_p) ∈ D0), can be bounded by fixing Etol (see Section A.1). In the next section we derive the following algorithm. We use gradient descent to minimize

    E(w, D0) = E(net(w), D0) + λB(w, X0),   where   B(w, X0) = Σ_{x_p ∈ X0} B(w, x_p),   and
    B(w, x_p) = −L log ε + (1/2) Σ_{i,j} log Σ_k (∂o^k(w, x_p)/∂w_ij)²
                + L log Σ_k ( Σ_{i,j} |∂o^k(w, x_p)/∂w_ij| / sqrt( Σ_k (∂o^k(w, x_p)/∂w_ij)² ) )².   (3.1)
Here o^k(w, x_p) is the activation of the kth output unit (given weight vector w and input x_p), ε is a constant, and λ is the regularization constant (or hyperparameter) that controls the trade-off between regularization and training error (see Section A.1). To minimize B(w, X0), for each x_p ∈ X0 we have to compute

    ∂B(w, x_p)/∂w_uv = Σ_{k,i,j} [ ∂B(w, x_p) / ∂(∂o^k(w, x_p)/∂w_ij) ] · [ ∂²o^k(w, x_p) / (∂w_ij ∂w_uv) ]   for all u, v.   (3.2)

It can be shown that by using Pearlmutter's and Møller's efficient second-order method, the gradient of B(w, x_p) can be computed in O(L) time. Therefore, our algorithm has the same order of computational complexity as standard backpropagation.
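For concreteness, the following sketch evaluates B(w, x_p) of equation 3.1 directly from the Jacobian of the outputs with respect to the weights. The random Jacobian, the toy dimensions, and all names are our own illustrative assumptions; in the actual algorithm the derivatives ∂o^k(w, x_p)/∂w_ij would be delivered by backpropagation.

```python
import numpy as np

def fms_regularizer(J, eps=0.01):
    """Compute B(w, x_p) of equation 3.1 from an output Jacobian.

    J   -- array of shape (K, L): J[k, l] ~ do^k(w, x_p)/dw_l, with the
           L weights flattened into a single index l.
    eps -- the constant epsilon (its term is constant in w, so it does
           not affect the gradient used for learning).
    """
    K, L = J.shape
    col_norm_sq = (J ** 2).sum(axis=0)        # sum_k (do^k/dw_l)^2, shape (L,)
    term1 = -L * np.log(eps)
    term2 = 0.5 * np.log(col_norm_sq).sum()   # first sum of eq. 3.1
    # inner[k] = sum_l |do^k/dw_l| / sqrt(sum_k (do^k/dw_l)^2)
    inner = (np.abs(J) / np.sqrt(col_norm_sq)).sum(axis=1)
    term3 = L * np.log((inner ** 2).sum())    # second sum of eq. 3.1
    return term1 + term2 + term3

# Toy stand-in for a backpropagated Jacobian (K = 3 outputs, L = 10 weights).
rng = np.random.default_rng(0)
print(fms_regularizer(rng.normal(size=(3, 10))))
```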
4 Derivation of the Algorithm

4.1 Outline. We are interested in weights representing nets with tolerable error but flat outputs (see Sections 2 and A.1). To find nets with flat outputs, two conditions will be defined to specify B(w, x_p) for x_p ∈ X0 and, as a consequence, B(w, X0) (see Section 3). The first condition ensures flatness. The second condition enforces equal flatness in all weight space directions, to obtain low variance of the net functions induced by weights within a box. The second condition will be justified using an MDL-based argument. In both cases, linear approximations will be made (to be justified in Section A.2).

4.2 Formal Details. We are interested in weights causing tolerable error (see "acceptable minima" in Section 2) that can be perturbed without
causing significant output changes, thus indicating the presence of many neighboring weights leading to the same net function. By searching for the boxes from Section 2, we are actually searching for low-error weights whose perturbation does not significantly change the net function. In what follows we treat the input x_p as fixed. For convenience, we suppress x_p, abbreviating o^k(w, x_p) by o^k(w). Perturbing the weights w by δw (with components δw_ij), we obtain

    ED(w, δw) := Σ_k (o^k(w + δw) − o^k(w))²,

where o^k(w) expresses o^k's dependence on w (in what follows, however, w often will be suppressed for convenience; we abbreviate o^k(w) by o^k). Linear approximation (justified in Section A.2) gives us flatness condition 1:

    ED(w, δw) ≈ ED_l(δw) := Σ_k ( Σ_{i,j} (∂o^k/∂w_ij) δw_ij )² ≤ ED_{l,max}(δw) := Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| |δw_ij| )² ≤ ε,   (4.1)
where ε > 0 defines tolerable output changes within a box and is small enough to allow for linear approximation (it does not appear in B(w, x_p)'s and B(w, D0)'s gradients; see Section 3). ED_l is ED's linear approximation, and ED_{l,max} is max{ED_l(w, δv) | ∀ i, j: δv_ij = ±δw_ij}. Flatness condition 1 is a "robustness condition" (or "fault tolerance condition" or "perturbation tolerance condition"; see, e.g., Minai and Williams 1994; Murray and Edwards 1993; Neti et al. 1992; Matsuoka 1992; Bishop 1993; Kerlirzin and Vallet 1993; Carter et al. 1990). Many boxes M_w satisfy flatness condition 1. To select a particular, very flat M_w, the following flatness condition 2 uses up degrees of freedom left by inequality 4.1:
    ∀ i, j, u, v:   (δw_ij)² Σ_k (∂o^k/∂w_ij)² = (δw_uv)² Σ_k (∂o^k/∂w_uv)².   (4.2)
Flatness condition 2 enforces equal "directed errors"

    ED_ij(w, δw_ij) = Σ_k (o^k(w_ij + δw_ij) − o^k(w_ij))² ≈ Σ_k ( (∂o^k/∂w_ij) δw_ij )²,
where o^k(w_ij) has the obvious meaning, and δw_ij is the (i, j)th component of δw. Linear approximation is justified by the choice of ε in inequality 4.1. As will be seen in the MDL justification presented below, flatness condition 2 favors the box that minimizes the mean perturbation error within the box. This corresponds to minimizing the variance of the net functions induced by weights within the box (recall that ED(w, δw) is quadratic).
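As a quick sanity check on the linear approximation behind these conditions, the sketch below perturbs the weights of a tiny two-layer net and compares ED(w, δw) with its linearization ED_l(δw), built here from a finite-difference Jacobian. The network shape, the perturbation scale, and all names are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))  # tiny 2-3-2 net
x = rng.normal(size=2)

def output(w_flat):
    """Net output o(w, x) for a flattened weight vector w."""
    A = w_flat[:6].reshape(3, 2)
    B = w_flat[6:].reshape(2, 3)
    return np.tanh(B @ np.tanh(A @ x))

w = np.concatenate([W1.ravel(), W2.ravel()])
L = w.size

# Finite-difference Jacobian J[k, l] ~ do^k/dw_l.
h = 1e-6
J = np.stack([(output(w + h * e) - output(w - h * e)) / (2 * h)
              for e in np.eye(L)], axis=1)

dw = 1e-3 * rng.normal(size=L)                  # small perturbation
ED = ((output(w + dw) - output(w)) ** 2).sum()  # exact output change
EDl = ((J @ dw) ** 2).sum()                     # linear approximation (eq. 4.1)
print(ED, EDl)  # nearly equal for small perturbations
```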
4.3 Deriving the Algorithm from Flatness Conditions 1 and 2. We first solve equation 4.2 for |δw_ij|:

    |δw_ij| = |δw_uv| sqrt( Σ_k (∂o^k/∂w_uv)² / Σ_k (∂o^k/∂w_ij)² )   (fixing u, v for all i, j).

Then we insert the |δw_ij| (with fixed u, v) into inequality 4.1 (replacing the second "≤" in 4.1 with "=", because we search for the box with maximal volume). This gives us an equation for the |δw_uv| (which depend on w, but this is notationally suppressed):

    |δw_uv| = sqrt(ε) / ( sqrt( Σ_k (∂o^k/∂w_uv)² ) · sqrt( Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| / sqrt( Σ_k (∂o^k/∂w_ij)² ) )² ) ).   (4.3)

The |δw_ij| (u, v replaced by i, j) approximate the Δw_ij from Section 2. The box M_w is approximated by AM_w, the box with center w and edge lengths 2δw_ij. M_w's volume V(Δw) is approximated by AM_w's box volume Ṽ(δw) := 2^L Π_{i,j} |δw_ij|. Thus, B̃(w, x_p) (see Section 3) can be approximated by

    B(w, x_p) := −log( (1/2^L) Ṽ(δw) ) = Σ_{i,j} −log |δw_ij|.

This immediately leads to the algorithm given by equation 3.1.
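Equation 4.3 in sketch form, again assuming the output Jacobian is available: the resulting |δw| are the approximate box edge half-lengths, and (as described in Section 5.6) weights whose δw is orders of magnitude larger than the rest are candidates for pruning. The separation threshold used below is an illustrative assumption.

```python
import numpy as np

def box_edges(J, eps=0.01):
    """Approximate box half-lengths |dw_l| per equation 4.3.

    J -- array of shape (K, L) with J[k, l] ~ do^k/dw_l.
    """
    col_norm = np.sqrt((J ** 2).sum(axis=0))   # sqrt(sum_k (do^k/dw_l)^2)
    inner = (np.abs(J) / col_norm).sum(axis=1) # one value per output k
    scale = np.sqrt((inner ** 2).sum())        # common factor of eq. 4.3
    return np.sqrt(eps) / (col_norm * scale)   # shape (L,)

rng = np.random.default_rng(2)
J = rng.normal(size=(3, 10))
J[:, 0] *= 1e-4                       # weight 0 barely affects the outputs ...
dw = box_edges(J)
print(dw / dw.min())                  # ... so its delta is orders of magnitude larger
prunable = dw > 100 * np.median(dw)   # illustrative separation threshold
print(np.flatnonzero(prunable))
```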
4.4 How Can the Above Approximations Be Justified? The learning process itself enforces their validity (see Section A.2). Initially, the conditions above are valid only in a very small environment of an "initial" acceptable minimum. But during the search for new acceptable minima with more associated box volume, the corresponding environments are enlarged. Section A.2 will prove this for feedforward nets (experiments indicate that it holds for recurrent nets as well).

4.5 Comments. Flatness condition 2 influences the algorithm as follows: (1) the algorithm prefers to increase the δw_ij of weights whose current contributions are not important for computing the target output; (2) the algorithm enforces equal sensitivity of all output units with respect to weights of connections to hidden units. Hence, output units tend to share hidden units; that is, different hidden units tend to contribute equally to the computation of the target. The contributions of a particular hidden unit to different output unit activations tend to be equal, too.

Flatness condition 2 is essential. Flatness condition 1 by itself corresponds to nothing more than first-order derivative reduction (ordinary sensitivity
reduction). However, what we really want is to minimize the variance of the net functions induced by weights near the actual weight vector. Automatically, the algorithm treats units and weights in different layers differently, and takes the nature of the activation functions into account.

4.6 MDL Justification of Flatness Condition 2. Let us assume a sender wants to send a description of the function induced by w to a receiver who knows the inputs x_p but not the targets y_p, where (x_p, y_p) ∈ D0. The MDL principle suggests that the sender wants to minimize the expected description length of the net function. Let ED_mean(w, X0) denote the mean value of ED on the box. Expected description length is approximated by µED_mean(w, X0) + B̃(w, X0) + c, where c, µ are positive constants. One way of seeing this is to apply Hinton and van Camp's "bits back" argument to a uniform weight prior (ED_mean corresponds to the output variance). However, we prefer to use a different argument: we encode each weight w_ij of the box center w by a bitstring according to the following procedure (Δw_ij is given):

(0) Define a variable interval I_ij ⊂ R.
(1) Make I_ij equal to the interval constraining possible weight values.
(2) While I_ij ⊄ [w_ij − Δw_ij, w_ij + Δw_ij]: divide I_ij into two equally sized disjoint intervals I1 and I2. If w_ij ∈ I1, then I_ij ← I1; write '1'. If w_ij ∈ I2, then I_ij ← I2; write '0'.

The final set {I_ij} corresponds to a bit box within our box. This bit box contains M_w's center w and is described by a bitstring of length B̃(w, X0) + c, where the constant c is independent of the box M_w.
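The bisection procedure above is easy to state in code. The sketch below encodes one weight w_ij to precision Δw_ij, writing one bit per halving; the initial weight range matches the [−30, 30] constraint of Section 5.6, the choice of which half counts as I1 is ours, and the function name is hypothetical.

```python
def encode_weight(w, dw, lo=-30.0, hi=30.0):
    """Bisection code for one box-center weight w with half-edge dw.

    Halve the interval [lo, hi] until it fits inside [w - dw, w + dw],
    emitting '1' when w falls in the lower half (our I1) and '0' for
    the upper half (our I2).
    """
    bits = []
    while not (lo >= w - dw and hi <= w + dw):
        mid = (lo + hi) / 2.0
        if w < mid:           # w lies in the lower half-interval I1
            hi = mid
            bits.append('1')
        else:                 # w lies in the upper half-interval I2
            lo = mid
            bits.append('0')
    return ''.join(bits)

# A flat weight (large dw) needs far fewer bits than a sharp one:
print(len(encode_weight(3.7, 1.0)))    # low precision, short code
print(len(encode_weight(3.7, 0.001)))  # high precision, long code
```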
From ED(w, w_b − w) (w_b is the center of the bit box) and the bitstring describing the bit box, the receiver can compute w by selecting an initialization weight vector within the bit box and using gradient descent to decrease B(w_a, X0) until ED(w_a, w_b − w_a) = ED(w, w_b − w), where w_a in the bit box denotes the receiver's current approximation of w (w_a is constantly updated by the receiver). This is like "FMS without targets." Recall that the receiver knows the inputs x_p. Since w corresponds to the weight vector with the highest degree of local flatness within the bit box, the receiver will find the correct w. ED(w, w_b − w) is described by a gaussian distribution with mean zero. Hence, the description length of ED(w, w_b − w) is µED(w, w_b − w) (Shannon 1948). w_b, the center of the bit box, cannot be known before training. However, we do know the expected description length of the net function, which is µED_mean + B̃(w, X0) + c (c is a constant independent of w). Let us approximate ED_mean:

    ED_{l,mean}(w, δw) := (1/V(δw)) ∫_{AM_w} ED_l(w, δv) dδv
        = (1/V(δw)) Σ_{i,j} ( (2^L / 3) (δw_ij)³ Σ_k (∂o^k/∂w_ij)² Π_{(u,v) ≠ (i,j)} δw_uv )
        = (1/3) Σ_{i,j} (δw_ij)² Σ_k (∂o^k/∂w_ij)².

Among those w that lead to equal B(w, X0) (the negative logarithm of the box volume plus L log 2), we want to find those with minimal description length of the function induced by w. Using Lagrange multipliers (viewing the δw_ij as variables), it can be shown that ED_{l,mean} is minimal under the condition B(w, X0) = constant iff flatness condition 2 holds. To conclude: with given box volume, we need flatness condition 2 to minimize the expected description length of the function induced by w.

5 Experimental Results

5.1 Experiment 1: Noisy Classification.

5.1.1 Task. The first task is taken from Pearlmutter and Rosenfeld (1991). The task is to decide whether the x-coordinate of a point in two-dimensional space exceeds zero (class 1) or does not (class 2). Noisy training and test examples are generated as follows: data points are obtained from a gaussian with zero mean and standard deviation 1.0, bounded in the interval [−3.0, 3.0]. The data points are misclassified with probability 0.05. Final input data are obtained by adding a zero-mean gaussian with standard deviation 0.15 to the data points. In a test with 2 million data points, it was found that the procedure above leads to 9.27% misclassified data. No method will misclassify less than 9.27%, due to the inherent noise in the data (including the test data). The training set is based on 200 fixed data points (see Fig. 3). The test set is based on 120,000 data points.

5.1.2 Results. Ten conventional backprop (BP) nets were tested against ten equally initialized networks trained by flat minimum search (FMS). After 1000 epochs, the weights of our nets essentially stopped changing
Figure 3: The 200 input examples of the training set. Crosses represent data points from class 1. Squares represent data points from class 2.
(automatic "early stopping"), while BP kept changing weights to learn the outliers in the data set and overfit. In the end, our approach left a single hidden unit h with a maximal weight of 30.0 or −30.0 from the x-axis input. Unlike with BP, the other hidden units were effectively pruned away (outputs near zero). So was the y-axis input (zero weight to h). It can be shown that this corresponds to an "optimal" net with minimal numbers of units and weights. Table 1 illustrates the superior performance of our approach.

Parameters: Learning rate: 0.1. Architecture: (2-20-1). Number of training epochs: 400,000. With FMS: Etol = 0.0001. See Section 5.6 for parameters common to all experiments.
Table 1: Ten Comparisons of Conventional Backpropagation (BP) and Flat Minimum Search (FMS).

Trial   BP MSE   BP dto   FMS MSE   FMS dto
 1      0.220    1.35     0.193     0.00
 2      0.223    1.16     0.189     0.09
 3      0.222    1.37     0.186     0.13
 4      0.213    1.18     0.181     0.01
 5      0.222    1.24     0.195     0.25
 6      0.219    1.24     0.187     0.04
 7      0.215    1.14     0.187     0.07
 8      0.214    1.10     0.185     0.01
 9      0.218    1.21     0.190     0.09
10      0.214    1.21     0.188     0.07

Note: The MSE columns show mean squared error on the test set. The dto columns show the difference between the percentage of misclassifications and the optimal percentage (9.27). FMS clearly outperforms backpropagation.
5.2 Experiment 2: Recurrent Nets.

5.2.1 Time-Varying Inputs. The method works for continually running fully recurrent nets as well. At every time step, a recurrent net with sigmoid activations in [0, 1] sees an input vector from a stream of randomly chosen input vectors from the set {(0, 0), (0, 1), (1, 0), (1, 1)}. The task is to switch on the first output unit whenever an input (1, 0) had occurred two time steps ago and to switch on the second output unit without delay in response to any input (0, 1). The task can be solved by a single hidden unit.

5.2.2 Non-Weight-Decay-Like Results. With conventional recurrent net algorithms, after training, both hidden units were used to store the input vector. This was not so with our new approach. We trained 20 networks. All of them learned perfect solutions. As with weight decay, most weights to the output decayed to zero. But unlike with weight decay, strong inhibitory connections (−30.0) switched off one of the hidden units, effectively pruning it away.

Parameters: Learning rate: 0.1. Architecture: (2-2-2). Number of training examples: 1,500. Etol = 0.0001. See Section 5.6 for parameters common to all experiments.
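For readers who want to reproduce the task, a minimal generator for the input stream and the two target units might look as follows; encoding the targets as 0/1 activations is our assumption, since the text only states when each unit should switch on.

```python
import numpy as np

def make_stream(T, seed=0):
    """Random input stream and targets for the recurrent-net task.

    Target unit 0 fires iff the input two steps ago was (1, 0);
    target unit 1 fires immediately iff the current input is (0, 1).
    """
    rng = np.random.default_rng(seed)
    vocab = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
    x = vocab[rng.integers(0, 4, size=T)]             # shape (T, 2)
    y = np.zeros((T, 2))
    for t in range(T):
        if t >= 2 and (x[t - 2] == (1, 0)).all():
            y[t, 0] = 1.0
        if (x[t] == (0, 1)).all():
            y[t, 1] = 1.0
    return x, y

x, y = make_stream(10)
print(np.hstack([x, y]))
```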
5.3 Experiment 3: Stock Market Prediction 1.

5.3.1 Task. We predict the DAX¹ (the German stock market index) using fundamental indicators. Following Rehkugler and Poddig (1990), the net sees the following indicators: (1) German interest rate (Umlaufsrendite), (2) industrial production divided by money supply, and (3) business sentiments (IFO Geschäftsklimaindex). The input (scaled in the interval [−3.4, 3.4]) is the difference between data from the current quarter and last year's corresponding quarter. The goal is to predict the sign of next year's corresponding DAX difference.

5.3.2 Details. The training set consists of 24 data vectors from 1966 to 1972. Positive DAX tendency is mapped to target 0.8; otherwise the target is −0.8. The test set consists of 68 data vectors from 1973 to 1990. FMS is compared against (1) conventional backpropagation with 8 hidden units (BP8), (2) BP with 4 hidden units (BP4) (4 hidden units are chosen because pruning methods favor 4 hidden units, but 3 is not enough), (3) optimal brain surgeon (OBS; Hassibi and Stork 1993), with a few improvements (see Section 5.6), and (4) weight decay (WD) according to Weigend et al. (1991) (WD and OBS were chosen because they are well known and widely used).

5.3.3 Performance Measure. Since wrong predictions lead to loss of money, performance is measured as follows: the sum of incorrectly predicted DAX changes is subtracted from the sum of correctly predicted DAX changes, and the result is divided by the sum of absolute DAX changes.

5.3.4 Results. Table 2 shows the results. Our method outperforms the other methods. Note that MSE is not a reasonable performance measure for this task. For instance, although FMS typically makes more correct classifications than WD, FMS's MSE often exceeds WD's. This is because WD's wrong classifications tend to be close to 0, while FMS often prefers large weights yielding strong output activations, so FMS's few false classifications tend to contribute a lot to MSE.

Parameters: Learning rate: 0.01. Architecture: (3-8-1), except BP4 with (3-4-1). Number of training examples: 20 million.
¹ Raw DAX version according to Statistisches Bundesamt (Federal Office of Statistics). Other data are from the same source (except for business sentiment). Collected by Christian Puritscher, for a diploma thesis in industrial management at LMU, Munich.
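A sketch of the performance measure of Section 5.3.3, under our reading that a prediction counts as "correct" when its sign matches the actual DAX change; scaling by 100 to match the table entries is also our assumption.

```python
import numpy as np

def dax_performance(predicted, actual):
    """(sum |actual| over correct signs - sum over wrong signs) / sum |actual|."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    correct = np.sign(predicted) == np.sign(actual)
    gained = np.abs(actual)[correct].sum()
    lost = np.abs(actual)[~correct].sum()
    return 100.0 * (gained - lost) / np.abs(actual).sum()

# Perfect sign prediction scores 100; random guessing scores about 0.
print(dax_performance([0.3, -0.1, 0.2], [5.0, -2.0, -1.0]))  # (5+2-1)/8 -> 75.0
```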
Table 2: Comparisons of Conventional Backpropagation (BP4, BP8), Optimal Brain Surgeon (OBS), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed      Performance
Method   Train MSE   Test MSE   w    u    Max     Min     Mean
BP8      0.003       0.945      -    -    47.33   25.74   37.76
BP4      0.043       1.066      -    -    42.02   42.02   42.02
OBS      0.089       1.088      14   3    48.89   27.17   41.73
WD       0.096       1.102      22   4    44.47   36.47   43.49
FMS      0.040       1.162      24   4    47.74   39.70   43.62

Note: All nets except BP4 start out with eight hidden units. Each value is a mean of seven trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over seven trials. Note that test MSE is insignificant for performance evaluations (this is due to targets 0.8/−0.8, as opposed to the "real" DAX targets). Our method outperforms all other methods.
Method-specific parameters: FMS: Etol = 0.13; Δλ = 0.001. WD: like FMS, but w0 = 0.2. OBS: Etol = 0.015 (the same result was obtained with higher Etol values, e.g., 0.13). See Section 5.6 for parameters common to all experiments.

5.4 Experiment 4: Stock Market Prediction 2.

5.4.1 Task. We predict the DAX again, using the basic setup of the experiment in Section 5.3, with the following modifications:

• There are two additional inputs: dividend rate and foreign orders in manufacturing industry.
• Monthly predictions are made. The net input is the difference between the current month's data and last month's data. The goal is to predict the sign of next month's corresponding DAX difference.
• There are 228 training examples and 100 test examples.
• The target is the percentage of DAX change scaled in the interval [−1, 1] (outliers are ignored).
• Performance of WD and FMS is also tested on networks "spoiled" by conventional BP ("WDR" and "FMSR"; the "R" stands for retraining).
Table 3: Comparisons of Conventional Backpropagation (BP), Optimal Brain Surgeon (OBS), Weight Decay after Spoiling the Net with BP (WDR), Flat Minimum Search after Spoiling the Net with BP (FMSR), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed      Performance
Method   Train MSE   Test MSE   w    u    Max     Min     Mean
BP       0.181       0.535      -    -    57.33   20.69   41.61
OBS      0.219       0.502      15   1    50.78   32.20   40.43
WDR      0.180       0.538      0    0    62.54   13.64   41.17
FMSR     0.180       0.542      0    0    64.07   24.58   41.57
WD       0.235       0.452      17   3    54.04   32.03   40.75
FMS      0.240       0.472      19   3    54.11   31.12   44.40

Note: All nets start out with eight hidden units. Each value is a mean of ten trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over ten trials (note again that MSE is an irrelevant performance measure for this task). Flat minimum search outperforms all other methods.
5.4.2 Results. Table 3 shows the results. The average performance of our method exceeds those of weight decay, OBS, and conventional BP. Table 3 also shows the superior performance of our approach when it comes to retraining "spoiled" networks (note that OBS is a retraining method by nature). FMS led to the best improvements in generalization performance.

Parameters: Learning rate: 0.01. Architecture: (5-8-1). Number of training examples: 20,000,000.

Method-specific parameters: FMS: Etol = 0.235; Δλ = 0.0001; if Eaverage < Etol, then Δλ is set to 0.001. WD: like FMS, but w0 = 0.2. FMSR: like FMS, but Etol = 0.15; number of retraining examples: 5,000,000. WDR: like FMSR, but w0 = 0.2. OBS: Etol = 0.235. See Section 5.6 for parameters common to all experiments.
5.5 Experiment 5: Stock Market Prediction 3.

5.5.1 Task. This time, we predict the DAX using weekly technical (as opposed to fundamental) indicators. The data (DAX values and 35 technical indicators) were provided by Bayerische Vereinsbank.

5.5.2 Data Analysis. To analyze the data, we computed (1) the pairwise correlation coefficients of the 35 technical indicators and (2) the maximal pairwise correlation coefficients of all indicators and all linear combinations of two indicators. This analysis revealed that only four indicators are not highly correlated. For such reasons, our nets see only the eight most recent DAX changes and the following technical indicators: (a) the DAX value, (b) change of the 24-week relative strength index (RSI), the relation of increasing tendency to decreasing tendency, (c) 5-week statistic, and (d) MACD (smoothed difference of exponentially weighted 6-week and 24-week DAX).

5.5.3 Input Data. The final network input is obtained by scaling the values (a-d) and the eight most recent DAX changes in [−2, 2]. The training set consists of 320 data points (July 1985-August 1991). The targets are the actual DAX changes scaled in [−1, 1].

5.5.4 Comparison. The following methods are applied to the training set: (1) conventional BP, (2) optimal brain surgeon/optimal brain damage (OBS/OBD), (3) weight decay (WD) according to Weigend et al. (1991), and (4) flat minimum search (FMS). The resulting nets are evaluated on a test set consisting of 100 data points (August 1991-July 1993). Performance is measured as in Section 5.3.

5.5.5 Results. Table 4 shows the results. Again, our method outperforms the other methods.

Parameters: Learning rate: 0.01. Architecture: (12-9-1). Training time: 10 million examples.

Method-specific parameters: OBS/OBD: Etol = 0.34. FMS: Etol = 0.34; Δλ = 0.003; if Eaverage < Etol, then Δλ is set to 0.03. WD: like FMS, but w0 = 0.2. See Section 5.6 for parameters common to all experiments.
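As an illustration of indicator (d), a standard MACD-style computation on a weekly price series might look as follows. The exact smoothing constants behind the bank's data are unknown to us, so the spans below are conventional placeholder choices.

```python
import numpy as np

def ema(series, span):
    """Exponentially weighted moving average with smoothing 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty_like(series, dtype=float)
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = alpha * series[t] + (1 - alpha) * out[t - 1]
    return out

def macd(prices, fast=6, slow=24):
    """Smoothed difference of fast and slow exponentially weighted averages."""
    return ema(prices, fast) - ema(prices, slow)

weekly_dax = 100 + np.cumsum(np.random.default_rng(3).normal(size=200))
print(macd(weekly_dax)[-5:])
```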
Table 4: Comparisons of Conventional Backpropagation (BP), Optimal Brain Surgeon (OBS), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed       Performance
Method   Train MSE   Test MSE   w     u    Max     Min      Mean
BP       0.13        1.08       -     -    28.45   −16.70    8.08
OBS      0.38        0.912      55    1    27.37    −6.08   10.70
WD       0.51        0.334      110   8    26.84    −6.88   12.97
FMS      0.46        0.348      103   7    29.72    18.09   21.26

Note: All nets start out with nine hidden units. Each value is a mean of ten trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over ten trials (note again that MSE is an irrelevant performance measure for this task). Flat minimum search outperforms all other methods.
5.6 Details and Parameters. With the exception of the experiment in Section 5.2, all units are sigmoid in the range [−1.0, 1.0]. Weights are constrained to [−30, 30] and initialized in [−0.1, 0.1]. The latter ensures high first-order derivatives in the beginning of the learning phase. WD is set up to barely punish weights below w0 = 0.2. Eaverage is the average error on the training set, approximated using exponential decay: Eaverage ← γEaverage + (1 − γ)E(net(w), D0), where γ = 0.85.

5.6.1 FMS Details. To control B(w, D0)'s influence during learning, its gradient is normalized and multiplied by the length of E(net(w), D0)'s gradient (same for weight decay; see below). λ is computed as in Weigend et al. (1991) and initialized with 0. Absolute values of first-order derivatives are replaced by 10^−20 if below this value. We ought to judge a weight w_ij as being pruned if δw_ij (see equation 4.3) exceeds the length of the weight range. However, the unknown scaling factor ε (see inequality 4.1 and equation 4.3) is required to compute δw_ij. Therefore, we judge a weight w_ij as being pruned if, with arbitrary ε, δw_ij is much bigger than the corresponding δ's of the other weights (typically, there are clearly separable classes of weights with high and low δ's, which differ from each other by a factor ranging from 10² to 10⁵). If all weights to and from a particular unit are very close to zero, the unit is lost: due to tiny derivatives, the weights will never again increase significantly. Sometimes it is necessary to bring lost units back into the game. For this purpose, every n_init time steps (typically, n_init = 500,000), all weights w_ij with 0 ≤ w_ij < 0.01 are randomly reinitialized in [0.005, 0.01], all weights w_ij with 0 ≥ w_ij > −0.01 are randomly reinitialized in [−0.01, −0.005], and λ is set to 0.
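The gradient bookkeeping of Section 5.6.1 in sketch form: the regularizer gradient is rescaled to the length of the error gradient before the weighted sum is applied, and the running error average uses the stated exponential decay. The simple additive λ schedule is our illustrative reading of the Weigend et al. (1991) heuristic, not a verbatim reproduction.

```python
import numpy as np

def fms_update(w, grad_E, grad_B, lam, lr=0.01, eps=1e-20):
    """One FMS gradient step: E's gradient plus a normalized B gradient.

    grad_B is normalized and scaled to the length of grad_E, so that
    lambda alone controls the trade-off between the two terms.
    """
    scaled_B = grad_B * (np.linalg.norm(grad_E) / max(np.linalg.norm(grad_B), eps))
    return w - lr * (grad_E + lam * scaled_B)

def update_lambda(lam, E_average, E_tol, delta=0.001):
    """Illustrative lambda schedule: strengthen the regularizer while the
    running training error is below E_tol, weaken it otherwise."""
    return lam + delta if E_average < E_tol else max(lam - delta, 0.0)

E_average, gamma = 1.0, 0.85
E_average = gamma * E_average + (1 - gamma) * 0.4   # exponential decay update
w = fms_update(np.zeros(3), np.array([0.1, 0.2, 0.3]), np.ones(3), lam=0.5)
print(E_average, w)
```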
5.6.2 Weight Decay Details. We used Weigend et al.'s weight decay term:

    D(w) = Σ_{i,j} (w_ij²/w0²) / (1 + w_ij²/w0²).

As with FMS, D(w, w0)'s gradient was normalized and multiplied by the length of E(net(w), D0)'s gradient. λ was adjusted as with FMS. Lost units were brought back as with FMS.

5.6.3 Modifications of OBS. Typically most weights exceed 1.0 after training. Therefore, higher-order terms of δw in the Taylor expansion of the error function do not vanish, and OBS is not fully theoretically justified. Still, we used OBS to delete high weights, assuming that higher-order derivatives are small if second-order derivatives are. To obtain reasonable performance, we modified the original OBS procedure (notation following Hassibi and Stork 1993):

• To detect the weight that deserves deletion, we use both

    L_q = w_q² / (2 [H⁻¹]_qq)

(the original value used by Hassibi and Stork) and

    T_q := (∂E/∂w_q) w_q + (1/2) (∂²E/∂w_q²) w_q².

Here H denotes the Hessian and H⁻¹ its approximate inverse.
We delete the weight causing minimal training set error (after tentative deletion).

• As with OBD (LeCun et al. 1990), to prevent numerical errors due to small eigenvalues of H: if L_q < 0.00001 or T_q < 0.00001 or ||I − H⁻¹H|| > 10.0 (bad approximation of H⁻¹), we delete only the weight detected in the previous step; the other weights remain the same. Here || · || denotes the sum of the absolute values of all components of a matrix.

• If OBS's adjustment of the remaining weights leads to at least one absolute weight change exceeding 5.0, then δw is scaled such that the maximal absolute weight change is 5.0. This leads to better performance (also due to small eigenvalues).

• If Eaverage > Etol after weight deletion, then the net is retrained until either Eaverage < Etol or the number of training examples exceeds 800,000. Practical experience indicates that the choice of Etol barely influences the result.

• OBS is stopped if Eaverage > Etol after retraining. The most recent weight deletion is countermanded.
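A sketch of the two deletion criteria, given a weight vector, an error gradient, and an (approximate) inverse Hessian. The names and the surrounding plumbing are ours; in practice these quantities would come from the network's training procedure rather than from the random stand-ins below.

```python
import numpy as np

def obs_saliencies(w, grad, H_inv, H_diag):
    """Both weight-deletion criteria from the modified OBS procedure.

    L[q] = w_q^2 / (2 [H^-1]_qq)                   (Hassibi and Stork)
    T[q] = (dE/dw_q) w_q + 0.5 (d2E/dw_q2) w_q^2   (Taylor-based check)
    """
    L = w ** 2 / (2.0 * np.diag(H_inv))
    T = grad * w + 0.5 * H_diag * w ** 2
    return L, T

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)        # a well-conditioned stand-in Hessian
H_inv, H_diag = np.linalg.inv(H), np.diag(H)
w, grad = rng.normal(size=n), 0.01 * rng.normal(size=n)
L, T = obs_saliencies(w, grad, H_inv, H_diag)
print(np.argmin(L), np.argmin(T))  # candidate weights for deletion
```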
6 Relation to Previous Work

Most previous algorithms for finding low-complexity networks with high generalization capability are based on different prior assumptions. They can be broadly classified into two categories (see Schmidhuber 1994a for an exception):

1. Assumptions about the prior weight distribution. Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to "good" generalization (more details below). Weight decay (e.g., Hanson and Pratt 1989; Krogh and Hertz 1992) can be derived, for example, from gaussian or Laplace weight priors. Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights generated by gaussian mixtures is "better" a priori. MacKay's weight priors (1992b) are implicit in additional penalty terms, which embody the assumptions made. The problem with these approaches is that there may not be a "good" weight prior for all possible architectures and training sets. With FMS, however, we do not have to select a "good" weight prior but instead choose a prior over input-output functions. This automatically takes the net architecture and the training set into account.

2. Prior assumptions about how theoretical results on early stopping and network complexity carry over to practical applications. Such assumptions are implicit in methods based on validation sets (Mosteller and Tukey 1968; Stone 1974; Eubank 1988; Hastie and Tibshirani 1990), for example, generalized cross-validation (Craven and Wahba 1979; Golub et al. 1979), final prediction error (Akaike 1970), and generalized prediction error (Moody and Utans 1992; Moody 1992). See also Holden (1994), Wang et al. (1994), Amari and Murata (1993), and Vapnik's structural risk minimization (Guyon et al. 1992; Vapnik 1992).

6.1 Constructive Algorithms and Pruning Algorithms. Other architecture selection methods are less flexible in the sense that they can be used only before or after weight adjustments. Examples are sequential network construction (Fahlman and Lebiere 1990; Ash 1989; Moody 1989), input pruning (Moody 1992; Refenes et al. 1994), unit pruning (White 1989; Mozer and Smolensky 1989; Levin et al. 1994), and weight pruning, for example, optimal brain damage (LeCun et al. 1990) and optimal brain surgeon (Hassibi and Stork 1993).

6.2 Hinton and van Camp (1993). They minimize the sum of two terms: the first is conventional error plus variance; the other is the distance

    ∫ p(w | D0) log( p(w | D0) / p(w) ) dw

between posterior p(w | D0) and weight prior p(w). They
have to choose a "good" weight prior. But perhaps there is no "good" weight prior for all possible architectures and training sets. With FMS, however, we do not depend on a "good" weight prior; instead we have a prior over input-output functions, thus taking into account net architecture and training set. Furthermore, Hinton and van Camp have to compute variances of weights and unit activations, which (in general) cannot be done using linear approximation. Intuitively, their weight variances are related to our Δw_ij. Our approach, however, does justify linear approximation, as seen in Section A.2.

6.3 Wolpert (1994a). His (purely theoretical) analysis suggests an interesting, different additional error term (taking into account local flatness in all directions): the logarithm of the Jacobi determinant of the functional from weight space to the space of possible nets. This term is small if the net output (based on the current weight vector) is locally flat in weight space (if many neighboring weights lead to the same net function in the space of possible net functions). It is not clear, however, how to derive a practical algorithm (e.g., a pruning algorithm) from this.

6.4 Murray and Edwards (1993). They obtain additional error terms consisting of weight squares and second-order derivatives. Unlike our approach, theirs explicitly prefers weights near zero. In addition, their approach appears to require much more computation time (due to second-order derivatives in the error term).

7 Limitations, Final Remarks, and Future Research

7.1 How to Adjust λ. Given recent trends in neural computing (see, e.g., MacKay 1992a, 1992b), it may seem like a step backward that λ is adapted using an ad hoc heuristic from Weigend et al. (1991). However, for determining λ in MacKay's style, one would have to compute the Hessian of the cost function. Since our term B(w, X0) includes first-order derivatives, adjusting λ would require the computation of third-order derivatives. This is impracticable. Also, to optimize the regularizing parameter λ (see MacKay 1992b), we need to compute the integral ∫ d^L w exp(−λB(w, X0)), but it is not obvious how; the "quick and dirty version" (MacKay 1992a) cannot deal with the unknown constant ε in B(w, X0). Future work will investigate how to adjust λ without too much computational effort. In fact, as will be seen in Section A.1, the choices of λ and Etol are correlated. The optimal choice of Etol may indeed correspond to the optimal choice of λ.

7.2 Generalized Boxes. The boxes found by the current version of FMS are axis aligned. This may cause an underestimate of flat minimum volume. Although our experiments indicate that box search works very well, it
will be interesting to compare alternative approximations of flat minimum volumes.

7.3 Multiple Initializations. First, consider this FMS alternative: run conventional BP starting with several random initial guesses and pick the flattest minimum with the largest volume. This does not work: conventional BP changes the weights according to steepest descent; it runs away from flat ranges in weight space. Using an "FMS committee" (multiple runs with different initializations), however, would lead to a better approximation of the posterior. This is left for future work.

7.4 Notes on Generalization Error. If the prior distribution of targets p(f) (see Section A.1) is uniform (or if the distribution of prior distributions is uniform), no algorithm can obtain a lower expected generalization error than training-error-reducing algorithms (see, e.g., Wolpert 1994b). Typical target distributions in the real world are not uniform, however; the real world appears to favor problem solutions with low algorithmic complexity (see, e.g., Schmidhuber 1994a). MacKay (1992a) suggests searching for alternative priors if the generalization error indicates a "poor regularizer." He also points out that with a "good" approximation of the nonuniform prior, more probable posterior hypotheses do not necessarily have a lower generalization error. For instance, there may be noise on the test set, or two hypotheses representing the same function may have different posterior values, and the expected generalization error ought to be computed over the whole posterior, not for a single solution. Schmidhuber (1994b) proposes a general "self-improving" system whose entire life is viewed as a single training sequence and which continually attempts to modify its priors incrementally based on experience with previous problems (see also Schmidhuber 1996).

7.5 Ongoing Work on Low-Complexity Coding. FMS can also be useful for unsupervised learning. In recent work, we postulate that a generally useful code of given input data fulfills three MDL-inspired criteria: (1) it conveys information about the input data, (2) it can be computed from the data by a low-complexity mapping, and (3) the data can be computed from the code by a low-complexity mapping. To obtain such codes, we simply train an auto-associator with FMS (after training, codes are represented across the hidden units). In initial experiments, depending on data and architecture, this always led to well-known kinds of codes considered useful in previous work by numerous researchers: we sometimes obtained factorial codes, sometimes local codes, and sometimes sparse codes. In most cases, the codes were of the low-redundancy, binary kind. Initial experiments with a speech data benchmark problem (vowel recognition) already showed the true usefulness of codes obtained by FMS: feeding the codes into standard,
supervised, overfitting backpropagation classifiers, we obtained much better generalization performance than with competing approaches.

Appendix: Theoretical Justification

An alternative version of this Appendix (but with some minor errors) can be found in Hochreiter and Schmidhuber (1994). An expanded version of this Appendix (with a detailed description of the algorithm) is available on the World Wide Web (see our home pages).

A.1 Flat Nets: The Most Probable Hypotheses—Outline. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.

After listing relevant definitions, we will introduce a somewhat unconventional variant of the Gibbs algorithm, designed to take into account that FMS uses only the training data D0 to determine G(· | D0), a distribution over the set of hypotheses expressing our prior belief in hypotheses (here we do not care where the data came from; this will be treated later). This variant of the Gibbs algorithm will help us to introduce the concept of expected extended generalization error, which can be split into an overfitting error (relevant for measuring whether the learning algorithm focuses too much on the training set) and an underfitting error (relevant for measuring whether the algorithm sufficiently approximates the training set). To obtain these errors, we measure the Kullback-Leibler distance between the posterior p(· | D0) after training on the training set and the posterior pD0(· | D) after (hypothetical) training on all data (here the subscript D0 indicates that for learning D, G(· | D0) is used as prior belief in hypotheses, too). The overfitting error measures the information conveyed by p(· | D0) but not by pD0(· | D). The underfitting error measures the information conveyed by pD0(· | D) but not by p(· | D0).

We then introduce the tolerable error level and the set of acceptable minima. The latter contains hypotheses with low underfitting error, assuming that D0 indeed conveys information about the test set (every training-set-error-reducing algorithm makes this assumption). In the remainder of the Appendix, we will focus only on hypotheses within the set of acceptable minima. We introduce the relative overfitting error, which is the relative contribution of a hypothesis to the mean overfitting error on the set of acceptable minima. The relative overfitting error measures the overfitting error of hypotheses with low underfitting error. The goal is to find a hypothesis with low overfitting error and, consequently, low generalization error. The relative overfitting error is approximated based on the trade-off between low training set error and large values of G(· | D0). The distribution
G(· | D0) is restricted to the set of acceptable minima, to obtain the distribution G_M(D0)(· | D0). We then assume the data are obtained from a target chosen according to a given prior distribution. Using previously introduced distributions, we derive the expected test set error and the expected relative overfitting error. We want to reduce the latter by choosing a certain G_M(D0)(· | D0) and G(· | D0). The special case of noise-free data is considered. To be able to minimize the expected relative overfitting error, we need to adopt a certain prior belief p(f). The only unknown distributions required to determine G_M(D0)(· | D0) are p(D0 | f) and p(D | f). They describe how (noisy) data are obtained from the target. We have to make the following assumptions: the choice of prior belief is "appropriate," the noise on data drawn from the target has mean 0, and small noise is more probable than large noise (the noise assumptions ensure that reducing the training error, by choosing some h from M(D0), reduces the expected underfitting error). We do not need gaussian assumptions, though. We show that FMS approximates our special variant of the Gibbs algorithm: the prior is approximated locally in weight space, and flat net(w) are approximated by flat net(w0) with w0 near w in weight space.

A.1.1 Definitions. Let A = {(x, y) | x ∈ X, y ∈ Y} be the set of all possible input-output pairs (pairs of vectors). Let NET be the set of functions that can be implemented by the network. For every net function g ∈ NET we have g ⊂ A. Elements of NET are parameterized with a parameter vector w from the set of possible parameters W. net(w) is a function that maps a parameter vector w onto a net function g (net is surjective). Let T be the set of target functions f, where T ⊂ NET. Let H be the set of hypothesis functions h, where H ⊂ T. For simplicity, take all sets to be finite, and let all functions map each x ∈ X to some y ∈ Y. Values of functions with argument x are denoted by g(x), net(w)(x), f(x), h(x). We have (x, g(x)) ∈ g; (x, net(w)(x)) ∈ net(w); (x, f(x)) ∈ f; (x, h(x)) ∈ h.

Let D = {(x_p, y_p) | 1 ≤ p ≤ m} be the data, where D ⊂ A. D is divided into a training set D0 = {(x_p, y_p) | 1 ≤ p ≤ n} and a test set D \ D0 = {(x_p, y_p) | n < p ≤ m}. For the moment, we are not interested in how D was obtained. We use squared error E(D, h) := Σ_{p=1}^{m} ||y_p − h(x_p)||², where || · || is the Euclidean norm; E(D0, h) := Σ_{p=1}^{n} ||y_p − h(x_p)||², and E(D \ D0, h) := Σ_{p=n+1}^{m} ||y_p − h(x_p)||². E(D, h) = E(D0, h) + E(D \ D0, h) holds.

A.1.2 Learning. We use a variant of the Gibbs formalism (see Opper and Haussler 1991, or Levin et al. 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low E(D0, h), according to some conditional distribution
G(· | D0) over H. G(· | D0) is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on H), we may take a look at the training set before selecting G. For instance, one training set may suggest linear functions as being more probable than others, another one splines, and so forth. The unconventional Gibbs variant is appropriate because FMS uses only X0 (the set of first components of D0's elements; see Section 3) to compute the flatness of net(w0). The trade-off between the desire for low E(D0, h) and the a priori belief in a hypothesis according to G(· | D0) is governed by a positive constant β (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm). We obtain p(h | D0), the learning algorithm applied to data D0:

    p(h | D0) = G(h | D0) exp(−βE(D0, h)) / Z(D0, β),   (A.1)

where

    Z(D0, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D0, h)).   (A.2)
Z(D0, β) is the error momentum generating function, or the weighted accessible volume in configuration space, or the partition function (from statistical mechanics). For theoretical purposes, assume we know D and may use it for learning. To learn, we use the same distribution G(h | D0) as above (prior belief in some hypotheses h is based exclusively on the training set). There is a reason that we do not use G(h | D) instead: G(h | D) does not allow for making a distinction between a better prior belief in hypotheses and a better approximation of the test set data. However, we are interested in how G(h | D0) performs on the test set data D \ D0. We obtain

    pD0(h | D) = G(h | D0) exp(−βE(D, h)) / ZD0(D, β),   (A.3)

where

    ZD0(D, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D, h)).   (A.4)

The subscript D0 indicates that the prior belief is chosen based on D0 only.
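The two posteriors are easy to compute exactly for a small finite hypothesis set. The following toy sketch, with made-up priors and errors for five hypotheses, is ours; it only illustrates equations A.1 through A.4 and the role of the inverse temperature β.

```python
import numpy as np

def gibbs_posterior(prior, errors, beta):
    """p(h | data) ~ G(h | D0) * exp(-beta * E(data, h)) (eqs. A.1-A.4)."""
    unnorm = prior * np.exp(-beta * errors)
    return unnorm / unnorm.sum()   # normalization = partition function Z

G = np.array([0.4, 0.3, 0.15, 0.1, 0.05])        # prior belief G(. | D0)
E_D0 = np.array([0.1, 0.3, 0.0, 0.8, 0.05])      # training set errors E(D0, h)
E_rest = np.array([0.2, 0.1, 0.9, 0.4, 0.15])    # test set errors E(D \ D0, h)

for beta in (0.1, 5.0, 50.0):
    p_D0 = gibbs_posterior(G, E_D0, beta)            # eq. A.1
    p_D = gibbs_posterior(G, E_D0 + E_rest, beta)    # eq. A.3
    print(beta, p_D0.round(3), p_D.round(3))
# Small beta: posteriors stay close to the prior.
# Large beta: probability mass moves to the low-error hypotheses.
```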
A.1.3 Expected Extended Generalization Error. We define the expected extended generalization error EG(D, D0) on the unseen test exemplars D \ D0:

    EG(D, D0) := Σ_{h ∈ H} p(h | D0) E(D \ D0, h) − Σ_{h ∈ H} pD0(h | D) E(D \ D0, h).   (A.5)
Here EG(D, D0) is the mean error on D \ D0 after learning with D0, minus the mean error on D \ D0 after learning with D. The second (negative) term is a lower bound (due to nonzero temperature) for the error on D \ D0 after learning the training set D0. For the zero-temperature limit β → ∞ we get (summation convention explained below)

    EG(D, D0) = Σ_{h ∈ H, D0 ⊂ h} ( G(h | D0) / Z(D0) ) E(D \ D0, h),   where   Z(D0) = Σ_{h ∈ H, D0 ⊂ h} G(h | D0).
In this case, the generalization error depends on G(h | D0), restricted to those hypotheses h compatible with D0 (D0 ⊂ h). For β → 0 (full stochasticity), we get EG(D, D0) = 0.

Summation convention: In general, Σ_{h ∈ H, D0 ⊂ h} denotes summation over those h satisfying h ∈ H and D0 ⊂ h. In what follows, we keep an analogous convention: the first symbol is the running index, for which additional expressions specify conditions.

A.1.4 Overfitting and Underfitting Error. Let us separate the generalization error into an overfitting error Eo and an underfitting error Eu (in analogy to Wang et al. 1994 and Guyon et al. 1992). We will see that overfitting and underfitting error correspond to the two different error terms in our algorithm: decreasing one term is equivalent to decreasing Eo, and decreasing the other is equivalent to decreasing Eu.

Using the Kullback-Leibler distance (Kullback 1959), we measure the information conveyed by p(· | D0) but not by pD0(· | D) (see Fig. 4). We may view this as information about G(· | D0): since there are more h compatible with D0 than there are h compatible with D, G(· | D0)'s influence on p(h | D0) is stronger than its influence on pD0(h | D). To get the nonstochastic bias (see the definition of EG), we divide this information by β and obtain the overfitting error:

    Eo(D, D0) := (1/β) Σ_{h ∈ H} p(h | D0) ln( p(h | D0) / pD0(h | D) )
               = Σ_{h ∈ H} p(h | D0) E(D \ D0, h) + (1/β) ln( ZD0(D, β) / Z(D0, β) ).   (A.6)
Figure 4: Positive contributions to the overfitting error Eo(D, D0), after learning the training set with a large β.

Analogously, we measure the information conveyed by pD0(· | D) but not by p(· | D0) (see Fig. 5). This information is about D \ D0. To get the nonstochastic bias (see the definition of EG), we divide this information by β and obtain the underfitting error:

    Eu(D, D0) := (1/β) Σ_{h ∈ H} pD0(h | D) ln( pD0(h | D) / p(h | D0) )
               = −Σ_{h ∈ H} pD0(h | D) E(D \ D0, h) + (1/β) ln( Z(D0, β) / ZD0(D, β) ).   (A.7)
Peaks in G(· | D0) that do not match peaks of pD0(· | D) produced by D \ D0 lead to overfitting error. Peaks of pD0(· | D) produced by D \ D0 that do not match peaks of G(· | D0) lead to underfitting error. Overfitting and underfitting error thus tell us something about the shape of G(· | D0) with respect to D \ D0, that is, about the degree to which the prior belief in h is compatible with D \ D0.
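Equations A.5 through A.7 can be checked numerically; anticipating the identity EG = Eo + Eu shown in Section A.1.6, the self-contained toy sketch below (random priors and errors, all names ours) verifies that the KL-based definitions decompose the expected extended generalization error exactly.

```python
import numpy as np

def posterior(prior, err, beta):
    p = prior * np.exp(-beta * err)
    return p / p.sum()

rng = np.random.default_rng(5)
n_h = 8
G = rng.dirichlet(np.ones(n_h))          # prior belief G(. | D0)
E_train = rng.uniform(0, 1, n_h)         # E(D0, h)
E_test = rng.uniform(0, 1, n_h)          # E(D \ D0, h)
beta = 3.0

p_D0 = posterior(G, E_train, beta)               # trained on D0
p_D = posterior(G, E_train + E_test, beta)       # trained on all of D

E_G = p_D0 @ E_test - p_D @ E_test                       # eq. A.5
E_o = (p_D0 * np.log(p_D0 / p_D)).sum() / beta           # eq. A.6
E_u = (p_D * np.log(p_D / p_D0)).sum() / beta            # eq. A.7
print(E_G, E_o + E_u)   # the two agree: E_G = E_o + E_u
```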
26
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Figure 5: Positive contributions to the underfitting error Eu(D, D0), after learning the training set with a small β. Again, we use the D-posterior from Figure 4, assuming it is almost fully determined by E(D, h) (even if β is smaller than in Figure 4).
A.1.5 Why Are They Called "Overfitting" and "Underfitting" Error? Positive contributions to the overfitting error are obtained where peaks of p(· | D0) do not match (or are higher than) peaks of pD0(· | D): there, some h will have large probability after training on D0 but lower probability after training on all data D. This is either because D0 has been approximated too closely or because of sharp peaks in G(· | D0); the learning algorithm specializes either on D0 or on G(· | D0) ("overfitting"). The specialization on D0 will become even worse if D0 is corrupted by noise (the case of noisy D0 will be treated later). Positive contributions to the underfitting error are obtained where peaks of pD0(· | D) do not match (or are higher than) peaks of p(· | D0): there, some h will have large probability after training on all data D but lower probability after training on D0. This is due either to a poor D0 approximation (note that p(· | D0) is almost fully determined by G(· | D0)) or to insufficient information about D conveyed by D0 ("underfitting"). Either the algorithm did not learn "enough" of D0, or D0 does not tell us anything about D. In the latter case, there is nothing we can do; we have to focus on the case where we did not learn enough about D0.

A.1.6 Analysis of Overfitting and Underfitting Error. EG(D, D0) = Eo(D, D0) + Eu(D, D0) holds. For the zero-temperature limit β → ∞, we obtain ZD0(D) = Σ_{h ∈ H, D ⊂ h} G(h | D0) and Z(D0) = Σ_{h ∈ H, D0 ⊂ h} G(h | D0), so that Eo(D, D0) = Σ_{h ∈ H, D0 ⊂ h} ( G(h | D0) / Z(D0) ) E(D \ D0, h) = EG(D, D0) and Eu(D, D0) = 0; that is, there is no underfitting error. For β → 0 (full stochasticity) we get Eu(D, D0) = 0 and Eo(D, D0) = 0 (recall that EG is not the conventional but the extended expected generalization error). Since D0 ⊂ D, ZD0(D, β) < Z(D0, β) holds. In what follows, averages after learning on D0 are denoted by ⟨·⟩_D0, and averages after learning on D are denoted by ⟨·⟩_D. Since

    ZD0(D, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D0, h)) exp(−βE(D \ D0, h)),

we have

    ZD0(D, β) / Z(D0, β) = Σ_{h ∈ H} p(h | D0) exp(−βE(D \ D0, h)) = ⟨exp(−βE(D \ D0, ·))⟩_D0.

Analogously, we have Z(D0, β) / ZD0(D, β) = ⟨exp(βE(D \ D0, ·))⟩_D. Thus,

    Eo(D, D0) = ⟨E(D \ D0, ·)⟩_D0 + (1/β) ln⟨exp(−βE(D \ D0, ·))⟩_D0,   and
    Eu(D, D0) = −⟨E(D \ D0, ·)⟩_D + (1/β) ln⟨exp(βE(D \ D0, ·))⟩_D.²

With large β, after learning on D0, Eo measures the difference between the average test set error and a minimal test set error. With large β, after learning on D, Eu measures the difference between the average test set error and a maximal test set error. Assume we do have a large β (large enough to exceed the minimum of (1/β) ln( ZD0(D, β) / Z(D0, β) )). We have to assume that D0 indeed conveys information about the test set: preferring hypotheses h with small E(D0, h) by using a larger β leads to smaller test set error (without this assumption, no error-decreasing algorithm would make sense). Eu can be decreased by enforcing less stochasticity (by further increasing β), but this will increase Eo. Similarly, decreasing β (enforcing more stochasticity) will decrease Eo but increase Eu. Increasing β decreases the maximal test set error after learning D more than it decreases the average test set error, thus decreasing Eu, and vice versa. Decreasing β increases the minimal test set error after learning D0 more than it increases the average test set error, thus decreasing Eo, and vice versa. This is the trade-off between stochasticity and fitting the training set, governed by β.

A.1.7 Tolerable Error Level and Set of Acceptable Minima. Let us implicitly define a tolerable error level Etol(α, β) that, with confidence 1 − α, is the upper bound of the training set error after learning:

    p(E(D0, h) ≤ Etol(α, β)) = Σ_{h ∈ H, E(D0,h) ≤ Etol(α,β)} p(h | D0) = 1 − α.   (A.8)

With (1 − α)-confidence, we have E(D0, h) ≤ Etol(α, β) after learning. Etol(α, β) decreases with increasing β, α. Now we define M(D0) := {h ∈ H | E(D0, h) ≤ Etol(α, β)}, which is the set of acceptable minima (see Section 2). The set of acceptable minima is a set of hypotheses with low underfitting error.
We have −
∂ 2 ln Z(D0 ,β) ∂β 2
∂ ln Z(D0 ,β) ∂β
∂ ln ZD0 (D,β) ∂β ∂ 2 ln ZD0 (D,β)
= hD0 E(D0 , ·)i and − , ·)i)2 i
= hD E(D, ·)i. Furthermore,
= hD0 (E(D0 , ·) − hD0 E(D0 and = hD (E(D, ·) − hD E(D, ·)i)2 i. ∂β 2 See also Levin et al. (1990). Using these expressions, it can be shown: by increasing β (starting from β = 0), we will find a β that minimizes further makes this expression go to 0.
1 β
ln
ZD0 (D,β) Z(D0 ,β)
< 0. Increasing β
28
Sepp Hochreiter and Jurgen ¨ Schmidhuber
With probability 1 − α, the learning algorithm selects a hypothesis from M(D0 ) ⊂ H. Note that for the zero temperature limit β → ∞, we have Etol (α) = 0 and M(D0 ) = {h ∈ H | D0 ⊂ h}. By fixing a small Etol (or a large β), Eu will be forced to be low. We would like to have an algorithm decreasing (1) training set error (this corresponds to decreasing underfitting error) and (2) an additional error term, which should be designed to ensure low overfitting error, given a fixed small Etol . The remainder of Section A.1 will lead to an answer as to how to design this additional error term. Since low underfitting is obtained by selecting a hypothesis from M(D0 ), in what follows we will focus on M(D0 ) only. Using an appropriate choice of prior belief, at the end of this section, we will finally see that the overfitting error can be reduced by an error term expressing preference for flat nets. A.1.8. Relative Overfitting Error. Let us formally define the relative overfitting error Ero , which is the relative contribution of some h ∈ M(D0 ) to the mean overfitting error of hypotheses set M(D0 ): E (D, D0 , M(D0 ), h) = pM(D0 ) (h | D0 )E(D \ D0 , h), where ro p(h | D0 ) h∈M(D0 ) p(h | D0 )
pM(D0 ) (h | D0 ) := P
(A.9) (A.10)
for h ∈ M(D0 ), and zero otherwise. For h ∈ M(D0 ), we approximate p(h | D0 ) as follows. We assume that G(h | D0 ) is large where E(D0 , h) is large (trade-off between low E(D0 , h) and G(h | D0 )). Then p(h | D0 ) has large values (due to large G(h | D0 )) where E(D0 , h) ≈ Etol (α, β) (assuming Etol (α, β) is small). We get p(h | D0 ) ≈
G(h | D0 ) exp(−βEtol (α, β)) Z(D0 , β)
The relative overfitting error can now be approximated by Ero (D, D0 , M(D0 ), h) ≈ P
G(h | D0 ) E(D \ D0 , h) . h∈M(D0 ) G(h | D0 )
(A.11)
To obtain a distribution over M(D0 ), we introduce GM(D0 ) (· | D0 ), the normalized distribution G(· | D0 ) restricted to M(D0 ). For approximation A.11 we have Ero (D, D0 , M(D0 ), h) ≈ GM(D0 ) (h | D0 )E(D \ D0 , h) .
(A.12)
A.1.9. Prior Belief in f and D. Assume D was obtained from a target function f . Let p( f ) be the prior on targets and p(D | f ) the probability of
Flat Minima
29
obtaining D with a given f . We have p( f | D0 ) =
p(D0 | f )p( f ) , p(D0 )
(A.13)
P where p(D0 ) = f ∈T p(D0 | f )p( f ) . The data are drawn from a target function with added noise (the noisefree case is treated below). We do not make any assumptions about the nature of the noise; it does not have to be gaussian (as in MacKay’s work, 1992b). We want to select a G(· | D0 ) that makes Ero small; that is, those h ∈ M(D0 ) with small E(D \ D0 , h) should have high probabilities G(h | D0 ). We do not know D \ D0 during learning. D is assumed to be drawn from a target f . We compute the expectation of Ero , given D0 . The probability of the test set D \ D0 , given D0 , is p(D \ D0 | D0 ) =
X
p(D \ D0 | f )p( f | D0 ),
(A.14)
f ∈T
where we assume p(D \ D0 | f, D0 ) = p(D \ D0 | f ) (we do not remember which exemplars were already drawn). The expected test set error E(·, h) for some h, given D0 , is X
p(D \ D0 | D0 )E(D \ D0 , h)
D\D0
=
X f ∈T
p( f | D0 )
X
(A.15)
p(D \ D0 | f )E(D \ D0 , h) .
D\D0
The expected relative overfitting error Ero (·, D0 , M(D0 ), h) is obtained by inserting equation A.15 into equation A.12: Ero (·, D0 , M(D0 ), h) X X ≈ GM(D0 ) (h | D0 ) p( f | D0 ) p(D \ D0 | f )E(D \ D0 , h). f ∈T
(A.16)
D\D0
A.1.10. Minimizing Expected Relative Overfitting Error. We define a GM(D0 ) (· | D0 ) such that GM(D0 ) (· | D0 ) has its largest value near small expected test set error E(·, ·) (see equations A.12 and A.15). This definition leads to a low expectation of Ero (·, D0 , M(D0 ), ·) (see equation A.16). Define GM(D0 ) (h | D0 ) := δ(argminh0 ∈M(D0 ) (E(·, h0 )) − h),
(A.17)
where δ is the Dirac delta function, which we will use with loose formalism; the context will make clear how the delta function is used.
30
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Using equation A.15 we get GM(D0 ) (h | D0 )
X
= δ argminh0 ∈M(D0 )
X
p( f | D0 )
f ∈T
(A.18) p(D \ D0 | f )E(D \ D0 , h0 ) − h .
D\D0
GM(D0 ) (· | D0 ) determines the hypothesis h from M(D0 ) that leads to the lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error. GM(D0 ) helps us define G: G(h | D0 ) := P
ζ + GM(D0 ) (h | D0 ) , h∈H (ζ + GM(D0 ) (h | D0 ))
(A.19)
/ M(D0 ), and where ζ is a small constant where GM(D0 ) (h | D0 ) = 0 for h ∈ ensuring positive probability G(h | D0 ) for all hypotheses h. To appreciate the importance of the prior p( f ) in the definition of GM(D0 ) (see also equation A.24), in what follows, we will focus on the noise-free case. Let p(D0 | f ) be equal to
A.1.11. The Special Case of Noise-Free Data. δ(D0 ⊂ f ) (up to a normalizing constant): δ(D0 ⊂ f )p( f ) . p( f | D0 ) = P f ∈T,D0 ⊂ f p( f )
(A.20)
δ(D\D0 ⊂ f ) Assume p(D \ D0 | f ) = P
. Let F be the number of elements P δ(D\D0 ⊂ f ) in X. p(D \ D0 | f ) = . We expand D\D0 p(D \ D0 | f )E(D \ D0 , h) 2F−n from equation A.15: D\D0
1 2F−n
X
δ(D\D0 ⊂ f )
E(D \ D0 , h) =
D\D0 ⊂ f
= =
1 2F−n 1 2F−n
X
X
E((x, y), h)
(A.21)
D\D0 ⊂ f (x,y)∈D\D0
X
E((x, y), h)
(x,y)∈ f \D0
¶ F−n µ X F−n−1 i=1
i−1
1 E( f \ D0 , h) . 2
P Here E((x, y), h) = ky − h(x)k2 , E( f \ D0 , h) = (x,y)∈ f \D0 ky − h(x)k2 , and PF−n ¡F−n−1¢ = 2F−n−1 . The factor 1/2 results from considering the mean i=1 i−1
Flat Minima
31
test set error (where the test set is drawn from f ), whereas E( f \ D0 , h) is the maximal test set error (obtained by using a maximal test set). From equations A.15 and A.21, we obtain the expected test set error E(·, h) for some h, given D0 : X
p(D \ D0 | D0 )E(D \ D0 , h) =
D\D0
1X p( f | D0 )E( f \ D0 , h) . 2 f ∈T
(A.22)
From equations A.22 and A.12, we obtain the expected Ero (·,D0 ,M(D0 ),h): Ero (·, D0 , M(D0 ), h) ≈
1 2
GM(D0 ) (h | D0 )
P f ∈T
(A.23) p( f | D0 )E( f \ D0 , h) .
For GM(D0 ) (h | D0 ) we obtain in this noise-free case GM(D0 ) (h | D0 ) =
X
δ argminh0 ∈M(D0 )
(A.24)
p( f | D0 )E( f \ D0 , h0 ) − h .
f ∈T
The lowest expected test set error is measured by D0 , h). See equation A.22.
1 2
P
f ∈T
p( f | D0 )E( f \
A.1.12. Noisy Data and Noise-Free Data: Conclusion. For both the noisefree and the noisy case, equation A.14 shows that given D0 and h, the expected test set error depends on prior target probability p( f ). A.1.13. Choice of Prior Belief. Now we select some p( f ), our prior belief in target f . We introduce a formalism similar to Wolpert’s (1994a ). p( f ) is defined as the probability of obtaining f = net(w) by choosing a w randomly according to p(w). R Let us first look at Wolpert’s formalism: p( f ) = dwp(w)δ(net(w) − f ). By restricting W to Winj , he obtains an injective function netinj : Winj → NET : netinj (w) = net(w) , which is net restricted to Winj . netinj is surjective (because net is surjective): Z δ(netinj (w) − f ) |det net0inj (w)|dw p(w) (A.25) p( f ) = |det net0inj (w)| W Z δ(g − f ) = dg p(net−1 inj (g)) |det net0inj (net−1 NET inj (g))| =
p(net−1 inj ( f )) |det net0inj (net−1 inj ( f ))|
,
32
Sepp Hochreiter and Jurgen ¨ Schmidhuber
where |det net0inj (w)| is the absolute Jacobian determinant of netinj , evaluated at w. If there is a locally flat net(w) = f (flat around w), then p( f ) is high. However, we prefer to follow another path. Our algorithm (flat minimum search) tends to prune a weight wi if net(w) is very flat in wi ’s direction. It prefers regions where det net0 (w) = 0 (where many weights lead to the same net function). Unlike Wolpert’s approach, ours distinguishes the probabilities of targets f = net(w) with det net0 (w) = 0. The advantage is that we do not only search for net(w) that are flat in one direction but for net(w) that are flat in many directions (this corresponds to a higher probability of the corresponding targets). Define net−1 (g) := {w ∈ W | net(w) = g}
(A.26)
and X
p(net−1 (g)) :=
p(w) .
(A.27)
w∈net−1 (g)
We have P p( f ) = P
g∈NET f ∈T
P
p(net−1 (g))δ(g − f )
g∈NET
p(net−1 (g))δ(g
− f)
= P
p(net−1 ( f )) . (A.28) −1 f ∈T p(net ( f ))
net partitions W into equivalence classes. To obtain p( f ), we compute the probability of w being in the equivalence class {w | net(w) = f }, if randomly chosen according to p(w). An equivalence class corresponds to a net function; net maps all w of an equivalence class to the same net function. A.1.14. Relation to FMS Algorithm. FMS (from Section 3) works locally in weight space W. Let w0 be the actual weight vector found by FMS (with h = net(w0 )). Recall the definition of GM(D0 ) (h | D0 ) (see equations A.17 and A.18); we want to find a hypothesis h that best approximates those f with large p( f ) (the test data have a high probability of being drawn from such targets). We will see that those f = net(w) with flat net(w) locally have high probability p( f ). Furthermore we will see that a w0 close to w with flat net(w) has flat net(w0 ) too. To approximate such targets f , the only thing we can do is find a w0 close to many w with net(w) = f and large p( f ). To justify this approximation (see definition of p( f | D0 ) while recalling that h ∈ GM(D0 ) ), we assume that the noise has mean 0 and that small noise is more likely than large noise (e.g., gaussian, Laplace, Cauchy distributions). To restrict p( f ) = p(net(w)) to a local range in W, we define regions of equal net functions F(w) = {w¯ | ∀τ 0 ≤ τ ≤ 1, w + τ (w¯ − w) ∈ W : net(w) = net(w + τ (w¯ − w))}. Note that F(w) ⊂ net−1 (net(w)). If net(w) is flat along long distances in many directions w¯ − w, then F(w) has many elements. Locally in weight space, at w0 with h = net(w0 ), for γ > 0 we define: if the
Flat Minima
33
¯ = f } exists, then minimum w = argminw¯ {kw¯ − w0 k | kw¯ − w0 k < γ , net(w) pw0 ,γ ( f ) = c p(F(w)), where c is a constant. If this minimum does not exist, then pw0 ,γ ( f ) = 0. pw0 ,γ ( f ) locally approximates p( f ). During search for w0 (corresponding to a hypothesis h = net(w0 )), to decrease locally the expected test set error (see equation A.15), we want to enter areas where many large F(w) are near w0 in weight space. We wish to decrease the test set error, which is caused by drawing data from highly probable targets f (those with large pw0 ,γ ( f )). We do not know, however, which w’s are mapped to target’s f by net(·). Therefore, we focus on F(w) (w near w0 in weight space), instead of pw0 ,γ ( f ). Assume kw0 − wk is small enough to allow for a Taylor expansion, and that net(w0 ) is flat in direction (w¯ − w0 ): net(w) = net(w0 + (w − w0 )) = net(w0 ) + ∇net(w0 )(w − w0 ) 1 + (w − w0 )H(net(w0 ))(w − w0 ) + · · · , 2 where H(net(w0 )) is the Hessian of net(·) evaluated at w0 , ∇net(w)(w¯ − w0 ) = ∇net(w0 )(w¯ − w0 ) + O(w − w0 ), and (w¯ − w0 )H(net(w))(w¯ − w0 ) = (w¯ − w0 )H(net(w0 ))(w¯ − w0 ) + O(w − w0 ) (analogously for higher-order derivatives). We see that in a small environment of w0 , there is flatness in direction (w¯ − w0 ) too. And, if net(w0 ) is not flat in any direction, this property also holds within a small environment of w0 . Only near w0 with flat net(w0 ), there may exist w with large F(w). Therefore, it is reasonable to search for a w0 with h = net(w0 ), where net(w0 ) is flat within a large region. This means searching for the h determined by GM(D0 ) (· | D0 ) of equation A.17. Since h ∈ M(D0 ), E(D0 , net(w0 )) ≤ Etol holds, we search for a w0 living within a large connected region, where for all w within this region P E(net(w0 ), net(w), X) = x∈X knet(w0 )(x) − net(w)(x)k2 ≤ ², where ² is defined in Section 2. To conclude, we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see the definition of flat minima in Section 2). A.1.15. Practical Realization of the Gibbs Variant. 1. Select α and Etol (α, β), thus implicitly choosing β. 2. Compute the set M(D0 ). 3. Assume we know how data are obtained from target f ; that is, we know p(D0 | f ), p(D\D0 | f ), and the prior p( f ). Then we can compute GM(D0 ) (· | D0 ) and G(· | D0 ). 4. Start with β = 0 and increase β until equation A.8 holds. Now we know the β from the implicit choice above.
34
Sepp Hochreiter and Jurgen ¨ Schmidhuber
5. Since we know all we need to compute p(h | D0 ), select some h according to this distribution. A.1.16. Three Comments on Certain FMS Limitations. 1. FMS only approximates the Gibbs variant given by the definition of GM(D0 ) (h | D0 ) (see equations A.17 and A.18). We only locally approximate p( f ) in weight space. If f = net(w) is locally flat around w, then there exist units or weights that can be given with low precision (or can be removed). If there are other weights wi with net(wi ) = f , then one may assume that there are also points in weight space near such wi where weights can be given with low precision (think of, for example, symmetrical exchange of weights and units). We assume the local approximation of p( f ) is good. The most probable targets represented by flat net(w) are approximated by a hypothesis h, which is also represented by a flat net(w0 ) (where w0 is near w in weight space). To allow for approximation of net(w) by net(w0 ), we have to assume that the hypothesis set H is dense in the target set T. If net(w0 ) is flat in many directions, then there are many net(w) = f that share this flatness and are well approximated by net(w0 ). The only reasonable thing FMS can do is to make net(w0 ) as flat as possible in a large region around w0 , to approximate the net(w) with large prior probability (recall that flat regions are approximated by axis-aligned boxes, as discussed in Section 7.2). This approximation is fine if net(w0 ) is smooth enough in “unflat” directions (small changes in w0 should not result in very different net functions). 2. Concerning point 3 above. p( f | D0 ) depends on p(D0 | f ) (how the training data are drawn from the target; see equation A.13). GM(D0 ) (h | D0 ) depends on p( f | D0 ) and p(D\D0 | f ) (how the test data are drawn from the target). Since we do not know how the data are obtained, the quality of the approximation of the Gibbs algorithm may suffer from noise that has not mean 0, or from large noise being more probable than small noise. Of course, if the choice of prior belief does not match the true target distribution, the quality of GM(D0 ) (h | D0 )’s approximation will suffer as well. 3. Concerning point 5 above. FMS outputs only a single h instead of p(h | D0 ). This issue is discussed in Section 7.3. A.1.17. Conclusion. Our FMS algorithm from Section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made. The first is that an appropriate choice of prior belief has been made. The second is that the noise on the data is not too “weird” (mean 0, small noise more likely). The two assumptions are necessary for any algorithm based
Flat Minima
35
on an additional error term besides the training error. The approximations are that p( f ) is approximated locally in weight space, and flat net(w) are approximated by flat net(w0 ) with w0 near w’s. Our Gibbs variant takes into account that FMS uses only X0 for computing flatness. A.2. Why Does the Hessian Decrease? This section shows that secondorder derivatives of the output function vanish during flat minimum search. This justifies the linear approximations in Section 4. A.2.1. Intuition. We show that the algorithm tends to suppress the following values: unit activations, first-order activation derivatives, and the sum of all contributions of an arbitrary unit activation to the net output. Since weights, inputs, activation functions, and their first- and second-order derivatives are bounded, the entries in the Hessian decrease where the corresponding |δwij | increase. A.2.2. Formal Details. We consider a strictly layered feedforward network with K output units and g layers. We use the same activation function f for all units. For simplicity, in what follows we focus on a single input vector xp . xp (and occasionally w itself) will be notationally suppressed. We have ( yj ∂yl 0 = f (sl ) P ∂ym ∂wij m wlm wij
)
for i = l for i 6= l
,
(A.29)
P where ya denotes the activation of the ath unit, and sl = m wlm ym . The last term of equation 3.1 (the “regulator”) expresses output sensitivity (to be minimized) with respect to simultaneous perturbations of all weights. “Regulation” is done by equalizing the sensitivity of the output units with respect to the weights. The “regulator” does not influence the same particular units or weights for each training example. It may be ignored for the purposes of this section. Of course, the same holds for the first (constant) term in equation 3.1. We are left with the second term. With equation A.29, we obtain: X
log
i,j
X k
=2
Ã
∂ok ∂wij
!2
X
(fan-in of unit k) log | f 0 (sk )|
unit k in the gth layer
+2
X unit j in the (g−1)th layer
(fan-out of unit j) log |y j |
36
Sepp Hochreiter and Jurgen ¨ Schmidhuber
X
+
(fan-in of unit j) log
unit j in the (g−1)th layer
f 0 (sk )wkj
¢2
k
X
+2
X¡
(fan-in of unit j) log | f 0 (sj )|
unit j in the (g−1)th layer
X
+2
(fan-out of unit j) log |y j |
unit j in the (g−2)th layer
X
+
(fan-in of unit j) log
unit j in the (g−2)th layer
à f 0 (sk )
×
X
X k
!2 f 0 (sl )wkl wlj
l
X
+2
(fan-in of unit j) log | f 0 (sj )|
unit j in the (g−2)th layer
X
+2
(fan-out of unit j) log |y j |
unit j in the (g−3)th layer
X
+
log
i,j, where unit i in a layer <(g−2)
à ×
0
f (sk )
X l1
0
f (sl1 )wkl1
X l2
X k
∂yl2 wl1 l2 ∂wij
!2 . (A.30)
Let us have a closer look at this equation. We observe: 1. Activations of units decrease in proportion to their fan-outs. 2. First-order derivatives of the activation functions decrease in proportion to their fan-ins. P P P P 3. A term of the form k ( f 0 (sk ) l1 f 0 (sl1 )wkl1 l2 · · · lr f 0 (slr )wlr−1 lr wlr j )2 expresses the sum of unit j’s squared contributions to the net output. Here r ranges over {0, 1, . . . , g−2}, Pand unit j is in the (g−1−r)th layer (for the special case r = 0, we get k ( f 0 (sk )wkj )2 ). These terms also decrease in proportion to unit j’s fan-in. Analogously, equation A.30 can be extended to the case of additional layers. A.2.3. Comment. Let us assume that f 0 (sj ) = 0 and f (sj ) = 0 is “difficult to achieve” (can be achieved only by fine-tuning all weights on connections to unit j). Instead of minimizing | f (sj )| or | f 0 (sj )| by adjusting the net input of unit j (this requires fine-tuning of many weights), our algorithm prefers pushing weights wkl on connections to output units toward zero (other weights are less affected). On the other hand, if f 0 (sj ) = 0 and f (sj ) = 0
Flat Minima
37
is not difficult to achieve, then, unlike weight decay, our algorithm does not necessarily prefer weights close to zero. Instead, it prefers (possibly very strong) weights that push f (sj ) or f 0 (sj ) toward zero (e.g., with sigmoid units active in [0,1]: strong inhibitory weights are preferred; with gaussian units: high absolute weight values are preferred). See the experiment in Section 5.2. A.2.4. How Does This Influence the Hessian? The entries in the Hessian corresponding to output ok can be written as follows: ∂ 2 ok f 00 (sk ) ∂ok ∂ok = 0 ∂wij ∂wuv ( f (sk ))2 ∂wij ∂wuv à ! X ∂ 2 yl ∂y j ∂yv 0 ¯ ¯ + f (sk ) wkl + δik + δuk , ∂wij ∂wuv ∂wuv ∂wij l
(A.31)
where δ¯ is the Kronecker delta. Searching for big boxes, we run into regions of acceptable minima with ok ’s close to target (Section 2). Thus, by scaling f 00 (s ) the targets, ( f 0 (s k))2 can be bounded. Therefore, the first term in equation A.31 k decreases during learning. According to the analysis above, the first-order derivatives in the second term of equation A.31 are pushed toward zero. So are the wkl of the sum in the second term of equation A.31. The only remaining expressions of interest are second-order derivatives ∂ 2 yl
of units in layer (g − 1). The ∂wij ∂wuv are bounded if the weights, the activation functions, their first- and second-order derivatives, and the inputs are bounded. This is indeed the case, as will be shown for networks with one or two hidden layers: In case 1, for unit l in a single hidden layer (g = 3), we obtain ¯ ¯ ¯ ∂ 2 yl ¯ ¯ ¯ (A.32) ¯ = |δ¯li δ¯lu f 00 (sl )y j yv | < C1 , ¯ ¯ ∂wij ∂wuv ¯ where y j , yv are the components of an input vector xp , and C1 is a positive constant. In case 2, for unit l in the third layer of a net with two hidden layers (g = 4), we obtain ¯ ¯ ¯ ∂ 2 yl ¯ ¯ ¯ ¯ = | f 00 (sl )(wli y j + δ¯il y j )(wlu yv + δ¯ul yv ) ¯ ¯ ∂wij ∂wuv ¯ + f 0 (sl )(wli δ¯iu f 00 (si )y j yv + δ¯il δ¯uj f 0 (sj )yv + δ¯ul δ¯iv f 0 (sv )y j )| < C2 ,
(A.33)
38
Sepp Hochreiter and Jurgen ¨ Schmidhuber
where C2 is a positive constant. Analogously, the boundedness of secondorder derivatives can be shown for additional hidden layers. k decrease A.2.5. Conclusion. As desired, our algorithm makes the Hij,uv where |δwij | or |δwuv | increase.
A.3. Explicit Derivative of Equation 3.1. We explicitly compute the derivatives of equation 3.1. For simplicity, in what follows we focus on a single input vector xp . Again, xp (and occasionally w itself) will be notationally suppressed. The derivative of the right-hand side of equation 3.1 is: ∂B(w,xp ) ∂wuv
=
P
∂ok ∂ 2 ok k ∂wij ∂wij ∂wuv 2 ∂om m ∂wij
P
P ³
i,j
´
¯ ¯ P ³ ∂om ´2 ∂ok P ∂om ¯ k ¯ ∂o ³ ´ ∂ 2 ok − ∂wij P m ∂wij P q ¯ ∂wij ¯ P sign ∂ok ∂wij ∂wuv m ∂wij P ∂om 2 i,j ∂wij µ ¶ 32 k i,j ³ ´ 2 ) ( P ∂w m m
ij
+L
2
m
∂o ∂wij
∂ 2 om ∂wij ∂wuv
.
¯ ¯ ¯ ∂ok ¯ ¯ ∂wij ¯ P P r k i,j P ³ ´2 m
∂om ∂wij
(A.34)
To compute equation 3.2, we need ∂ok ∂w
∂B(w,xp ) ∂
³
ij ´ = P ³ ∂om ´2 ∂ok
∂wij
m
∂wij
¯ m¯ ¯ 2 ∂om ∂ok ∂om ³ ´ δ¯mk P ( ∂w ∂o ¯ ) − P ∂wij ∂wij ¯ P q ¯ ∂w m ij ∂om lr sign ∂wij µ ¶ 32 m l,r P ³ ´ ¯ 2 m 2 ∂o P ) ( ¯ m ¯ ∂wlr m
+L
P m
P l,r
2
¯ m
∂o ∂wij
,
(A.35)
m | ∂o | ∂wlr
qP ¡ ¢ ¯ 2 ∂om ¯ m
∂wlr
where δ¯ is the Kronecker delta. Using the nabla operator and equation A.35, we can compress equation A.34: ¶ µ X Hk ∇ ∂ok B(w, xp ) , (A.36) ∇uv B(w, xp ) = k
∂wij
Flat Minima
39
where Hk is the Hessian of the output ok . Since the sums over l, r in equation A.35 need to be computed only once (the results are reusable for all i, j), ∇ ∂ok B(w, xp ) can be computed in O(L) time. The product of the Hessian and ∂wij
a vector can be computed in O(L) time. With constant number of output units, the computational complexity of our algorithm is O(L). The long version of this paper (available on the World-Wide Web; see our home pages) contains pseudo-code of an efficient implementation. It is based on fast multiplication of the Hessian and a vector due to Pearlmutter (1994) and Møller (1993). Acknowledgments We thank David Wolpert and several anonymous referees for numerous comments that helped to improve previous drafts of this article. This work was supported by DFG grant SCHM 942/3-1 from Deutsche Forschungsgemeinschaft. References Akaike, H. 1970. Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203–217. Amari, S., and Murata, N. 1993. Statistical theory of learning curves under entropic loss criterion. Neural Computation 5(1), 140–153. Ash, T. 1989. Dynamic node creation in backpropagation neural networks. Connection Science 1(4), 365–375. Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feed-forward networks. IEEE Transactions on Neural Networks 4(5), 882–884. Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Systems 5, 603–643. Carter, M. J., Rudolph, F. J., and Nucci, A. xJ. 1990. Operational fault tolerance of CMAC networks. In Advances in Neural Information Processing Systems 2, pp. 340–347. Morgan Kaufmann, San Mateo, CA. Craven, P., and Wahba, G. 1979. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31, 377–403. Eubank, R. L. 1988. Spline smoothing and nonparametric regression. In SelfOrganizing Methods in Modeling, S. Farlow, ed. Marcel Dekker, New York. Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning algorithm. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 525–532. Morgan Kaufmann, San Mateo, CA. Golub, G., Heath, H., and Wahba, G. 1979. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, 215–224. Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information
40
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 471–479. Morgan Kaufmann, San Mateo, CA. Hanson, S. J., and Pratt, L. Y. 1989. Comparing biases for minimal network construction with backpropagation. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 177–185. Morgan Kaufmann, San Mateo, CA. Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164–171. Morgan Kaufmann, San Mateo, CA. Hastie, T. J., and Tibshirani, R. J. 1990. Generalized additive models. Monographs on Statisics and Applied Probability 43. Hinton, G. E., and van Camp, D. 1993. Keeping neural networks simple. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pp. 11–18. Springer-Verlag, Berlin. Hochreiter, S. and Schmidhuber, J. 1994. Flat minimum search finds simple nets. Tech. Rep. FKI-200-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Mun¨ chen. Hochreiter, S., and Schmidhuber, J. 1995. Simplifying nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 529–536. MIT Press, Cambridge MA. Holden, S. B. 1994. On the theory of generalization and self-structuring in linearly weighted connectionist networks. Ph.D. thesis, Cambridge University. Kerlirzin, P., and Vallet, F. 1993. Robustness in multilayer perceptrons. Neural Computation 5(1), 473–482. Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 950–957. Morgan Kaufmann, San Mateo, CA. Kullback, S. 1959. Statistics and Information Theory. John Wiley, New York. LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598–605. Morgan Kaufmann, San Mateo, CA. Levin, E., Tishby, N., and Solla, S. 1990. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE 78(10), 1568–1574. Levin, A. U., Leen, T. K., and Moody, J. E. 1994. Fast pruning using principal components. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 35–42. Morgan Kaufmann, San Mateo, CA. MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4, 415–447. MacKay, D. J. C. 1992b. A practical Bayesian framework for backprop networks. Neural Computation 4, 448–472. Matsuoka, K. 1992. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics 22(3), 436–440. Minai, A. A., and Williams, R. D. 1994. Perturbation response in feedforward networks. Neural Networks 7(5), 783–796.
Flat Minima
41
Møller, M. F. 1993. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(N) time. Tech. Rep. PB-432, Computer Science Department, Aarhus University, Denmark. Moody, J. E. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 29–39. Morgan Kaufmann, San Mateo, CA. Moody, J. E. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 847–854. Morgan Kaufmann, San Mateo, CA. Moody, J. E., and Utans, J. 1992. Principled architecture selection strategies for neural networks: Application to corporate bond rating prediction. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 847–854. Morgan Kaufmann, San Mateo, CA. Mosteller, F., and Tukey, J. W. 1968. Data analysis, including statistics. In Handbook of Social Psychology, G. Lindzey and E. Aronson, eds., Vol. 2. AddisonWesley, Reading, MA. Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 107–115. Morgan Kaufmann, San Mateo, CA. Murray, A. F., and Edwards, P. J. 1993. Synaptic weight noise during MLP learning enhances fault-tolerance, generalisation and learning trajectory. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 491–498. Morgan Kaufmann, San Mateo, CA. Neti, C., Schneider, M. H., and Young, E. D. 1992. Maximally fault tolerant neural networks. In IEEE Transactions on Neural Networks 3, pp. 14–23. Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight sharing. Neural Computation 4, 173–193. Opper, M., and Haussler, D. 1991. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, San Mateo, CA. Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1), 147–160. Pearlmutter, B. A., and Rosenfeld, R. 1991. Chaitin-Kolmogorov complexity and generalization in neural networks. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 925– 931. Morgan Kaufmann, San Mateo, CA. Refenes, A. N., Francis, G., and Zapranis, A. D. 1994. Stock performance modeling using neural networks: A comparative study with regression models. Neural Networks 7, 375–388. ¨ Rehkugler, H., and Poddig, T. 1990. Statistische Methoden versus Kunstliche Neuronale Netzwerke zur Aktienkursprognose. Tech. Rep. 73, University Bamberg, Fakult¨at Sozial- und Wirtschaftswissenschaften. Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465–471.
42
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Schmidhuber, J. H. 1994a. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. Tech. Rep. FKI-194-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Munchen. ¨ Short version in Machine Learning: Proceedings of the Twelfth International Conference, A. Prieditis and S. Russell, eds., pp. 488-496, Morgan Kaufmann, San Mateo, 1995. Schmidhuber, J. H. 1994b. On learning how to learn learning strategies. Tech. Rep. FKI-198-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Munchen. ¨ Schmidhuber, J. 1996. A general method for multi-agent learning and incremental self-improvement in unrestricted environments. In Evolutionary Computation: Theory and Applications, X. Yao, ed., Scientific Publ. Co., Singapore. Shannon, C. E. 1948. A mathematical theory of communication (parts I and II). Bell System Technical Journal 27, 379–423. Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. Roy. Stat. Soc. 36, 111–147. Vapnik, V. 1992. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 831–838. Morgan Kaufmann, San Mateo, CA. Wallace, C. S., and Boulton, D. M. 1968. An information theoretic measure for classification. Computer Journal 11(2), 185–194. Wang, C., Venkatesh, S. S., and Judd, J. S. 1994. Optimal stopping and effective machine complexity in learning. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 303–310. Morgan Kaufmann, San Mateo, CA. Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875–882. Morgan Kaufmann, San Mateo, CA. White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Computation 1(4), 425–464. Williams, P. M. 1994. Bayesian regularisation and pruning using a Laplace prior. Tech. Rep., School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton. Wolpert, D. H. 1994a. Bayesian backpropagation over i-o functions rather than weights. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 200–207. Morgan Kaufmann, San Mateo, CA. Wolpert, D. H. 1994b. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. Tech. Rep. SFI-TR-03-123, Santa Fe Institute, NM.
Received January 5, 1995; accepted March 19, 1996.
NOTE
Communicated by Raphael Ritz
Lyapunov Functions for Neural Nets with Nondifferentiable Input-Output Characteristics Jianfeng Feng Mathematisches Institut, University of Munich, D-80333 Munich, Germany and Laboratory of Biomathematics, The Babraham Institute, Cambridge CB2 4AT, UK
I construct Lyapunov functions for asynchronous dynamics and synchronous dynamics of neural networks with nondifferentiable input-output characteristics.
1 Introduction A central concern in designing a parallel, distributed neural network is the global stability of the collective network state when many parts of the system are free to change their state simultaneously (Albeverio et al. 1995; Golden 1993; Goles and Martinez 1990; Grossberg 1988; Hopfield 1982, 1984; Herz and Marcus 1993; Marcus and Westervelt 1989). As an example, attractor networks configured to store fixed patterns may be globally stable when the states of single “neurons” are updated one at a time—forgoing parallelism altogether—but may possess a large number of spurious oscillatory modes when all neurons are updated in parallel. In particular, all of the strong global stability results for sequential or distributed updating are constrained by the requirement that the input-output characteristic f of a network is an increasing differentiable sigmoid function. However, there are many models of neural networks using a nondifferentiable characteristic f . Such models include Linsker’s model and the Brain-State-in-a-Box (BSB) model (Feng et al. 1995; Feng et al. 1996; Feng and Tirozzi 1995, 1996; Hui and Zak 1992; Linsker 1986). This note reports that a Lyapunov function for neural networks with asynchronous or synchronous dynamics but nondifferentiable input-output characteristics takes the same form as in the differentiable characteristic case. The main idea in my rigorous proof is that instead of employing the Taylor expansion, which requires the differentiability of the input-output characteristic, I use the Legendre-Fenchel transformation (Goles and Martinez 1990). The conclusions for Lyapunov functions can be extended to the same model with distributed dynamics (Herz and Marcus 1993; Marcus and Westervelt 1989). Neural Computation 9, 43–49 (1997)
c 1997 Massachusetts Institute of Technology °
44
Jianfeng Feng
2 Asynchronous Dynamics An asynchronous dynamics is a composition of two dynamics. First we P select a neuron i from (1, . . . , N) with probability pi > 0, i pi = 1 and the state Xi (t) of the ith neuron at time t is updated to the new state according to N X aij rj Xj (t) + bi Xi (t + 1) = f j=1
but keep all other states unchanged: Xj (t + 1) = Xj (t), j 6= i. Here A = (aij , i, j = 1, . . . , N) is a symmetric N × N matrix, ri ≥ 0, i = 1, . . . , N and f (x) a continuous function that is strictly increasing if α ≤ x ≤ β and f (x) = α if x ≤ α, f (x) = β if x ≥ β. Note that f is not differentiable, and this is the difficulty for proving the global stability of the asynchronous dynamics. Without loss of generality we suppose that α < 0 = f (0) < β. A Lyapunov function for the stochastic process X(t) = (Xi (t), i = 1, . . . , N) is called a supermartingale (Chow and Teicher 1988), defined as follows: Definition 1.
A stochastic process L(X(t)) is called a supermartingale if
E(L(X(t + 1)) | Ft ) ≤ L(X(t))
(2.1)
where L is a measurable function, Ft = σ (X(1), . . . , X(t)) the sigma algebra generated by X(s), s = 1, . . . , t and E(· | Ft ) is the conditional expectation with respect to the sigma algebra Ft . The meaning of the definition of a supermartingale is clear. In average, that is, E(L(X(t + 1)) | Ft ) the function L decreases with respect to its dynamic evolution. Certainly when the dynamics is deterministic, equation 2.1 reduces to the condition of a usual Lyapunov function L(X(t + 1)) ≤ L(X(t)). Define L(X(t)) =
N Z X j=1
Xj (t) 0
rj f −1 (y) dy −
N N X 1X aji Xj (t)Xi (t)rj ri − rj bj Xj (t). 2 j,i=1 j=1
Notice that when f is differentiable, L coincides with the Lyapunov function used by Marcus and Westervelt (1989) to study the time evolution of iterated map networks, with the function in Herz and Marcus (1993) for distributed dynamics, and with Hopfield’s energy function for binary neurons and continuous neurons (Feng and Tirozzi 1995; Grossberg 1988; Hopfield 1982, 1984). Here another question arises: Does the asynchronous dynamics reach the set of attractors within finite time? In other words, is the system
Lyapunov Functions for Neural Nets
45
dissipative? For this purpose we introduce τ (²) := inf{t : ρ(X(t), A) ≤ ²}, which is the first time for the process X(t) to enter the ²-neighborhood of the attractors A of the asynchronous dynamics X(t), where ρ is the Euclidean distance on RN . Theorem 1. L(X(t)) is a supermartingale and ∀² > 0, τ (²) < ∞ provided that aii ≥ 0, i = 1, . . . , N. Proof. In terms of the definition of the asynchronous dynamics, which only changes state of neuron i with probability pi at a time step, we have E(L(X(t + 1)) | Ft ) =
1 1X aij ri rj Xi (t)Xj (t) − akk rk2 Xk2 (t + 1) 2 2 k=1 i,j6=k X Z Xi (t) X ri bi Xi (t) + ri f −1 (y) dy − N X
pk −
i6=k
i6=k
0
− bk rk Xk (t + 1) N X − aik ri rk Xk (t + 1)Xi (t) i=1 i6=k
Z
#
Xk (t+1)
+
rk f
0
−1
(y) dy .
(2.2)
Therefore E(L(X(t + 1)) | Ft ) − L(X(t)) N N X 1 X pk − ajk rj rk (Xk (t + 1) − Xk (t))Xj (t) − rk2 akk (Xk2 (t + 1) − Xk2 (t)) = 2 j=1 k=1 j6=k
Z
Xk (t+1)
+ 0
rk f −1 (y) dy −
Z 0
Xk (t)
rk f −1 (y) dy − bk rk Xk (t + 1) + bk rk Xk (t)
N N X X 1 = pk − ajk rj rk (Xk (t + 1) − Xk (t))Xj (t) − rk2 akk (Xk (t + 1) − Xk (t))2 2 j=1 k=1 Z Xk (t+1) Z Xk (t) −1 −1 + rk f (y) dy − rk f (y) dy − bk rk Xk (t + 1) + bk rk Xk (t) . 0
0
46
Jianfeng Feng
Figure 1: An explanation of the Legendre-Fenchel transformation. It is easily RX RY RX R f (X) −1 f (y) dy. seen that XY = 0 f (x) dx + 0 f −1 (y) dy = 0 f (x) dx + 0
In terms of the Legendre-Fenchel transformation, which in our case is fairly straightforward (see Fig. 1), Z
Xk (t+1)
f
−1
Z (y) dy +
0
uk (t)
0
f (x) dx = uk (t)Xk (t + 1),
t ≥ 0,
where uk (t) :=
N X
akj rj rk Xj (t) + bk
j=1
we have that E(L(X(t + 1)) | Ft ) − L(X(t)) " Z Z N uk (t) X p k rk − f (x) dx + ≤ k=1
0
0
uk (t−1)
# f (x) dx + Xk (t)(uk (t) − uk (t − 1))
≤ 0 since the function f is a strictly increasing function. So the function L(X(t))
Lyapunov Functions for Neural Nets
47
is a Lyapunov function (supermartingale) of the dynamics X(t). Next we consider the second conclusion of the theorem. As long as X(t) 6∈ S² := {x; ρ(x, A) ≤ ², x ∈ RN }, g(X(t)) := E(L(X(t + 1) | Ft )) − L(X(t)) < 0. Hence there is a δ = δ(²) > 0 such that g(X(t)) < −δ for X(t) ∈ [α, β]N \ S² . Therefore the process M(t) = L(X(t)) + tδ is also a bounded supermartingale. According to Doob’s (Chow and Teicher 1988) theorem, M(τ ∧ t) is again a bounded supermartingale. From the convergence theorem for supermartingales (Chow and Teicher 1988), it follows that lim M(τ ∧ t) = M < ∞
t→∞
a.s.
Thus we conclude that lim M(τ ∧ t) = lim (L(X(τ ∧ t)) + δ · (τ ∧ t)) < ∞
t→∞
t→∞
a.s.
Note that L(X(τ ∧ t)) is itself a bounded supermartingale; therefore, limt→∞ (τ ∧ t) < ∞ a.s. which implies τ < ∞ a.s. 3 Synchronous Dynamics Consider the following synchronous dynamics: Y(t + 1) = F(AR(Y(t))0 + B), t = 0, . . . ,
(3.1)
where Y(t) = (Yi (t), i = 1, . . . , N) ∈ RN for (·)0 representing the transpose of a vector, R = (ri δij ) for δij = 1 if i = j and δij = 0 if i 6= j, ri ≥ 0, i, j = 1, . . . , N, B = (bi , i = 1, . . . , N) ∈ RN and F = ( f, . . . , f ). As one may expect the behavior of the synchronous dynamics defined by equation 3.1 is substantially different from that of the asynchronous dynamics. In the case of asynchronous dynamics, we know from Theorem 1 that the set of attractors of the dynamics are all fixed-point attractors. Here, however, in addition to stable fixed points, there are attractors of two-state limit cycles for the synchronous dynamics, which is the same situation as in the Hopfield model. Due to the compactness of the state space of the dynamics (equation 3.1) we assert that the synchronous dynamics (equation 3.1) is dissipative. The proof of the following theorem is omitted since it is similar to that of Theorem 1.
48
Jianfeng Feng
Theorem 2.
The function
V(Y(t)) = −
X
aij ri rj Yi (t)Yj (t + 1) −
i,j
+
XZ i
0
Yi (t)
ri f −1 (u) du +
X
ri bi (Yi (t) + Yi (t + 1))
XZ i
Yi (t+1)
ri f −1 (u) du
0
is a Lyapunov function of the synchronous dynamics. The set of all fixed-point attractors of the asynchronous or synchronous dynamics can be divided into two categories: saturated attractors and unsaturated attractors. It is reported that saturated attractors are most likely to be observed in numerical simulations in Linsker’s model and the BSB model and are investigated in Feng et al. (1995, 1996). 4 Conclusions I construct Lyapunov functions for a class of asynchronous and synchronous dynamics, which include many models in neural networks as a special case. For the two most commonly utilized dynamics of neural networks with discrete time, asynchronous and synchronous dynamics, my results indicate that in general it is impossible to define a Lyapunov function of the model under consideration depending on only one state. For the asynchronous dynamics, a restriction on the matrix or on the interaction between neurons is needed for the existence of a Lyapunov function, while there may exist two-state limit cycles for the synchronous dynamics. This note constructs Lyapunov functions of the asynchronous and synchronous dynamics with nondifferentiable characteristics and may help our understanding of the dynamic properties of related models. Acknowledgments This article was partially supported by the A. von Humboldt Foundation of Germany. I thank an anonymous referee for bringing the references Herz and Marcus (1993) and Marcus and Westervelt (1989) to my attention. References Albeverio, S., Feng, J., and Qian, M. 1995. Role of noises in neural networks. Phys. Rev. E. 52, 6593–6606. Chow, S. Y., and Teicher, H. 1988. Probability Theory. Springer-Verlag, New York. Feng, J., Pan, H., and Roychowdhury, V. P. 1995. A rigorous analysis of Linskertype Hebbian learning. In Advances in Neural Information Processing Systems
Lyapunov Functions for Neural Nets
49
7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 319–326. MIT Press, Cambridge, MA. Feng, J., Pan, H., and Roychowdhury, V. P. 1996. On neurodynamics with limiter function and Linsker’s developmental model. Neural Computation 8(5), 1003– 1019. Feng, J., and Tirozzi, B. 1995. The SLLN for the free-energy of the Hopfield and spin glass model. Helvetica Physica Acta 68, 365–379. Feng, J., and Tirozzi, B. 1996. An application of the saturated attractor analysis to three typical models. In Cybernetic and Systems ’96, R. Trappl, ed., pp. 1102– 1107. World Scientific, Singapore. Golden, R. M. 1993. Stability and optimization analyses of the generalized BrainState-in-a-Box neural network model. Journal of Mathematical Psychology 37, 282–298. Goles, E., and Martinez, S. 1990. Neural and Automata Networks. Kluwer, Dordrecht. Grossberg, S. 1988. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks 1, 17–61. Herz, A. V. M., and Marcus, C. M. 1993. Distributed dynamics in neural networks. Phys. Rev. E 47, 2155–2161. Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2554–2558. Hopfield, J. J. 1984. Neurons with graded response have computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092. Hui, S., and Zak, S. H. 1992. Dynamical analysis of the Brain-State-in-a-Box (BSB) neural models. IEEE Transactions on Neural Networks 3, 86–94. Linsker, R. 1986. From basic network principle to neural architecture (series). Proc. Natl. Acad. Sci. USA 83, 7508–7512, 8390–8394, 8779–8783. Marcus, C. M., and Westervelt, R. M. 1989. Dynamics of iterated map networks. Phys. Rev. A 40, 501–504.
Received October 23, 1995; accepted April 9, 1996.
Communicated by George Gerstein
Detecting Synchronous Cell Assemblies with Limited Data and Overlapping Assemblies Gary Strangman Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912 USA
Two statistical methods—cross-correlation (Moore et al. 1966) and gravity clustering (Gerstein et al. 1985)—were evaluated for their ability to detect synchronous cell assemblies from simulated spike train data. The two methods were first analyzed for their temporal sensitivity to synchronous cell assemblies. The presented approach places a lower bound on the amount of data required to detect a synchronous assembly. On average, both methods required the same minimum amount of recording time to detect significant pairwise correlations, but the gravity method exhibited less variance in the recording time. The precise length of recording depends on the consistency with which a neuron fires synchronously with the assembly but was independent of the assembly firing rate. Next, the statistical methods were tested with respect to their ability to differentiate two distinct assemblies that overlapped in time and space. Both statistics could adequately differentiate two overlapping synchronous assemblies. For cross-correlation, this ability deteriorates quickly when considering three or more simultaneously active, overlapping assemblies, whereas the gravity method should be more flexible in this regard. The work demonstrates the difficulty of detecting assembly phenomena from simultaneous neuronal recordings. Other statistical methods and the detection of other types of assemblies are also discussed.
1 Introduction Although electrophysiological recordings have been a part of neuroscience research for over a century (Caton 1875), only in the past 20 years has there been considerable technological growth in multiple simultaneous electrode recordings (Pickard and Welberry 1976; Gross et al. 1977; McNaughton et al. 1983; Wilson and McNaughton 1993; Nicolelis et al. 1993; Nordhausen et al. 1994). Recently it has even become feasible to record individual spike trains from over 100 neurons simultaneously (e.g., Wilson and McNaughton 1993; Falk et al. 1993). Such technological advances do more than simply facilitate data collection. With the data from many simultaneously recorded neurons, one can begin to ask so-called functional questions about the brain. Neural Computation 9, 51–76 (1997)
c 1997 Massachusetts Institute of Technology °
52
Gary Strangman
One critical functional question stems from Barlow (1972, 1992), who postulates that each cell is an independent representational unit. In contrast with this view, several other theorists (including James 1890; Hebb 1949; Palm 1982; Edelman 1987) have predicted that neurons depend on one another for representation; that is, their responses are organized into functional assemblies. Evidence is beginning to surface in favor of the latter hypothesis—that is, assembly-based neuronal organization. When Georgopoulos et al. (1986) recorded from monkeys performing reaching tasks, they found data consistent with arm movement direction being encoded across populations of direction-selective neurons. Using information theory, Gochin et al. (1994) provide evidence that neurons in inferotemporal cortex also use some type of distributed (i.e., assembly-based) coding scheme, in this case for visually presented shapes. Data in Nicolelis et al. (1993) suggest that distributed representations exist for whiskers in the rat somatosensory thalamus as well. And it has even been suggested that neural codes in monkey frontal cortex can be found only by investigating synchrony across ensembles of neurons (Vaadia et al. 1995). Though these results support the existence of cell assemblies, in almost all cases it remains to be demonstrated precisely how groups of neurons work together to form these functional units. That is, what particular type of distributed code or cell assembly is implemented in the brain? Although this question is unresolved, many hypotheses have been introduced. Perhaps the most straightforward distributed code is synchrony, where neurons of an assembly fire synchronized volleys of spikes. In this case, assembly membership is revealed as temporally correlated neural discharges. It is worth dwelling briefly on the synchronous assembly hypothesis because such assemblies will form the basis for the analysis presented in this chapter. The simplest possible synchronous cell assembly is one where all assembly-cell spike trains are identical; that is, every neuron in the assembly fires at precisely the same time as every other neuron in the assembly. For a slightly more general synchronous cell assembly, relax the temporal precision constraint somewhat and thus introduce temporal jitter into otherwise identical spike trains. An even more general definition also relaxes the identity constraint. Thus, individual neurons are allowed to discharge at times when the assembly as a whole is silent (e.g., random background spikes in the spike trains) and/or neurons can “skip” some proportion of assembly discharge times (a measure of consistency). As an alternative to synchronous assemblies, it is possible that phaselocked oscillations in neuronal firing could bind active neurons together into assemblies. Eckhorn et al. (1988) and Gray et al. (1989) have suggested this, among others. More complex assembly definitions are also possible, including lateral inhibition networks, local feedback loops (Kubie 1930), attractor networks (e.g., Anderson et al. 1977; Hopfield 1982), or other spatiotemporal activation patterns (e.g., synfire chains; Abeles 1982, 1991).
Detecting Synchronous Cell Assemblies
53
Assuming that one has hypothesized a particular assembly type and provided that adequate numbers of neurons have been sampled (Strangman 1996), the following question remains: What are the appropriate analytical tools to test whether the hypothesized assembly is present in the data? Although many tools exist, few have been evaluated for their abilities to find assembly phenomena. The goal of this paper is to evaluate two statistics toward this end. Before selecting such tools, however, an assembly type must be established because the assembly type determines the relevant parameters for statistical analysis. Synchronous assemblies were the chosen type of functional assembly for two main reasons: (1) synchrony is commonly suggested as an important parameter of brain function, and (2) synchronous assemblies are one of the few types of functional units that require no additional a priori knowledge about the neurons. Assembly hypotheses like local feedback loops, miniature classifier networks, cellular automata, or even simple lateral inhibition all require some knowledge about the role of each recorded neuron in the computation (inhibiting versus inhibited, etc.) in order to classify correctly neurons as belonging to one or another active assembly. Any analysis of this sort requires single-unit spike train data known a priori to contain an assembly. This precludes the use of electrophysiological data, and therefore the required data were generated by a computer model. The synchronous assemblies (defined earlier) were modeled with the assembly jitter controlled by a parameter ε and the firing consistency controlled by the parameter κ (see Section 2.1). These simulated spike trains then formed the input to the statistical methods. The goal of this work was to compare two different statistical techniques for their abilities to detect assembly phenomena. Cross-correlation (Moore et al. 1966) and the gravity method (Gerstein et al. 1985) were chosen because synchronous assemblies are defined by coincident spiking, and these two statistics are fundamentally coincidence detectors. The two methods were evaluated for their temporal sensitivity to assembly activity and for their ability to differentiate two simultaneously active assemblies present in the same recorded region. 2 Simulation Data and Analysis Procedures 2.1 Simulated Synchronous Assemblies. Eighty spike trains were generated each time cell assembly data were needed. The simulated neurons were given properties that roughly correspond to pyramidal cells of motor cortex, though the results are not specific to any particular brain region. As will be shown, the implemented firing rates are immaterial to the analysis. The generation method is similar to that of Aertsen and Gerstein (1985). Each binary unit fired at an average background rate randomly selected between 4 and 10 Hz. The precise firing times were modeled as independent Poisson processes (equal probability of a spike at every time step) modified
54
Gary Strangman
A
B
C
D
Figure 1: Illustration of the spike train generation process for assembly units. A source spike train is generated (A), and then spikes are selected with a probability κ. Spike train (B) shows one possible result of this process for κ = 0.4. These spikes are then added to the spike train generated by the background firing rate of a unit (C), resulting in the final spike train (D). Any jitter would be incorporated after creating spike train (B) by moving each spike to the left or right by no more than ε msec.
by a 3 msec absolute refractory period following each spike. This completely describes the nature of spike trains for units uninvolved in assembly activity. For simulated units participating in assembly activity, an additional procedure was implemented (see Fig. 1). First, a source spike train was generated. This spike train also had a Poisson distribution of spike times and a 3 msec absolute refractory period but fired at an average rate, %0 , between 150 Hz and 200 Hz (randomly selected for each assembly). During assembly activity, each source spike was added to a given unit’s spike train with a probability κ. After the addition, jitter was introduced by moving each added spike slightly ahead or behind its original time of occurrence, t. This was done by adding a random value in the range [−ε, ε] to the original source spike time t, and thus ε controlled the amount of jitter in the synchrony. Finally, any spikes occurring during the 3 msec absolute refractory period of another spike were eliminated as a cleanup process. This results in a synchronous assembly unit firing at an average rate of κ%0 Hz plus the unit’s background firing rate. A high source-unit firing rate was used so that κ times this rate was still considerably larger than the 4–10 Hz base firing rates even when κ = 0.2. At values of κ near 1, the units fire at unrealistically high rates (up to 150–200 Hz). However, as will be reported in Section 3, (and derived in Appendix A), the minimum required recording time is—somewhat unintuitively—independent of the assembly firing rate, making the realism of these rates immaterial.
When two assemblies were needed for analysis, two independent source spike trains were computed. Units participating in both assemblies (50% of the simulated neurons) received the proportion κ of spikes from both source trains. This was accomplished by applying the addition, jitter, and cleanup process twice—one time for each source spike train. All resulting spike trains contained additional internal structure only as occurred by chance. Again, 80 spike trains were generated per simulation, and the simulation time step was set at 0.2 msec. Data generation was performed on a VAX 6000-510 (Digital Equipment Corporation). Since the modeled spike trains depended heavily on the built-in random number generator, a battery of 10 statistical tests (Knuth 1981) was performed to ensure adequate randomness. All tests found the generator not significantly different from purely random (Student's t-test: p > 0.2 for all ten tests).

2.2 Statistical Methods. Many statistics have been developed for multiple spike train analysis. Many of these analyze pairs of neurons (e.g., Moore et al. 1966; Gerstein and Perkel 1969; Gerstein et al. 1989; Gochin et al. 1989; Aertsen et al. 1989; Aertsen and Gerstein 1991; Yamada et al. 1993). Some techniques compare trios of neurons (e.g., Perkel et al. 1975; Abeles et al. 1994) or compare all recorded neurons simultaneously (e.g., Gerstein et al. 1978, 1985; Aertsen et al. 1987; Bernard et al. 1993; Gochin et al. 1994; Nicolelis and Chapin 1994; Abeles et al. 1995). In this article, two statistical methods were chosen for comparative analysis. The first method selected was cross-correlation (Moore et al. 1966), which compares pairs of spike trains. If the two spike trains are considered as functions f and g, then the cross-correlation of these two functions can be computed as

Corλ(f, g) = ∫_{−∞}^{+∞} f(t + λ) g(t) dt,
where λ is called the lag of function f relative to function g and can be positive or negative. A cross-correlation histogram (or cross-correlogram, CCH) is made by using discrete function evaluation points, selecting a histogram bin width, and summing (instead of integrating) the product f(t + λ)g(t) into the bin covering the point λ. This process is repeated for all interesting values of λ. Cross-correlation was selected mainly because of its universal use in spike-train analysis. However, this method is also important because nearly all pairwise statistics are based on correlation and because rigorous study has provided it with adequate significance tests of the resulting cross-correlation histogram peaks (e.g., Perkel et al. 1967; Gerstein and Perkel 1972; Gochin et al. 1989). Computationally, each cross-correlogram requires O(N²) units of computing time (N being the number of spike trains analyzed), times the number of histograms that must be calculated, N(N − 1)/2, or O(N²). This makes the cross-correlation computations overall O(N⁴).
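As an illustration, the binned cross-correlogram just described can be computed as follows (a minimal Python sketch; the function name and parameters are illustrative, and no significance test is included):

    import numpy as np

    def cross_correlogram(f_times, g_times, bin_ms=2.0, max_lag_ms=50.0):
        """Binned cross-correlation histogram of f relative to g (times in msec)."""
        edges = np.arange(-max_lag_ms, max_lag_ms + bin_ms, bin_ms)
        counts = np.zeros(edges.size - 1)
        for t in g_times:
            lags = f_times - t              # lag lambda of each f spike
            counts += np.histogram(lags, bins=edges)[0]
        return edges, counts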
The second method selected was the gravity method (Gerstein et al. 1985), which analyzes all N simultaneously recorded neurons at once. This is accomplished in an N-dimensional space, where particles, one for each recorded neuron, originate at the corners of an N-dimensional hypercube (i.e., all N particles are initially equidistant). These particles then interact "gravitationally," where the attractive force between two particles is proportional to the "charge" on the particles. A particle's charge is instantaneously incremented whenever the neuron corresponding to that particle fires a spike, and the charge decays exponentially over time as controlled by a parameter, τ. Thus, for neurons that fire roughly synchronously, their corresponding particles will be charged simultaneously and therefore will attract one another. The result is a clustering of particles in the N-dimensional space, where clusters correspond to those groups of neurons that were simultaneously active. The method is adjusted to make the particle aggregation independent of average firing rate, to prevent net aggregation among independently firing neurons, and to eliminate attractive forces when particles reach a minimum distance. Such modifications are all described in Gerstein et al. (1985). The precise equations were as follows. For particle i at position xi(t) at time t, the position in the N-dimensional space at time t + h is calculated as

xi(t + h) = xi(t) + h σg [qi(t) − τ] fi(t),

where σg is the aggregation parameter, qi(t) is the charge on particle i at time t, τ is the average charge on all particles (equal to the charge-decay constant), and fi(t) is the force on particle i at time t. This force is given by

fi(t) = Σ_{j≠i} [qj(t) − τ] ri(j),
where ri(j) is the direction of the force on neuron i produced by neuron j and is given by ri(j) = [xj(t) − xi(t)]/sij(t), with sij(t) being the Euclidean distance between particles i and j at time t. Finally, the charge increment for neuron i, Δqi, had to be normalized for different mean firing rates among the various neurons. This is accomplished by making Δqi directly proportional to some measure of neuron i's interspike interval (ISI). Two measures were used: (1) Δqi was set equal to the cumulative mean ISI (as in Gerstein et al. 1985), or (2) Δqi was set equal to the most recent ISI. The gravity method was selected as the comparison statistic because of its unique representation (spatial instead of histogram based) and because it is designed to analyze an arbitrary number of spike trains simultaneously. Computationally, the gravity method requires O(N³) units of computing time and, like the mass of correlograms above, requires further analysis
of the clustered particles to determine the precise nature of any recorded assembly.

2.3 Temporal Sensitivity Analyses. To evaluate the temporal sensitivity of cross-correlation and the gravity method, a synchronous assembly with zero jitter (ε = 0 msec) was generated and then analyzed. Zero jitter results in perfectly synchronous assembly-unit spikes, which are easiest for the methods to detect. Thus, the analysis should place a lower bound on the temporal capabilities of each method at each level of consistency, κ. Three seconds of data were simulated for 80 spike trains, 60 of which formed a single synchronous assembly. This was done for each of five values of κ. The resulting data were first analyzed by cross-correlations using successively longer portions of the synchronous data as input. The lengths of data analyzed were 25, 50, 100, 200, 300, ..., 3000 msec. Twenty pairs of spike trains were selected at random for cross-correlation analysis, each pair analyzed using each length of data (up to a maximum of 32 correlograms per pair of neurons, or 6400 correlograms total). Histograms were made using a 2 msec bin width and were tested for a significant central peak using the smoothed-difference cross-correlogram (Gochin et al. 1989) at the 99% confidence level. This particular test avoids the large bin variance generated by subtracting the standard shift-predictor histogram (Perkel et al. 1967). When the synchronous peak remained statistically significant for three consecutive data lengths, the time required to obtain the first significant peak was taken as the "empirical" cross-correlation minimum recording time to reach significance. This represented the minimum recording time required to detect significant assembly membership, Tmin.

The temporal sensitivity analysis of the gravity method was performed on the same simulated data. The analyzed particle pairs were the same 20 neuron pairs used in the cross-correlograms, and the Euclidean distance between pairs of particles in the N-dimensional space was measured. In the absence of any rigorous statistical test for particle clusters, gravity Tmin was simply defined as the time it took for the particles to approach within five units of one another, given a starting position 100 units apart (with an absolute minimum particle separation of one unit). Implementing the gravity method requires setting two parameter values: the aggregation parameter, σg, and the charge decay constant, τ. The value of τ is not critical as long as it is kept small relative to the average interspike interval. However, false-positive or false-negative clustering can result from too large or too small a σg, respectively. Since there was prior knowledge available about which simulated neurons were members of the assembly, σg was optimized to afford maximal aggregation speed (i.e., to minimize recording time without giving false-positive clustering). The optimal value for these analyses was σg = 8.0 × 10⁻⁶. In the usual experimental situation, lacking a priori knowledge of which cells are assembly cells, σg can be a liability of the gravity method.
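For concreteness, the gravity computation described in Section 2.2 might be implemented as follows (a minimal Python sketch of the update equations above, not the original implementation; parameter names such as sigma_g and min_sep are illustrative):

    import numpy as np

    def decay_and_increment(q, spiked, dq, h, tau):
        """Charge dynamics: exponential decay (constant tau) plus an
        ISI-normalized increment dq_i for each neuron that spiked."""
        return q * np.exp(-h / tau) + dq * spiked

    def gravity_step(x, q, sigma_g, h, min_sep=1.0):
        """One position update; x is (N, N) particle positions, q is (N,) charges."""
        excess = q - q.mean()                  # q_i(t) minus the average charge
        new_x = x.copy()
        for i in range(len(q)):
            force = np.zeros(x.shape[1])
            for j in range(len(q)):
                if j == i:
                    continue
                diff = x[j] - x[i]
                dist = np.linalg.norm(diff)
                if dist > min_sep:             # attraction stops at minimum distance
                    force += excess[j] * diff / dist   # [q_j - tau] * r_i(j)
            new_x[i] = x[i] + h * sigma_g * excess[i] * force
        return new_x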
Finally, Δqi was set to the cumulative mean ISI, because this reduces the noise in the resulting interparticle distance plots. This normalization exploits the change in firing rate when assemblies turn on, but it also amplifies the clustering effects of synchrony, maximizing the speed of low-noise clustering (for further discussion of normalization, see Gerstein and Aertsen 1985). Since the method of evaluating significant assembly correlation differed dramatically between the two methods (smoothed-difference cross-correlogram significance versus a Euclidean distance of five units), there was no expectation that the absolute values of cross-correlation and gravity Tmin would be directly comparable. Planned comparisons therefore included dependence on firing consistency, κ, and the relative variability of the two methods.

2.4 Analysis of Two Assemblies. Cross-correlation and gravity clustering were also compared for their ability to detect and separate units from two active synchronous assemblies. Each statistic had to correctly classify the simulated neurons according to their assembly of origin. Two overlapping synchronous assemblies were simulated, again with 80 spike trains. Spike trains for units 1–9 and 71–80 fired at their base rates, 10–19 fired at base rates plus spikes for the first assembly, 61–70 fired at base rates plus spikes for the second assembly, and 20–60 fired at base rates plus spikes from both assemblies. Jitter for both assemblies was set at ±1 msec (i.e., ε = 1 msec). Cross-correlation and gravity methods were implemented on the entire overlapping assembly data set, and the methods were compared with respect to assembly detection and neuron classification errors. Parameter settings were as follows: unit consistency, κ = 0.4; cross-correlogram bin width, Δ = 2 msec (i.e., 2ε); gravity coalescence parameter, σg = 8.0 × 10⁻⁵; and decay parameter, τ = 2 msec. The value of Δqi was set equal to the most recent ISI in neuron i to make the method most sensitive to synchrony and less so to changes in unit firing rates.

3 Results

3.1 Theoretical Temporal Sensitivity. The temporal sensitivity of cross-correlation to synchronous assemblies can be formally derived in a fashion parallel to Aertsen and Gerstein's (1985) analysis of effective connectivity. (Refer to Appendix A for the complete formal analysis.) The result is an expression for the minimum amount of recording time required to detect significant pairwise spike-time correlation, Tmin:

Tmin = 4Δ/κ²   (or Tmin = 4σ²/(κ²Δ), if σ > Δ),   (3.1)
where Δ is the cross-correlogram bin width, κ is the consistency with which the neurons fire in synchrony with the assembly, and σ is the width of
Table 1: Comparison of Minimum Recording Time Required to Detect a Perfectly Synchronous Assembly Using Cross-Correlation and Gravity Methods.

                            Simulated data:                           Simulated data:
           Theoretical      Cross-Correlation Tmin (msec)(a)          Gravity Tmin (msec)(b)
    κ      CCH Tmin (msec)  Mean ± s.d.   Range      95% limit        Mean ± s.d.   Range     95% limit
    0.2    200              760 ± 400     100–1800   1200             760 ± 170     364–946   907
    0.4    50               200 ± 120     25–400     400              194 ± 52      73–330    247
    0.6    22               91 ± 100      25–400     300              97 ± 36       32–152    152
    0.8    13               51 ± 23       25–100     100              53 ± 27       23–99     99
    1.0    8                28 ± 8        25–50      50               25 ± 6        13–32     32

(a) Correlogram bin width, Δ, was 2 msec.
(b) Gravity parameters: σg = 8.0 × 10⁻⁶, τ = 2 msec.
the synchrony window (equal to 2ε). The derivation assumes only that a neuron's firing rate when the assembly is active is much larger than background firing rates. (If this assumption is relaxed, the theoretical value of Tmin increases. Thus, Tmin truly represents the minimum amount of recording time.) From equation 3.1, the theoretical minimum recording time required to detect synchronous assembly correlation depends on Δ and 1/κ² and—perhaps unintuitively—is independent of the assembly firing rate ρ0. This independence can be understood as follows: although increased firing rates provide more synchronous events per second, such rates also result in a higher background correlation level (and thus more noise). It turns out that these two effects cancel as long as background firing rates are much less than assembly firing rates. However, consistency does matter for minimum recording times; that is, the consistency with which units of an assembly fire synchronously with the assembly strongly influences Tmin. The first two columns of Table 1 show the theoretical Tmin values calculated by equation 3.1 for a range of consistency values (κ), with Δ (= σ) = 2 msec.
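The theoretical column of Table 1 follows directly from equation 3.1; a two-line check (Python):

    delta = 2.0                                   # correlogram bin width, msec
    for kappa in (0.2, 0.4, 0.6, 0.8, 1.0):
        print(f"kappa = {kappa}: Tmin = {4 * delta / kappa**2:.1f} msec")
    # -> 200.0, 50.0, 22.2, 12.5, 8.0 msec (cf. Table 1, column 2)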
3.2 Empirical Temporal Sensitivity. To test these theoretical values, simulated synchronous assemblies were analyzed by both cross-correlation and gravity methods, using five values of κ. Three typical cross-correlograms and one gravity interparticle distance plot are shown in Figure 2 for κ = 0.4, all analyzing the same data set where ε = 0 (perfectly synchronous assembly cell spiking). The cross-correlograms show a single pair of simulated units correlated on 300, 400, and 500 msec of synchronized data, respectively. The central peak first becomes significant when the assembly has been active for 400 msec. The gravity plot shows three pairs of nonassembly units (top three traces), three pairs of assembly units (including
Figure 2: Temporal sensitivity of cross-correlation and gravity methods to a single synchronous assembly, κ = 0.4. Three cross-correlograms (Δ = 2 msec) are shown for a single pair of neurons, using progressively longer portions of the synchronous data (top = 300 msec, middle = 400 msec, bottom = 500 msec; 70 Hz average firing rate). The middle correlogram was the first significant central peak using the significance test in Gochin et al. (1989) at the 99% level. That is, the analyzed pair of simulated spike trains required 400 msec of data before their (known) correlation with the assembly was determined to be significant. (On average, for κ = 0.4, the neurons required approximately 200 msec; this pair had weaker average correlation, due to random variation around κ.) The gravity analysis of the same data is shown for comparison, using three assembly–assembly pairs, three assembly–nonassembly pairs, and three nonassembly–nonassembly pairs of neurons (σg = 8.0 × 10⁻⁶, τ = 2 msec). Though no generalized significance tests have been developed for interparticle distance measurements of the gravity method, the assembly cells are distinctly separable by approximately 100 msec.
the pair used for cross-correlations; bottom three traces), and three assembly–nonassembly pairs (middle three traces).

The mean Tmin for 20 assembly pairs using cross-correlation was greater than the theoretical value by a constant factor of approximately 4 at all levels of κ (see Table 1, column 3). The factor of 4 is attributed to (1) the significance criterion used (99% plus three consecutive significant correlograms), which is a much more stringent signal detection condition than used in Appendix A, and (2) the fact that the background firing rates were not strictly zero. Importantly, the predicted linear trend of Tmin versus 1/κ² was highly significant (relative to no trend; F(1, 95) = 156.9, p ≪ 0.0001). Despite this, a broad range of experimental Tmin values was observed, particularly at smaller values of κ (Table 1, column 4). Since the assembly cells were perfectly synchronous (no jitter; ε = 0), only the individual unit 4–10 Hz background firing rates and random variation around κ remained as variables. To determine the minimum recording time for neurophysiological experiments, one is not concerned with the average length of recording required to detect assembly cells; rather, one is interested in the length of recording required to detect, say, 95% of the assembly cells. Such a limit for the assemblies used here is presented in Table 1, column 5. The 95% recording-time boundary is typically two standard deviations above the average Tmin at all levels of κ.

The same set of spike trains was then analyzed by the gravity method. The resulting gravity Tmin values were almost identical with those obtained for cross-correlation methods, despite the extremely disparate methods for determining significant correlation. No significant differences were found in a two-tailed Wilcoxon test at any level of κ (p > 0.2), suggesting that gravity and cross-correlation are equally sensitive to synchronous assemblies in the temporal realm. Again, the 1/κ² trend was highly significant (F(1, 95) = 1099.5, p ≪ 0.0001). It is of note, however, that the variance was less pronounced in the gravity analysis than with cross-correlation. In addition, the 95% limit occurred at somewhat less than two standard deviations above the mean Tmin values. In practice, this means that gravity interparticle distance measurements required shorter recordings to detect 95% of significant assembly neuron correlations (Table 1, column 8). Optimal tuning of the gravity method (not usually possible) and the cumulative average ISI normalization procedure for Δqi (which was somewhat sensitive to the change in unit firing rates) slightly weaken the result that the gravity method is more sensitive to synchronous assemblies. A reasonable conclusion, therefore, is that the gravity method is no less sensitive to synchronous assemblies than cross-correlation.

3.3 Multiple Assemblies. Assuming that some kind of intermediate-level structures (cell assemblies) exist in the functioning nervous system, are there any consequences of more than one assembly being active at the same time? Positron emission tomography and functional magnetic
Figure 3: Ten of the 20 unique spatial arrangements of three cell assemblies. Examples 9 and 10 show two cases where the cells of one assembly (assembly A) must be spatially discontiguous to achieve the given geometric arrangements (i.e., assembly A must have at least two members).
resonance imaging scans have suggested that multiple regions of the cortex are active during overt or imagined behavior (e.g., Roland 1982; Grafton et al. 1991; Rao et al. 1993; Constable et al. 1993). If multiple cortical regions can be active at the same time, it is also possible that multiple assemblies are active at the same time in the same area of cortex (Vaadia et al. 1989). Indeed, many theories of brain function assume precisely this type of representation (e.g., Hebb 1949; Hinton et al. 1986), and in fact any distributed code used by the brain would be extremely inefficient if this were not the case. Thus, distributed neural coding and multiple assemblies together could result in individual neurons being simultaneously activated in association with two or more assemblies. The complications introduced by this possibility are substantial. If two assemblies are simultaneously active in the same general brain area, there are three distinct spatial arrangements: (1) the two assemblies are nonoverlapping (that is, no neuron participates in the activity of both assemblies), (2) the assemblies partially overlap, or (3) the neurons of one assembly form a subset of the neurons of the second assembly. Disambiguation of the two assemblies thus involves classifying four types of neurons: those not involved in either assembly, those involved in assembly A only, those involved in assembly B only, and those associated with both assemblies. If more than two assemblies are simultaneously active, it can be shown that the number of possible spatial arrangements increases faster than 2^(2^(n−1)), where n is the number of simultaneously active assemblies (Appendix B). In the case of three simultaneously active assemblies, there are at least 20 distinct geometrical arrangements to be differentiated (see Fig. 3); four assemblies can have a minimum of 266 distinct arrangements.
Assemblies exist in time as well as in space, and thus they can also differ in the temporal sequencing of their activity. Even if an assembly can activate at only three times (before, after, or at roughly the same time as another assembly), there are at least

n! + Σ_{i=1}^{n−1} n!(n − i)! / [(n − i − 1)!(i + 1)!]

permutations of assembly onset times. Again n is the number of simultaneously active assemblies. The same number of possibilities would exist for assembly offset times. Thus, for two assemblies (n = 2), there are three possible geometrical arrangements, and 3 × 3 possible temporal (on × off) arrangements. This results in some 27 ways in which the assemblies can interact in the spatiotemporal domain. Three assemblies have 3380 interaction possibilities, while four assemblies can interact in more than 1,266,426 spatiotemporally unique ways. A more finely divided temporal spectrum rapidly expands these possibilities. Thus, multiple active assemblies are capable of vast numbers of interactions; the interaction counts above can be reproduced with the short sketch below. It remains to be determined, however, whether this can cause problems in practice.
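Combining the onset-time formula with the minimum numbers of geometric arrangements (3, 20, and 266 for n = 2, 3, 4; see Appendix B) reproduces the totals quoted above (a short Python check):

    from math import factorial

    def onset_orders(n):
        """Permutations of n assembly onset times (before/after/simultaneous)."""
        total = factorial(n)
        for i in range(1, n):
            total += (factorial(n) * factorial(n - i)
                      // (factorial(n - i - 1) * factorial(i + 1)))
        return total

    geometric = {2: 3, 3: 20, 4: 266}        # minimum spatial arrangements
    for n in (2, 3, 4):
        print(n, geometric[n] * onset_orders(n) ** 2)
    # -> 27, 3380, and 1266426 spatiotemporal interaction possibilities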
Figure 4: A comparison of cross-correlation and gravity methods when detecting two asynchronously activated cell assemblies. Circles represent the assemblies, with schematic electrodes indicating the region from which a unit is sampled. Assembly A (solid circle) begins firing 700 msec before assembly B (dashed circle). For cross-correlation (second row) Δ = 2 msec, and for the gravity analysis (third row) σg = 8.0 × 10⁻⁵ and τ = 2 msec. See text.
To test this, cross-correlation and gravity methods were used to separate two partially overlapping (simulated) synchronous assemblies. In one analysis, the two simulated assemblies activated 700 msec apart, while in the second case both assemblies activated simultaneously. Figure 4 shows the analysis results when assembly A began firing at 1000 msec and assembly B activated at 1700 msec. Both assemblies continued to fire until the end of the simulated data (4000 msec). Cross-correlating a pair of simulated neurons from two different assemblies—using the first 3000 msec of the spike train—shows no significant peak (leftmost column). Cross-correlating pairs of neurons from the same assembly shows intermediate-level peaks (around 40 coincident spikes; middle four columns). Cross-correlating a pair of neurons participating in both assemblies—firing to the proportion κ of the spikes from assembly A plus a similar proportion from assembly B—resulted in a substantially larger peak (approximately 70 coincident spikes; rightmost column). Hence, assuming assembly effects are more or less additive, one can classify all four types of neurons: neuron pairs with the highest peaks belong to the intersection of the two assemblies; medium-height peaks indicate neuron pairs from the same group; and flat correlograms indicate that the two neurons are from different assemblies or that at least one neuron is a member of neither assembly. Given a method for determining relative peak-height significance (i.e., whether two significant peaks are different from one another) and careful bookkeeping, one can completely categorize all simulated neurons. Determining the relative temporal onsets of the two assemblies would require scrolling correlograms.

The gravity interparticle (Euclidean) distance plots for the same set of simulated spike trains are shown in the third row of Figure 4. Three neuron pairs are plotted in each graph to give a rough characterization of the variance. The first observation is that assembly A–assembly B neuron pairs aggregate substantially (Fig. 4, column 1), so there is no trivial separation of units from different assemblies. This occurs because units of the two assemblies are firing at approximately 70 Hz, and therefore a unit from A fires, on average, within 7.1 msec of any unit in B. Not shown is the fact that pairs of completely independent neurons (firing at 4–10 Hz) remain 100 units apart, while nonassembly–assembly pairs aggregate to approximately 75 units of separation by 4000 msec. This serves to separate pairs of units where at least one has no assembly affiliations. Pairs where at least one neuron is involved with assembly B alone (columns 3 and 5) start aggregating at 1700 msec (when assembly B activates). All other pairs begin aggregating at 1000 msec. By considering the onset times of aggregation and calculating every possible pairwise distance, it is difficult but possible to separate assembly B units as well as units involved in both assemblies ("AB units"). The remaining units are then classified as assembly A units. Finally, notice the aggregation artifact in column 3 at 840 msec. This occurred because the two units of that pair fired within 1 msec of one another and, prior to that event, had not fired for 630 and 372 msec, respectively. Thus both gained a large charge when they coincidentally fired in near synchrony. Such artifacts are partly a consequence of the rate normalization, whereby neurons that fire infrequently receive large charge increments.
Figure 5: A comparison of cross-correlation and gravity methods when detecting two synchronously activated cell assemblies. Circles represent the assemblies, with schematic electrodes indicating the region from which a unit is sampled. Both assemblies begin firing at 1000 msec. Again, for cross-correlation Δ = 2 msec and for gravity σg = 8.0 × 10⁻⁵ and τ = 2 msec. See text.
Such artifacts can be reduced, but not eliminated, by a smaller value for σg (slower aggregation) or a smaller τ (a narrower synchrony window).

Figure 5 shows the exact same analysis as in Figure 4 but performed on spike trains wherein assemblies A and B activate simultaneously at 1000 msec. Aside from the small (but significant) central cross-correlation peaks in columns 2 and 3, the results of this analysis are identical to those in Figure 4. This demonstrates cross-correlation's ability to separate the two assemblies and highlights its insensitivity to assembly onset times. Again, scrolling cross-correlograms could be used, but this would be cumbersome for 80 recorded neurons (3160 cross-correlograms per time step). The gravity analysis was less successful in this situation. First, there is considerable cross-assembly aggregation again (see Fig. 5, column 1), which prevents trivial separation of the two types of assembly units. Second, the analysis was considerably more affected by aggregation artifacts. When the two assemblies first activate, particles in both assemblies are near-synchronously charged, and these charges are large due to large prior interspike intervals. The result is sudden and dramatic particle aggregation among some assembly neurons (e.g., Fig. 5, columns 3 and 4). Again, one could decrease σg or τ, but this slows the aggregation process, making more recording necessary. At the parameter values used (σg = 8.0 × 10⁻⁵, τ = 2 msec), the two assemblies were not quite differentiable. Unlike the situation in Figure 4, here the two assemblies activated at the same time, and thus separations
based on the initiation time of aggregation are impossible. A smaller value for τ reduces this A–B aggregation but makes the method less sensitive to near-synchrony. If the assembly firing rates are reduced considerably (e.g., to 10–20 Hz, or κ = 0.06), the A–B pairs aggregate much less, and thus the assembly units can be classified into their respective assemblies by computing all pairwise distances. Such a rate reduction, however, was unnecessary for cross-correlation.

4 Discussion

In summary, detecting assembly activity in simultaneous neural recordings was more difficult than expected. First, cross-correlation and the gravity method may require several hundred milliseconds of assembly data in order to reveal significant correlations, even for relatively consistent assembly participation. Highly inconsistent participation may require neural recordings in excess of several seconds. Unfortunately, consistency is not a directly measurable quantity. Second, although cross-correlation and gravity were both able to differentiate units from two simultaneously active synchronous assemblies, separating three or more assemblies may well be impossible for cross-correlation. The gravity method appears somewhat more flexible in this regard. These points, as well as some of their consequences, will now be considered in more detail.

4.1 Temporal Sensitivity. The formal analysis of cross-correlated spike trains from synchronous assemblies (see Appendix A) showed that the minimum recording time to reach significance (Tmin) is strongly dependent on κ, the consistency with which a neuron participated in the assembly. Tests of cross-correlation on simulated data supported the theoretical predictions. No formal results were obtained for the minimum recording times required for the gravity method, though the near-identical empirical Tmin values suggest that gravity analyses will behave similarly to cross-correlation and thus be dependent on κ, not on assembly firing rates. Using equation 3.1, and thereby assuming that assembly firing rates are significantly elevated with respect to background rates, it is theoretically possible to predict the minimum recording time required to detect a synchronous assembly in a planned experiment. Prediction relies on an estimate of κ, however, which may be unavailable. In such a case, equation 3.1 can be solved for κ, and the recording time from a significant correlogram peak can be inserted for Tmin. This will give a value for κmin, which reflects the minimum consistency with which neurons must be participating in an assembly. This value, then, can be used as a reference when making further estimates of Tmin.
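Inverting equation 3.1 gives κmin = sqrt(4Δ/Tmin); a minimal sketch (Python, with an illustrative recording time):

    from math import sqrt

    def kappa_min(t_min_ms, delta_ms=2.0):
        """Minimum consistency implied by a peak first significant at t_min_ms."""
        return sqrt(4 * delta_ms / t_min_ms)

    print(kappa_min(400.0))   # -> 0.141...; units must participate with kappa >= ~0.14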
It is noteworthy that despite perfectly synchronous assembly data and finely tuned analysis parameters to capture only those neurons participating in the assembly, the minimum recording time to detect 95% of the assembly neuron pairs could be quite long. A minimum of 1200 msec of recording was necessary when κ = 0.2. There are two major consequences of long recording times. First, such long periods require either that the assembly remain active for the entire period or that multiple trials be run (so that the total assembly-active recording time equals the derived minimum recording time). Keeping the assembly operating is usually beyond experimental control. On the other hand, summing across multiple trials has its own, familiar drawbacks. Specifically, it assumes that each assembly is activated on essentially every trial and that assembly activity is stationary across time. Neither assumption is reasonable for experimental paradigms involving learning, and the first assumption may never be reasonable. The alternative to assuming that assemblies are consistently activated on successive trials is to select the data to be analyzed for assembly phenomena using some objective criteria. This approach can be successful, but it risks possible circularity by predicting effects in data that were selected for precisely those effects.

The second major consequence of long minimum recording times is that they make detection of certain types of assemblies impossible using these methods. A useful illustration is a wave of activity spreading across the cortex, such as that generated by a cortical seizure. Since this is a traveling wave, individual neurons participate in this event only briefly. From the point of view of a stationary microelectrode, perhaps only three or four spikes will be recorded as the wave of activity passes by. This may not be enough data to detect that a wave has passed at all. Traveling waves are usually considered pathological (Petsche and Sterc 1968), but it is certainly possible that analogous waves could form a functional part of the brain. Blum (1967), Walter (1968), Ermentrout and Cowan (1979), and others all suggested this possibility, and there is electroencephalogram evidence for traveling waves in the human cortex under normal conditions (Hughes 1995). This is not the only type of synchronous assembly that is undetectable by the correlation methods described here. For example, neurons activated in association with one (synchronous) assembly and then with another would also be only transiently correlated with any given group of cells. Both examples point to fundamental difficulties in investigating certain types of assembly behavior with these techniques. Other statistics, new or old, can avoid this limitation if they are able to exploit different parameters of assembly activity (e.g., Abeles and Gerstein 1988; Abeles et al. 1993, 1994).

4.2 Multiple Assemblies. As the number of active assemblies within a cortical area increases linearly, the number of possible spatiotemporal arrangements of those assemblies increases much faster than factorially. This huge number of possible spatiotemporal interactions is quite advantageous from the point of view of behavioral flexibility. At the same time, it is a serious problem for the neurophysiologist who is attempting to pin down the
spatiotemporal interaction actually in use. Cross-correlation can separate two overlapping synchronous assemblies, but this depends on a method for determining whether two independently significant correlogram peaks are significantly different from one another. More generally, when trying to separate neurons of n simultaneously active assemblies, n different correlogram peak heights will have to be differentiable. Currently no such method exists. Furthermore, in the typical experimental situation, the precise number of assemblies recorded from (i.e., the value of n) is not known in advance. In practice, these last two constraints make it unlikely that cross-correlation can differentiate three or more simultaneously active assemblies. Perhaps more problematic, cross-correlation scales very poorly. For N neurons, N(N − 1)/2 cross-correlograms must be calculated and compared. To obtain even minimal temporal resolution, cross-correlograms must be calculated using a scrolling time window. In this case, N(N − 1)/2 cross-correlograms must be calculated for each time step. It is apparent that cross-correlation quickly becomes untenable; for 80 neurons and 10 time steps (a resolution of 400 msec when using 4 sec of data), some 31,600 cross-correlograms must be evaluated and compared. The time problem, but not the large number of neuron pairs, can be avoided by using the joint peri-stimulus time histogram (JPSTH; Aertsen et al. 1989), which is closely related to cross-correlation. Using the JPSTH method requires only 3160 figures, but comparison of these figures becomes considerably more complicated.

The gravity method is somewhat more flexible than cross-correlation for multiple-assembly analysis. The nature of the method lends itself to determining relative assembly onset times. Furthermore, it is not hampered as cross-correlation is with regard to knowing the value of n and can indeed be applied regardless of the number of active assemblies. The simultaneous-onset, high-firing-rate assemblies represent a fairly extreme situation, so the difficulties of the gravity method there are of only limited consequence. In theory, the gravity method is limited by one's ability to analyze particle clusters in an N-dimensional space. Techniques for such analyses are plentiful. In practice, limitations on the gravity method center on the aggregation parameters and the firing-rate normalization procedure. As mentioned, large clustering artifacts can be minimized (but not eliminated) by shrinking σg or τ. This approach, however, makes the method less sensitive overall and thus requires more recording data. That σg is free to vary can also make it difficult to determine which clusters are significant. If rate normalization uses an average value for a neuron's firing rate (instead of the instantaneous rate), artifacts are spread over time but not eliminated. Large increases in neuronal firing rates and sudden changes from asynchronous to synchronous firing (both of which appeared here) are most likely to cause such artifacts, though artifacts can occur spontaneously (see Fig. 4, column 3).
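The 31,600 figure above is simple arithmetic (Python):

    n_neurons, n_steps = 80, 10
    pairs = n_neurons * (n_neurons - 1) // 2   # 3160 correlograms per time step
    print(pairs, pairs * n_steps)              # -> 3160 31600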
4.3 Other Assembly Types. The results here addressed only synchronous cell assemblies. However, there are many other plausible types of neuronal assemblies. Network-level lateral inhibition, attractor networks, and cellular automata are just a few from a wide array of alternative assembly definitions. Some assembly types may require the development of entirely new statistical tests or the application of existing techniques to a new domain. Ideally one wants to find and use the most sensitive statistic for uncovering such assemblies. Regrettably, there is no general solution to this problem. Instead, each candidate statistic must be evaluated with respect to its ability to detect each hypothesized assembly type. Techniques like principal components analysis (Nicolelis and Chapin 1994), firing space descriptions (Gerstein and Gochin 1992; Gochin et al. 1994), and JPSTH (Aertsen et al. 1989) are steps in this direction, but such techniques remain to be examined for their sensitivity to assembly phenomena. There is one final aspect of cell assembly phenomena that should be considered. It is possible that various types of functional cell assemblies are implemented simultaneously. For example, the cortex may subserve lateral inhibition networks and synchronous assemblies, with one type of assembly overlapping the other. This type of interaction is perhaps currently beyond the realm of parsimony but may need to be considered in the future. Critically, one would have to consider (1) what type(s) of assembly are present and (2) how these assembly types might interact and interfere with one another at the cellular level. Such issues are bound to become crucial as knowledge about assembly phenomena in the brain accumulates.
4.4 Conclusions. So how do we experimentally test the cell assembly hypotheses of Hebb, Abeles, and others? It would seem that four steps are necessary: (1) select an experimental paradigm to activate the hypothesized assemblies, (2) record from a sufficient number of cells in the investigated region (Strangman 1996), (3) select the statistical technique that is best able to detect the hypothesized type of assembly, and (4) record sufficient amounts of data to detect any existing assembly activity. This paper bears on the last two requirements. The results suggest that cross-correlation and the gravity method have approximately equal temporal sensitivities with regard to synchronous assemblies. The equations here can be used to help estimate the amount of data that needs to be collected. Furthermore, when dealing with multiple, overlapping assemblies, it appears that the gravity method is more flexible than cross-correlation. Ideally, one desires methods that take full advantage of the parallelism in such data. Knowing that the gravity method approaches this ideal more closely than cross-correlation may help bridge the gap between assembly-level theories of brain function and the simultaneous electrode recording data now becoming available.
Figure 6: Theoretical cross-correlation histogram for synchronous cell assemblies (adapted from Aertsen and Gerstein 1985). Here, d is the height (in spikes) of the cross-correlogram peak, σ is the width of this peak (in msec), e0 is the average correlation level for two independent neurons, e is the average correlation level when a synchronous assembly is active, and s is the noise in correlation level e.
Appendix A

The formal analysis presented here for cross-correlation closely follows the analysis of effective connectivity in cross-correlograms as given by Aertsen and Gerstein (1985). Refer to Figure 6 for notation as well as for the theoretical cross-correlation curve of two neurons active as part of the same synchronous assembly. In the absence of an active assembly, the two neurons being recorded will have an average base-level cross-correlation, e0, such that, ignoring statistical variation, e0 = ρ1 ρ2 T Δ, where ρ1 and ρ2 are the neuron background firing rates, T is the length of recording, and Δ is the cross-correlogram bin width (see Table 2 for a summary of all variables). If one neuron to be correlated is part of the active synchronous assembly and the other is a neuron firing at its background rate, the average correlation level, e, is e = ρ′1 ρ2 T Δ, where the prime indicates the (higher) firing rate of an assembly neuron when the assembly is operational. Similarly, the average correlation level for two assembly neurons is

e = ρ′1 ρ′2 T Δ = (ρ1 + κρ0)(ρ2 + κρ0) T Δ,   (A.1)

where ρ0 is the firing rate of the assembly as a whole. Assume, then, that all spikes defined as synchronous (i.e., firing within ±ε msec of one another)
Table 2: Variables.

    Variable   Definition
    d          Height of a cross-correlogram peak (in spikes)
    e0         Average correlation level for two independent neurons (in spikes)
    e          Average correlation level during cell assembly activity (in spikes)
    s          Noise associated with the correlation level e (in spikes)
    T          Length of spike train recording (in msec)
    Tmin       Minimum length of spike train recording (in msec) necessary to
               detect a synchronous assembly
    κ          Consistency parameter: probability that a unit fires when the
               assembly source unit discharges
    ρi         Average firing rate of nonassembly units or neurons (in spikes/sec);
               subscript 0 indicates average firing rate of the assembly source
               spike train
    ρ′i        Firing rate of an assembly unit or neuron, = ρi + κρ0 (in spikes/sec)
    ε          The synchrony jitter parameter (in msec)
    σ          Width of synchrony window from assembly firing (= 2ε, in msec)
    Δ          Cross-correlation histogram bin width (in msec)
    σg         The gravity aggregation parameter (Gerstein et al. 1985 uses σ)
    τ          The gravity charge decay parameter (in msec)
    Δqi        The gravitational charge increment at the time of a spike in neuron i
are contained in a single correlogram bin (so 2ε = σ ≤ Δ). For example, if ε = 2 msec (e.g., Abeles 1982, 1991), one would set Δ ≥ 4 msec. In this case d, the height of the synchronous correlogram peak, is given by

d = κ² ρ0 T,   if σ ≤ Δ;   otherwise,   d = κ² ρ0 T (Δ/σ).   (A.2)
Using a standard signal detection criterion,

|d| ≥ 2s,   (A.3)
where s is the magnitude of the noise (s = √e for a Poisson process). Taking equation A.3 and using e from equation A.1 and the first expression for d in equation A.2, one can solve for the recording time T, the time required to
detect significant correlation:

T ≥ 4Δ ρ′1 ρ′2 / (κ⁴ ρ0²) = 4Δ (ρ1 + κρ0)(ρ2 + κρ0) / (κ⁴ ρ0²)   when σ ≤ Δ.   (A.4)
This requires an estimate of the average firing rate of the assembly as a whole, ρ0. Assuming the base firing rates, ρ1 and ρ2, are small relative to κρ0, they drop out and T can be expressed as

Tmin = 4Δ/κ²   (or Tmin = 4σ²/(κ²Δ), if σ > Δ),   (A.5)
where greater-than-or-equal-to has been replaced with equals, and thus T is changed to the minimum required recording time to detect significant correlation, Tmin. So assuming that ρ1, ρ2 ≪ κρ0, it follows that Tmin is independent of the assembly firing rate. This follows from equations A.1 through A.3: the ratio of d to 2√e (the assembly detectability) is essentially unchanged by variations in ρ0, and this is precisely true when ρ1 = ρ2 = 0. (See Gerstein and Aertsen 1985 for related calculations regarding the gravity method.)
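This rate independence is easy to check numerically from equation A.4 (a Python sketch; the rates and parameter values are arbitrary illustrative choices):

    def t_required(rho0, kappa=0.4, delta=0.002, base=(5.0, 5.0)):
        """Recording time (sec) from equation A.4, for sigma <= delta."""
        numerator = 4 * delta * (base[0] + kappa * rho0) * (base[1] + kappa * rho0)
        return numerator / (kappa**4 * rho0**2)

    for rho0 in (50.0, 100.0, 200.0):
        print(round(1000 * t_required(rho0)), "msec")
    # -> 78, 63, 56 msec, approaching the rate-independent limit
    #    4*delta/kappa**2 = 50 msec as the base rates become negligible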
Appendix B

Counting unique assembly Venn diagrams depends on the constraints used to define "unique." The counting here was accomplished by a computer algorithm that evaluated vectors of binary numbers. The vectors were composed of 2ⁿ − 1 elements, where n is the number of assemblies and 2ⁿ − 1 is equal to the maximum possible number of nonempty assembly intersections. For three assemblies, the bits in the vector represented assembly intersections A, B, C, A∩B, A∩C, B∩C, and A∩B∩C, in this order (where A ∩ B̄ ∩ C̄ is written as A, and so on). Each "on" bit in a 2ⁿ − 1 element vector means the respective assembly intersection has more than zero members. Thus, 0100011 corresponds to the set {B, B∩C, A∩B∩C} and represents a nonempty assembly A being a proper subset of assembly C, which is in turn a proper subset of assembly B. All possible binary vectors of 2ⁿ − 1 elements were evaluated according to the most conservative Venn diagram counting procedure. This requires that four criteria be met. First, any arrangements differing only by how many cells (> 0), or which particular cells, appear in each region of the diagram are classified as the same arrangement. This is built into the vector representation above. Second, each assembly must exist in the final intersection group. For example, for three groups A, B, and C, the set {B, B∩C, A∩B∩C} is valid, while the set {A, B, A∩B} is not a valid arrangement for three assemblies (the latter implies that assembly C is empty, meaning that there are actually only two assemblies). Third, there must be at least n distinct intersections, where n is the number of active assemblies (i.e., {A∩B∩C} is not valid, because it represents three coincident assemblies). Thus, there must be at least n "on" bits in the vector. Fourth, one assembly must have neurons that are part of only that assembly (thus avoiding degenerate cases of two completely coincident assemblies). That is, at least one of the leftmost n bits in the vector must be on. Criteria 3 and 4 are of questionable value, and eliminating one or both slightly increases the number of unique geometric arrangements beyond the numbers reported here.

Acknowledgments

This article is based on work supported by a National Science Foundation Graduate Fellowship. I am indebted to J. A. Anderson for his insight into the importance of this problem. I also thank J. P. Donoghue, D. Ascher, and B. W. Connors for their helpful suggestions regarding earlier versions of this paper.

References

Abeles, M. 1982. Local Cortical Circuits: An Electrophysiological Study. Springer-Verlag, Berlin.
Abeles, M. 1991. Corticonics. Cambridge University Press, Cambridge.
Abeles, M., and Gerstein, G. L. 1988. Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol. 60, 909–924.
Abeles, M., Bergman, H., Margalit, E., and Vaadia, E. 1993. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol. 70(4), 1629–1638.
Abeles, M., Prut, Y., Bergman, H., and Vaadia, E. 1994. Synchronization in neuronal transmission and its importance for information processing. Prog. Brain Res. 102, 395–404.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., and Vaadia, E. 1995. Cortical activity flips among quasi-stationary states. Proc. Nat. Acad. Sci. 92, 8616–8620.
Aertsen, A. M. H. J., and Gerstein, G. L. 1985. Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Res. 340, 341–354.
Aertsen, A. M. H. J., and Gerstein, G. L. 1991. Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In Neuronal Cooperativity, J. Krüger, ed. Springer-Verlag, Berlin.
Aertsen, A. M. H. J., Bonhoeffer, T., and Krüger, J. 1987. Coherent activity in neuronal populations: Analysis and interpretation. In Physics of Cognitive Processes, E. R. Caianiello, ed. World Scientific, Singapore.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., and Palm, G. 1989. Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol. 61, 900–917.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. 1977. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psych. Rev. 84, 413–451.
Barlow, H. B. 1972. Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1, 371–394. Barlow, H. B. 1992. The biological role of neocortex. In Information Processing in the Cortex, A. Aertsen and V. Braitenberg, eds., pp. 53–80. Springer-Verlag, Berlin. Bernard, C., Axelrad, H., and Giraud, B. G. 1993. Effects of collateral inhibition in a model of the immature rat cerebellar cortex: Multineuron correlations. Cog. Brain Res. 1, 100–122. Blum, H. 1967. A new model of global brain function. Perspect. Biol. Med. 10, 381–408. Caton, R. 1875. The electric currents of the brain. British Med. J. 2, 278. Constable, R. T., McCarthy, G., Allison, T., Anderson, A. W., and Gore, J. C. 1993. Functional brain imaging at 1.5T using conventional gradient echo MR imaging techniques. Magn. Reson. Imaging 11, 451–459. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121–130. Edelman, G. M. 1987. Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books, New York. Ermentrout, G. B., and Cowan, J. D. 1979. Temporal oscillations in neuronal nets. J. Math. Biol. 7, 265–280. Falk, C. X., Wu, J. Y., Cohen, L. B., and Tang, A. C. 1993. Nonuniform expression of habituation in the activity of distinct classes of neurons in the Aplysia abdominal ganglion. J. Neurosci. 13(9), 4072–4081. Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416–1419. Gerstein, G. L., and Aertsen, A. M. H. J. 1985. Representation of cooperative firing activity among simultaneously recorded neurons. J. Neurophysiol. 54, 1513–1528. Gerstein, G. L., Bedenbaugh, P., and Aertsen, A. M. H. J. 1989. Neuronal assemblies. IEEE Trans. Biomed. Eng. 36, 4–14. Gerstein, G. L., and Gochin, P. M. 1992. Neuronal population coding and the elephant. In Information Processing in the Cortex, A. Aertsen and V. Braitenberg, eds., pp. 139–160. Springer-Verlag, Berlin. Gerstein, G. L., and Perkel, D. H. 1969. Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science 164, 828–830. Gerstein, G. L., and Perkel, D. H. 1972. Mutual temporal relationships among neuronal spike trains. Biophys. J. 12, 453–473. Gerstein, G. L., Perkel, D. H., and Subramanian, K. H. 1978. Identification of functionally related neural assemblies. Brain Res. 140, 43–62. Gerstein, G. L., Perkel, D. H., and Dayhoff, J. E. 1985. Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. J. Neurosci. 5, 881–889. Gochin, P. M., Kaltenbach, J. A., and Gerstein, G. L. 1989. Coordinated activity of neuron pairs in anesthetized rat dorsal cochlear nucleus. Brain Res. 497, 1–11.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., and Gross, C. G. 1994. Neural ensemble coding in inferior temporal cortex. J. Neurophysiol. 71, 2325–2337.
Grafton, S. T., Woods, R. P., Mazziotta, J. C., and Phelps, M. E. 1991. Somatotopic mapping of the primary motor cortex in humans: Activation studies with cerebral blood flow and positron emission tomography. J. Neurophysiol. 66, 735–743.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature 338, 334–337.
Gross, G. W., Rieske, E., Kreutzberg, G. W., and Meyer, A. 1977. A new fixed-array multi-microelectrode system designed for long-term monitoring of extracellular single unit neuronal activity in vitro. Neurosci. Lett. 6, 101–105.
Hebb, D. O. 1949. The Organization of Behavior. John Wiley, New York.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., vol. 1, pp. 77–109. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. 79, 2554–2558.
Hughes, J. R. 1995. The phenomenon of travelling waves: A review. Clin. Electroencephalogr. 26, 1–6.
James, W. 1892. Psychology (Briefer Course). Holt, New York.
Knuth, D. E. 1981. Seminumerical Algorithms. Addison-Wesley, Reading, MA.
Kubie, L. S. 1930. A theoretical application to some neurological problems of the properties of excitation waves which move in closed circuits. Brain 53, 166–177.
McNaughton, B. L., O'Keefe, J., and Barnes, C. A. 1983. The stereotrode: A new technique for simultaneous isolation of several single units in the central nervous system from multiple unit records. J. Neurosci. Methods 8, 391–397.
Moore, G. P., Perkel, D. H., and Segundo, J. P. 1966. Statistical analysis and functional interpretation of neuronal spike data. Ann. Rev. Physiol. 28, 493–522.
Nicolelis, M. A., and Chapin, J. K. 1994. Spatiotemporal structure of somatosensory responses of many-neuron ensembles in the rat ventral posterior medial nucleus of the thalamus. J. Neurosci. 14(6), 3511–3532.
Nicolelis, M. A., Lin, R. C. S., Woodward, D. J., and Chapin, J. K. 1993. Dynamic distributed properties of many-neuron ensembles in the ventral posterior medial thalamus of awake rats. Proc. Natl. Acad. Sci. USA 90, 2212–2216.
Nordhausen, C. T., Rousche, P. J., and Normann, R. A. 1994. Optimizing recording capabilities of the Utah Intracortical Electrode Array. Brain Res. 637, 27–36.
Palm, G. 1982. Neural Assemblies: An Alternative Approach to Artificial Intelligence. Springer-Verlag, Berlin.
Perkel, D. H., Gerstein, G. L., and Moore, G. P. 1967. Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J. 7, 419–440.
Perkel, D. H., Gerstein, G. L., Smith, M. S., and Tatton, W. G. 1975. Nerve impulse patterns: A quantitative display technique for three neurons. Brain Res. 100, 271–296.
Petsche, H., and Sterc, J. 1968. The significance of the cortex for the travelling phenomenon of brain waves. Electroenceph. Clin. Neurophysiol. 25, 11–22.
Pickard, R. S., and Welberry, J. R. 1976. Printed circuit microelectrodes and their application to honeybee brain. J. Exp. Biol. 64, 39–44.
Rao, S. M., Binder, J. R., Bandettini, P. A., Hammeke, T. A., Yetkin, F. Z., Jesmanowicz, A., Lisk, L. M., Morris, G. L., Mueller, W. M., Estkowski, L. D., Wong, E. C., Haughton, V. M., and Hyde, J. S. 1993. Functional magnetic resonance imaging of complex human movements. Neurology 43, 2311–2318.
Roland, P. E. 1982. Cortical regulation of selective attention in man: A regional cerebral blood flow study. J. Neurophysiol. 48, 1059–1078.
Strangman, G. 1996. Searching for cell assemblies: How many electrodes do I need? J. Comput. Neurosci. 3, 111–124.
Vaadia, E., Bergman, H., and Abeles, M. 1989. Neuronal activities related to higher brain functions—theoretical and experimental implications. IEEE Trans. Biomed. Eng. 36, 25–35.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., and Aertsen, A. 1995. Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature 373, 515–518.
Walter, W. G. 1968. Electric signs of expectancy and decision in the human brain. In Cybernetic Problems in Bionics, H. L. Oestreicher and D. R. Moore, eds., pp. 361–396. Gordon and Breach Science Publishers, New York.
Wilson, M. A., and McNaughton, B. L. 1993. Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1057.
Yamada, S., Nakashima, M., Matsumoto, K., and Shiono, S. 1993. Information theoretic analysis of action potential trains. Biol. Cybern. 68, 215–220.
Received June 26, 1995; accepted April 11, 1996.
Communicated by Bard Ermentrout
A Simple Neural Network Exhibiting Selective Activation of Neuronal Ensembles: From Winner-Take-All to Winners-Share-All Tomoki Fukai Department of Electronics, Tokai University, Kitakaname 1117, Hiratsuka, Kanagawa, Japan Laboratory for Neural Modeling, Frontier Research Program, RIKEN (Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-01, Japan
Shigeru Tanaka Laboratory for Neural Modeling, Frontier Research Program, RIKEN (Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-01, Japan
A neuroecological equation of the Lotka-Volterra type for mean firing rate is derived from the conventional membrane dynamics of a neural network with lateral inhibition and self-inhibition. Neural selection mechanisms employed by the competitive neural network receiving external inputs are studied with analytic and numerical calculations. A remarkable finding is that the strength of lateral inhibition relative to that of self-inhibition is crucial for determining the steady states of the network among three qualitatively different types of behavior. Equal strength of both types of inhibitory connections leads the network to the well-known winnertake-all behavior. If, however, the lateral inhibition is weaker than the self-inhibition, a certain number of neurons are activated in the steady states or the number of winners is in general more than one (the winnersshare-all behavior). On the other hand, if the self-inhibition is weaker than the lateral one, only one neuron is activated, but the winner is not necessarily the neuron receiving the largest input. It is suggested that our simple network model provides a mathematical basis for understanding neural selection mechanisms. 1 Introduction It is believed that information processing is performed by many neurons in a highly parallel distributed manner. For example, about half of the neocortex was shown to be involved in visual information processing in the macaque monkey (Van Essen et al. 1984). This implies that coactivation of a large population of neurons is a fundamental strategy employed by the Neural Computation 9, 77–97 (1997)
c 1997 Massachusetts Institute of Technology °
78
Tomoki Fukai and Shigeru Tanaka
brain in sensory information processing and motor control. In fact, there is evidence that the neuronal population coding is adopted in various regions of the brain, including the primary visual cortex (Gilbert and Wiesel 1990), the superior colliculus for saccadic eye movements (Wurtz et al. 1980; Van Opstal and Van Gisbergen 1989), the inferotemporal cortex for human face recognition of the monkey (Young and Yamane 1992), the motor cortex decoding arm movements of the monkey (Georgopoulos et al. 1993), and the CA1 region of rat hippocampus encoding place information (Wilson and McNaughton 1993). The question is how a particular population of neurons is selected for engagement in an information processing task. Furthermore, a similar neural selection seems to provide a functional basis for active information processing of the brain such as the planning and execution of voluntary movement. It seems that voluntary movement is initiated by a selective transformation of neural activities coded in the several motor cortices into temporal sequences of motor commands through the subcortical nuclei (Hikosaka 1991; Tamori and Tanaka 1993). Actually, nerve projections from basal ganglia, which receive inputs from the motor cortices, converge to parts of the thalamus (Alexander et al. 1986). This implies that the number of active neurons involved in voluntary movement is reduced in the course of generating temporal sequences of neural activities related to motor actions from spatially distributed information in the motor cortices. In other words, it is likely that the subcortical nuclei are involved in the selection of neural activities. Because similar circuits are observed in almost the entire cerebral cortex besides the motor cortices (Alexander et al. 1986), such neural selection might be common in the processing of temporal information as in language generation and logical thinking. In this sense, it is of importance to understand mechanisms of neural selection in order to build computational models for cognitive functions as well as motor functions. A rule for selection requires some competition in general. In order to model the neural competition that might be induced in the brain, we propose a simple model describing the competition of neural activities in terms of excitatory inputs and all-to-all recurrent inhibitory connections including self-inhibition. Competitive behavior of neural systems has been repeatedly reported in the literature (Grossberg 1973; Cohen and Grossberg 1983; Yuille and Grzywacz 1989; Coultrip et al. 1992; Ermentrout 1992; Kaski and Kohonen 1994). In this paper, particular attention is paid to qualitative and quantitative changes in the selection of winners as the strength of the lateral inhibition relative to that of the self-inhibition is varied. When the ratio of strength of the lateral inhibition to that of the self-inhibition is larger than unity, only one neuron is selected to be active. On the other hand, when the ratio is smaller than unity, a certain number of neurons are selected to be active. We examine properties of such “winners-share-all” solutions to the dynamic equation of neural activities, which is derived from the conventional equations of the membrane dynamics and the sigmoidal output
A Simple Neural Network Exhibiting Selective Activation
79
function in order to make analytical studies possible. Our equation can be classified as one type of the Lotka-Volterra equations that have been used for the description of ecological systems. Thus, we may say that temporal information is generated through selection by the ecological system of neural activities. In the literature, the K-winner-take-all behavior is known as the solution that admits K (≥ 1) winners for competition in inhibitory neural networks similar to the present one (Majani et al. 1989). However, in the K-winner networks, the solutions could be systematically studied only for homogeneous external inputs (i.e., when they are neuron independent), and their dynamic behavior with nonhomogeneous inputs is in general complicated (Wolfe et al. 1991). Hence particular attention was paid to the behavior in which winners were selected according to the initial conditions. On the other hand, we show that winners are always selected according to the magnitudes of external stimuli for arbitrary initial states when each neuron receives a sufficiently strong inhibitory feedback. Such competitive behavior implies that the neural network always evolves into an internal state well representing the structure of the external world. In this paper, we analyze the stable fixed-point solutions, which can be classified into the winners-share-all or winner-take-all type, to our dynamic equation. Simulated results are also shown for both types of solutions. Then, limiting ourselves to the winner-take-all solutions, we examine analytically the relaxation of one fixed point to another when the external inputs suddenly change. 2 Derivation of the Ecological Equation for Neurodynamics For the sake of later exact analysis of competitive network behavior, we derive a Lotka-Volterra equation for firing activity from the standard equation for the membrane dynamics. Such a Lotka-Volterra dynamics was first discussed by Cowan (1968, 1970). According to the conventional model for neural dynamics, the membrane potential and firing rate of the ith neuron are expressed as follows: dui = −λui + (input current), dt
(2.1)
zi = f (ui − θ), where f (ui − θ) represents a nonlinear output function. If we assume that this function is given by the logistic function f (x) = f0 [1 + exp(−βx)]−1 ,
(2.2)
80
Tomoki Fukai and Shigeru Tanaka
the dynamics of the firing rate can be expressed by µ · ¶ ¸ βzi ( f0 − zi ) f0 − zi λ dzi = −λθ + log + (input current) . (2.3) dt f0 β zi The input current is assumed to be given by afferent inputs and lateral connections: (input current) = γ + Wi −
X
Vii0 zi0 ,
(2.4)
i0
where γ represents the input current induced by the afferent inputs nonspecific to each cell, and Wi represents the input current generated by the specific afferent inputs. Vii0 is the strength of the lateral inhibitory connections. If the firing rate is always much smaller than its maximum value f0 , this equation can be simplified as " # ¶ µ X f0 − zi λ dzi = zi 1 + Wi − zi ( f0 − zi ) log , (2.5) Vii0 zi0 + dt β f0 (γ − λθ ) zi i0 by omitting the factor ( f0 − zi )/f0 in the first term without changing a qualitative feature of the dynamics. Here γ − λθ was assumed to be positive, and the time t, specific input current Wi , and lateral inhibition Vii0 were rescaled as β(γ − λθ)t → t, Wi /(γ − λθ ) → Wi and Vii0 /(γ − λθ ) → Vii0 , respectively. The second term, indicating that the nonlinear output function has two asymptotes, takes positive values for zi ¿ f0 , while it takes negative values for zi close to f0 . Note that the boundedness of the membrane potential in the original equation (2.1) is preserved owing to the presence of this term. Since our system has global inhibitory connections for which the divergence in activity never occurs, we do not need to retain the negativity of the term for zi close to f0 . Thus, for simplicity, the term can be replaced by a positive constant ε. As we will see later, this replacement does not change the qualitative behavior of the dynamics. Finally we obtain the following equation for neurodynamics: " # X dzi = zi 1 + Wi − Vii0 zi0 + ε, dt i0
(2.6)
which is a type of the Lotka-Volterra equation. In the following, we consider an N-neuron system that has self-inhibition as well as uniform lateral inhibition: ½ 1 i = i0 (2.7) Vii0 = k i 6= i0 ,
A Simple Neural Network Exhibiting Selective Activation
81
where k is a positive constant representing the relative strength of the lateral inhibitory connections to the self-inhibitory ones. The self-inhibition is introduced since biological neural networks often have local inhibitory interneurons that deliver feedback inhibition to the cells activating those interneurons (Shepherd 1990). The lateral inhibition is assumed to be uniform for the sake of mathematical simplicity. Thus equation 2.6 is written as X dzi = zi (1 − zi − k zj + Wi ) + ε, dt j6=i
i = 1, . . . , N
(2.8)
which yields the basic equation for the present analytic study of neurodynamics. 3 The Dynamic Properties and the Steady States of the Model The network model (see equation 2.8) with k = 1 was first derived for a selforganization model of primary visual cortex by one of the present authors (Tanaka 1990) to describe the competitive evolution of densities of synapses in local cortical regions. It is also possible to interpret the network model formally as a type of shunting inhibition model (Grossberg 1973; Cohen and Grossberg 1983). By using new variables xi = zi − µ with ε = (1 + Nk − k)µ2 , we can rewrite equation 2.8 as dxi = −(1 + Nk − k)µ(ε)xi + (1 + Wi − xi )(µ(ε) + xi ) dt X xj , − (xi + µ(ε))k
(3.1)
j6=i
which represents a shunting model with linear-response cells receiving cellnonspecific excitatory input µ(ε), positive feedback xi , and lateral inhibition P k j6=i xj . From equation 3.1, we can immediately see that each xi can fluctuate within a finite interval [−µ(ε), 1 + Wi ] if its initial states lie in the same interval. This implies that zi also fluctuates in a finite interval of positive values. An unusual feature of this shunting inhibition model is that the afferent input Wi appears in the cutoff factor for excitatory input, and accordingly the cutoff activity level is stimulus dependent. Also xi is usually interpreted as the membrane potential in the shunting models, whereas it may be interpreted as the firing rate in the present model. Cohen and Grossberg (1983) showed that nonlinear systems like equation 3.1 have global Lyapunov functions, which ensure absolute stability of dynamic evolution of the systems. In fact, by introducing zi = y2i , we can transform equation 2.8 into the following gradient dynamics in which energy function E is minimized with time: ∂E dyi =− , dt ∂yi
(3.2)
82
Tomoki Fukai and Shigeru Tanaka
with the energy function given by
E({yi }) = −
N X 1 + Wi i=1
4
y2i
à !2 N N N 1−k X k X εX 4 2 + yi + yi − log yi . (3.3) 8 i=1 8 i=1 2 i=1
This ensures the global stability of steady-state solutions to equation 2.8. The time evolution of the network state with k = 1 was reported to exhibit the winner-take-all (WTA) behavior in which activities of all cells except one receiving a maximum input are eliminated (Tanaka 1990). The other cases of 0 < k < 1 and k > 1 also show intriguing properties, which we call winners-share-all (WSA) and variant winner-take-all (VWTA), respectively. The WSA behavior admits more than one winner receiving inputs larger than some critical value determined by the distribution of strength of the inputs. The VWTA behavior admits only one winner, but it can be any cell receiving an input larger than some critical value. We first discuss the two cases of 0 < k < 1 and k > 1 since these cases are of primary concern in the present paper and can be dealt with in a similar manner. The case of k = 1 is also reviewed to complete the survey of the network dynamics. 3.1 WSA Case (0 < k < 1). We will derive an equilibrium solution, which represents the WSA behavior, to equation 2.8 when the strength of lateral inhibitory connections satisfies 0 < k < 1. We assume that ε and all Wi ’s are positive. It is also assumed that ε satisfies dε ¿ Wi for all i, with d being the number of winners in the steady state. We first examine solutions for ε = 0 and then analyze those for ε 6= 0 in a perturbative manner. We can explicitly study the stability of the solutions only for ε = 0. Therefore the results of the stability analysis hold only when ε is not too large. Studying this restricted case, however, should be sufficient for practical purposes since ε can be infinitesimally small in many applications of the present model. We can assume, without loss of generality, that external input Wi ’s obey the following inequality regarding their magnitudes: W1 ≥ W2 ≥ W3 ≥ · · · ≥ WN−1 ≥ WN ≥ 0.
(3.4)
A stable fixed-point solution {z(D) i } given by dzi /dt = 0 (i = 1 · · · N) for k 6= 1 and ε = 0 satisfies X j∈D
Kij zj(D) = 1 + Wi , z(D) = 0, i
i∈D (3.5) i∈ /D
A Simple Neural Network Exhibiting Selective Activation
83
where D = {1, 2, · · · , d} indicates a set of winners, and d × d matrix Kˆ = (Kij ) is given by 1 k Kˆ = . .. k
k 1 .. . ···
··· k k
k ··· .. . k
k k .. . . 1
(3.6)
When k 6= 1, Kˆ is not singular and has an inverse matrix, Kˆ −1 =
1 (k − 1)α
k−α k .. . k
k k−α .. . ···
··· k k
k ··· .. . k
k k .. . k−α
,
(3.7)
where α = dk + 1 − k > 0. Now defining ζi for i = 1, 2, . . . , N by Wi kd 1 + + hWiD , α 1 − k (k − 1)α 1X Wj , hWiD = d j∈D ζi =
(3.8) (3.9)
we obtain the stable fixed-point solution for k 6= 1 and ε = 0 as follows: = ζi δiD , z(D) i ½ 1 i∈D . δiD = 0 i∈ /D
(3.10)
The winners receiving larger inputs acquire larger values for nonvanishing z(D) i ’s as long as 0 < k < 1 is satisfied. Thus for this range of the magnitude of lateral inhibition, the network system exhibits the WSA behavior. Now we examine the stability conditions of solution 3.10. Substituting + δz(D) zi (t) = z(D) i i (t) into equation 3.8 and retaining the terms linear in the fluctuation δz(D) i (t), we obtain dδz(D) i dt
=
P −ζi (δz(D) +k δzj(D) ) + o(δz(D)2 ), i i j=1...N,j6=i
(1 −
k)ζi δz(D) i
+
o(δz(D)2 ). i
i∈D (3.11) i∈ /D
84
Tomoki Fukai and Shigeru Tanaka
Then the stability of the solution for d > 1 is determined by the eigenvalues of the following matrix:
−ζ1 −kζ2 .. . ˆ = −kζd M 0 . ..
−kζ1 −ζ2 ···
0
−kζ1 −kζ2 .. . −kζd ··· .. . ···
··· −kζ2 −ζd 0 .. . 0
··· ··· .. . −kζd (1 − k)ζd+1 .. . 0
··· ···
··· ···
−kζd 0 .. . ···
··· ··· ··· 0
−kζ1 −kζ2 .. . .(3.12) −kζd 0 .. . (1 − k)ζN
ˆ should be negative for the stability of the lowest-order All eigenvalues of M solution. This requires that ζi be positive for d winners and negative for (N − d) losers. Therefore the stability condition is consistent with the condition zD i > 0 (i ∈ D ) for the existence of the WSA solution. The condition can be expressed as 1 + Wmin >
kd (hWiD − Wmin ), 1−k
(3.13)
where Wmin is the smallest external input among Wi ’s for i ∈ D: Wmin = Wd by assumption 3.4. Since the right-hand side of equation 3.13 is an increasing function of d while the left-hand side is a decreasing function, there exists a critical upper bound dc above which condition 3.13 is not satisfied, or equivalently ζdc +1 , ζdc +2 , . . . , ζN < 0 (see Appendix A). This critical value gives the actual number of winners. Condition 3.13 indicates that the number of winners decreases as the relative strength k of the lateral inhibition approaches unity. The solution with no winner, d = 0, is forbidden even for k → 1− . From equations 3.8 and 3.12 with d = 0, it is found that all ζi = (1 − k)−1 (1 + Wi )’s must be negative in that case to ensure the stability of such a solution. This, however, never occurs for 0 < k < 1. Indeed, the network model exhibits the WTA behavior for k = 1, as shown later, and hence the dynamic behavior of the network continuously changes from WSA to WTA as k → 1− . From evolution equation 2.8 with ε = 0, we see that once some cells = 0, they can never become winners even when become losers with z(D) i the magnitudes of Wi ’s are changed to take a different order. Thus positive ε should be retained to ensure that the network undergoes transition in its activity pattern to adapt to changes in external inputs. The perturbative steady-state solution in the leading order of ε is easily obtained as = (ζi + εσi )δiD − ε z(D) i
1 (1 − δiD ) + O(ε 2 ), (1 − k)ζi
(3.14)
A Simple Neural Network Exhibiting Selective Activation
85
where σi =
X 1 (αζi−1 − k ζj−1 ). (1 − k)α j∈D
(3.15)
The stability condition at the zeroth order of ε (ζi > 0 for i ∈ D and ζi < 0 for ∈ / D) ensures that activities of losers are slightly raised from the zero level by positive ε. The averaged activity of the winners is also raised by P P (ε/d) j∈D σj = (ε/dα) j∈D ζj−1 . In Figure 1a, we show an example of the time course of the network state obtained for 0 < k < 1 by numerical simulations of the model network with 30 cells. Initial states at t = 0 were randomly selected in the interval [0, 1]. The inhibitory interactions reduce the net activity of the whole network system rapidly in the initial transient of the time evolution and then the surviving net activity is gradually distributed to winners according to the mechanism of the WSA. 3.2 VWTA Case (k > 1). For a stronger lateral inhibition of k > 1, the solution given by equation 3.14 is stable at the zeroth order of ε if the con/ D are obeyed. It is easy to see that ditions ζi < 0 for i ∈ D and ζi > 0 for i ∈ the number of winners cannot be more than one: It is not possible to satisfy the former condition for the winner zd receiving the smallest external input k P among the winners because αζd = 1 + Wd + k−1 ( j∈D Wj − dWd ) cannot be negative. Thus all the solutions with d > 1 are unstable. For d = 1, the stability analysis shows that all ζi ’s must be positive. Let cell a be the winner. Then ζa = 1 + Wa > 0 by the assumptions made previously, and furthermore, the condition k > 1 can make ζj = 1 + (kWa − Wj )/(k − 1) positive for ∀j 6= a even when Wa is not the largest among all inputs. Thus the network exhibits the WTA behavior in the sense that only a single cell can survive with a nonvanishing activity at equilibrium. This WTA behavior, however, is rather different from the one usually assumed: Any cell a other than the one receiving the largest input can be the winner, as long as it satisfies k(1 + Wa ) > (1 + Wj ),
(3.16)
for ∀j 6= a. In the time course of the network dynamics, a winner is selected from among those satisfying equation 3.16 according to an initial state given to the network at t = 0. Thus the results of selection in general depend on both external inputs and initial conditions. We call this type of WTA behavior the variant winner-take-all behavior to distinguish it from the conventional one in which the winner is always the cell receiving the largest input. In fact, behavior similar to VWTA has already been extensively discussed for a competitive shunting network (Grossberg and Levine 1975), and the socalled K-winner-take-all network (Majani et al. 1989; Wolfe et al. 1991), that
86
Tomoki Fukai and Shigeru Tanaka
Figure 1: The time courses of cell activities obtained by numerical simulations of the model network with N = 30 for (a) k = 0.85, (b) k = 1.06, and (c) k = 1. The distribution of {Wi } used in the simulations is shown in the form of a bar graph in (b). Several cells receiving the largest external inputs are numbered according to the magnitudes of the inputs.
A Simple Neural Network Exhibiting Selective Activation
87
Figure 2: The dynamical flow of the network state is shown for the network of two cells. W1 > W2 and k(1+W2 ) > 1+W1 . The filled circles represent attractors, while the empty one represents an unstable fixed point. A winner is selected according to initial states of time evolution under the operation of the VWTA mechanism.
is similar to the present one except that external inputs are often assumed to be neuron independent. Since our primary aim was to show that the special range (k < 1) of the ratio between the self-inhibition and uniform lateral inhibition achieved the neural activity selection in terms of the magnitudes of external inputs independent of initial conditions, we show only a few examples for the VWTA behavior without an extensive analysis of it. Figure 1b shows an example of the time evolution of zi (t)’s for the network exhibiting the VWTA behavior. The cell receiving the third-largest input defeated others and became a winner in this case. To obtain a better insight into the VWTA behavior, we show in Figure 2 the dynamical flow of the network state for N = 2. Both (1 + W1 , 0) and (0, 1 + W2 ) are attractors if k(1 + W2 ) > 1 + W1 is satisfied (i.e., the network is subjected to the VWTA behavior); otherwise, the former is the only attractor (i.e., it is subjected to the WTA behavior).
88
Tomoki Fukai and Shigeru Tanaka
3.3 WTA Case (k = 1). Now we discuss the behavior of the equation for k = 1, which is described by à ! N X dzi = zi 1 − zi0 + Wi + ε . (1 ≤ i ≤ N) . (3.17) dt i0 We can obtain fixed points by setting the time derivative dzi /dt equal to 0 in equation 3.17. Let us assume that Wi 6= Wi0 for any i and i0 such that i 6= i0 . By considering the first order in the perturbation expansion of ε, we obtain N fixed points, zE(n) (n = 1, . . . , N): X 1 1 δi,n z(n) = (1 + Wn )δi,n + ε − i 1 + Wn Wn − Wj j6=n +
ε(1 − δi,n ) + o(ε 2 ). Wn − Wi
(3.18)
Next, let us examine linear stability around a particular fixed point, zE(n) . (n) (n) When we let zi = z(n) i + δzi , we can derive a linearized equation for δzi from equation 2.1: P ε 2 δz(n) if i = n, −(1 + Wn ) δzj(n) − n + o(ε ), 1 + W n (n) j dδzi = (3.19) dt P (n) ε (n) 2 δzj + o(ε ), otherwise. −(Wn − Wi )δzi − W − W n i j Since the coefficient 1 + Wn is always positive, the fixed point under consideration is stable if Wn is the maximum of all Wi ’s. Thus we can say that neural activities play a survival game based on competitive exclusion, which finally leads to a winner-take-all state. Figure 1c shows an example of the time evolution of zi (t)’s for the network showing the WTA behavior. It is seen that the cell receiving the largest input becomes the only winner. When the magnitudes of the external stimuli are suddenly changed, the network state relaxes to a new equilibrium state in which some of the old winners are superseded by losers. The relaxation time for this process is an important quantity that determines the response time of the neural system. Due to the particularly simple structure of the equilibrium states in the WTA case, the relaxation time can be analytically estimated as ¶ µ 1W 0 1 (3.20) log τ∼ = 1W 0 2ε in a certain limited case for infinitesimal values of ε (see Appendix B). Here 1W 0 represents the difference between the largest and second-largest inputs in the new external stimulus environment.
A Simple Neural Network Exhibiting Selective Activation
89
3.4 Simulations of the Original Competitive Network. Transforming the network model (equation 2.1) into the Lotka-Volterra equation allows us to study the competitive behavior analytically. This transformation is justified when the ratio λ/β is negligibly small. To examine to what extent the present Lotka-Volterra system can approximate the original network model, simulations were conducted for the network model with the connections given by equation 2.7. Some facts are to be noted before the simulation results are presented. First, the distinction between winners and losers is not so manifest in the original model as in the Lotka-Volterra system (with ² = 0), since the values of output f (ui ) never reach zero for finite values of the membrane potential. In the present simulations, a neuron unit is regarded as a loser if its output in equilibrium is less than 10−3 . Second, in some cases, the winners for equation 2.8 exhibit output values beyond the range zi < f0 to which the behavior of the original model is limited. This is because we have omitted the factor ( f0 − zi )/f0 in equation 2.3 when deriving the Lotka-Volterra system. Therefore the qualitative behavior of equation 2.1 should be examined also in such a case. Figures 3a and 3b shows the numerically obtained boundaries on which different types of solutions switch to one another for the original model. The solid curves represent the boundaries between the WTA and VWTA solutions, while the dashed curves indicate those between the WSA and WTA solutions. The former curves were determined as the boundaries below which an initial configuration such as u2 (0) = W2 , u1 (0) = u3 (0) = · · · = 0 evolves into a final configuration with u1 (t) being the largest among all. The latter curves were obtained by simulations of equation 2.1 with initial configurations such as u1 (0) = u2 (0) = · · · = 0. Locations of the boundary curves depend on the criterion to distinguish winners from losers. In the corresponding Lotka-Volterra system, the critical value of k separating the WTA and WSA solutions is k ≈ 0.91, while that separating the WTA and VWTA solutions is k+ ≈ 1.1, for external inputs assumed in the simulations (see Section 4 for details of k± ). In Figure 3a f0 = 2, and all outputs of the Lotka-Volterra system are within the range 0 < z < f0 , while in Figure 3b f0 = 1 and some outputs are not within the range. Figure 3a shows that the behavior of the original network for different values of k can be appropriately described by the Lotka-Volterra system as long as λ is sufficiently small. Furthermore, the two neural networks exhibit qualitatively the same dynamic behavior in Figure 3b. It is noted that the solid and dashed curves intersect each other at a certain value of λ. Beyond this value, the WTA behavior occurs in an initial-value dependent manner in the region between the solid and dashed curves. In other words, an only winner in this region can be either a neuron with the largest input or one with the second-largest input depending on initial values. Such network behavior is classified as VWTA behavior by definition, and the VWTA and WSA behaviors switch at the boundaries given by the solid curves.
90
Tomoki Fukai and Shigeru Tanaka
Figure 3: The behavior obtained at equilibrium from simulation of the original model given in equation 2.1 Five neurons were used, and their external inputs were 0.9, 0.7, 0.5, 0.3, and 0.1. Two parameters β and γ were fixed at 5 and 0.4, respectively, while λ was varied. The asymptote of the response function was (a) f0 = 2 and (b) f0 = 1, respectively. In the latter case, 1 + Wi > f0 for any i.
A Simple Neural Network Exhibiting Selective Activation
91
Figure 4: A schematic diagram showing the switchings between different types of solutions to our Lotka-Volterra equation. The critical values of the ratio k of strength of lateral inhibition to that of self-inhibition are given in the text.
4 Discussion For convenience in mathematical analysis, we studied the solutions to our system separately for the three cases of 0 < k < 1, k = 1, and k > 1. For a given distribution of afferent inputs satisfying equation 3.4, the critical values of k at which switchings from one class of solutions to another occur can be obtained as follows: Setting d = 2 in equation 3.13 and taking only W1 and W2 into account, we can easily see that the switching between the WTA behavior and the WSA behavior with two winners takes place at k = k− ≡ (1 + W1 )/(1 + 2W1 − W2 ). Furthermore, setting d = N in equation 3.13, we find that PNall the neurons remain activated for k less than kL ≡ (1 + WN )/(1 + Wj − NWN ). Thus no neural selection occurs for 0 < k < kL . Note WN + j=1 that 0 < kL < k− < 1. Similarly, by setting a = 2 and j = 1 in equation 3.16, we can immediately see that the neuron receiving the second-largest input can be a winner when k is larger than k+ ≡ (1 + W1 )/(1 + W2 ), where k+ > 1. It is also noted that any neuron can be a winner if k > (1 + W1 )/(1 + WN ). Thus our competitive network exhibits the WSA behavior for kL < k < k− , the WTA behavior for k− < k < k+ , the VWTA behavior for k > k+ , and no selection for 0 < k < kL (see Fig. 4). Assuming different magnitudes for the self- and lateral inhibitions seems to be biologically reasonable. In the local feedback loop in biological neural networks, an interneuron may inhibit adjacent principal cells that activate the interneuron. The strength of this self-inhibition of the principal cells is likely to be different from that of lateral inhibition. Uniform strength was assumed for the lateral inhibition for the sake of rigorous analysis of the population dynamics of neurons. Under these assumptions, the model predicts that competitive neural networks tend to exhibit the WSA behavior when the self-inhibition is stronger than the lateral one. In particular, the model suggests a possible functional role of the self-inhibition: With it, the activity selection occurs in a stimulus-dependent manner, independent of initial conditions. Let us make comparisons of performances between the present network and networks with a conventional on-center off-surround type interactions,
92
Tomoki Fukai and Shigeru Tanaka
which has been fully discussed by Grossberg (1973) using shunting inhibition models. Regarding neural connectivity, our network differs from the conventional one only in the extent of the off-surround inhibition: all-to-all inhibitory connections are assumed in our model. Thus competition occurs not in the vicinity of edges where intensities of input signals change dramatically but among all input signals involved in a stimulus pattern. Based on our analysis presented here, for large values of k (k > k− ), only one neuron remains active, and the others become silent. This means that the network behaves to enhance the contrast between input signals in a WTA manner when the off-surround inhibition is strong. On the other hand, when the value of k is in the interval kL < k < k− , which represents not-so-strong offsurround inhibition, the network enhances the contrast in a WSA manner. For extremely small values of k (k < kL ), no tuning occurs, and an output activity pattern of the network is not very different from an input pattern. Biological significance of our network model may be seen in the neural selection performed by the superior colliculus (SC). In the SC, there is a map that represents the displacement vector of saccadic eye movement (Robinson 1972). A class of neurons in the intermediate layer of the SC exhibit burst firing just before the onset of saccadic eye movement (Wurtz and Munoz 1994). To generate a specific eye movement, the selection of neural activity that encodes a specific displacement vector needs to occur. It is expected that lateral inhibition within the SC may function for this neural selection. According to Wurtz et al. (1980), the lateral inhibition is so widely distributed in the SC that the inhibitory effect influences beyond 50 degrees in the visual space. This implies that inhibitory connections among neurons in the intermediate layer may be regarded as all-to-all connections, as we assumed in our model. Based on this idea of using long-range lateral inhibition, a model proposed by Van Opstal and Van Gisbergen (1989) successfully demonstrated the generation of saccadic eye movement. Recently, Optican (1994) proposed a network model of saccade generation in which burst cells in the intermediate layer of SC interact with one another through mutual inhibition. Although our mathematical model is much simpler than biological networks in the SC, the model is expected to provide an analytical understanding of the underlying mechanism for the activity selection that may actually occur in the SC. The extension of the competitive neural network from the WTA type to the WSA type also has the following theoretical and biological significance: Sensory information is likely to be encoded in the cortex by a large population of neurons. In the neuronal population coding, a single neuron may weakly respond to more than one specific sensory stimulus; that is, a stimulus can coactivate a large number of neurons. As a result, the populations of neurons responding to different stimuli will largely overlap with one another and the cortical representation of the stimulus space inevitably becomes rather complicated. The WSA mechanism in this study seems to provide a simple and appropriate theoretical basis for exploring the
A Simple Neural Network Exhibiting Selective Activation
93
information-theoretical implications of such a complicated neuronal population coding. It is of particular interest to investigate how and to what extent a self-organized cortical map of sensory information is changed if the WSA instead of the WTA behavior is employed in a layer of featuredetecting cells. The present model network will be useful for clarifying these questions because it yields a mathematically simple and tractable description of the feature-detecting layer. Furthermore, the neural selection mechanism based on the WTA for k = 1 has recently been employed in modeling the basal ganglia and motor cortices, which are involved in the generation of voluntary movement (Tamori and Tanaka 1993). It is worthwhile to examine how a new selection rule based on the WSA behavior enhances the model performance in encoding and decoding temporal information on voluntary movement. 5 Conclusion We derived a Lotka-Volterra equation of the mean firing rate from a standard membrane equation for a network composed of mutually inhibiting neurons. By employing the equation for neural selection, we proved that the ratio k of strength of lateral inhibition to that of self-inhibition is of paramount importance in categorizing the network behavior into three different types: WSA, WTA, and VWTA. The WSA behavior implies that the number of winners is in general larger than one and increases as k decreases. The VWTA behavior gives one winner, but it is not necessarily the cell that receives the maximum input. Appendix A: The Number of Winners for Equidistantly Distributed {Wi } We show a simple case where the number of winners (dc ) can be explicitly determined from condition 3.13. Assume that the external inputs are distributed equidistantly according to Wi = Wi−1 + 1 with a constant 1 for any i. Then a short manipulation gives the following expression of dc : 3 1 dc = − + 2 k
s
µ
3 1 − 2 k
¶2
µ
¶µ
1 −1 +2 k
1 + W1 1+ 1
¶
,
(A.1)
where [· · ·] stands for the gaussian notation. We can see that dc approaches unity from above as k → 1. Appendix B: Nonlinear Relaxation Time for Synaptic Modification We are interested in how the neural activities relax from an old stable state to a new stable state when the configuration of the afferent inputs {Wi } slightly
94
Tomoki Fukai and Shigeru Tanaka
changes to {Wi0 } at t = 0. Without loss of generality, we can assume that W1 is the maximum and W2 is the second maximum of all Wi ’s for t < 0, and that W20 is the maximum and W10 is the second maximum of all Wi0 ’s for t > 0. First, we discuss the linear stability around the old stable fixed point zE(1) for t > 0. The corresponding linearized equation is P (W10 − W1 )(1 + W1 ) − (1 + W1 ) j δzj(1) (1) dzi (1) 0 for i = 1 (B.1) = + (W1 − W1 )δz1 + o(ε), dt 0 (1) otherwise. (Wi − W1 )δzi + o(ε), Comparing the order of magnitude of the first term for i = 1 in equation B.1 with that of other terms, we find that this term determines the dynamical behavior at this stage. Thus, we find that the system moves linearly with 0 respect to time from zE(1) toward the following new fixed point zE(1) which is not stable: X 1 1 0 δi,1 = (1 + W10 )δi,1 + ε − z(1) i 1 + W10 W10 − Wj0 j6=1 +
W10
ε (1 − δi,1 ) + o(ε 2 ) . − Wi0
(B.2)
This relaxation requires time τ1 , which can be roughly estimated by , τ1 =
0 (z(1) 1
−
z(1) 1 )
1 dδz(1) i = + o(ε). dt 1 + W1
(B.3)
This time scale implies that the system relaxes very rapidly to the nearest new fixed point, which is not stable. 0 0 Next the system moves from zE(1) to a new stable fixed point zE(2) . In this process, we may expect that the only relevant variables are z1 (t) and z2 (t). Hence, we can neglect time evolution for other zi (t)’s. Consequently, we can reduce equation B.1 to the following equations: dz1 = z1 (1 − z1 − z2 − r + W10 ) + ε, dt
(B.4a)
dz2 = z2 (1 − z1 − z2 − r + W20 ) + ε, (B.4b) dt P zi = o(Nε). Using new variables ξ = z1 + z2 and η = z2 , we where r = i6=1,2
obtain dξ = ξ(1 − r + W10 − ξ ) + 2ε + 1W 0 η, dt
(B.5a)
A Simple Neural Network Exhibiting Selective Activation
dη = η(1 − r + W20 − ξ ) + ε, dt
95
(B.5b)
where 1W 0 = W20 − W10 . Here, the problem is how this system changes from 0 0 one state: ξ ≈ o(1) and η = z(1) 2 ≈ ε/1W for t = 0 to the other state: ξ ≈ o(1) 0 ≈ o(1) for t > 0. We regard the last two terms 2ε + 1W 0 η in and η = z(2) 2 equation B.5a as a small perturbation. Since ξ rapidly follows the change in η, ξ can be approximated by 1W 0 η + 2ε . ξ∼ = 1 − r + W10 + 1 − r + W10
(B.6)
If we substitute ξ of equation B.5b for the right-hand side of B.6, we obtain dη = aη(b − η) + o(ε), dt
(B.7)
where a = 1W 0 /(1 − r + W10 ) and b = 1 − r + W10 . This differential equation has the asymptotic solution η(t) = b
c(t) . 1 + c(t)
(B.8)
c(t) satisfies the equation dc(t)/dt = ac(t)+o(ε). The solution to this equation is c(t) = (c0 + ε/a)eat − o(ε), where c0 is the initial value of c(t) and is given by c0 ∼ = ε/1W 0 . Let us define the relaxation time τ2 at this stage (Suzuki 1978) by c(τ2 ) = 1. η becomes half of the maximum b/2 from the initial infinitesimal value after time τ2 . Therefore, the relaxation time is given by τ2 ∼ =
¶ µ 1 1W 0 . log 1W 0 2ε
(B.9)
From the relaxation times τ1 and τ2 obtained above, it is evident that τ2 is much larger than τ1 for infinitesimal values of ε. Consequently, the total relaxation time τ can be given by τ2 . References Alexander, G. E., DeLong, M. R., and Strick, P. L. 1986. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Ann. Rev. Neurosci. 9, 357–381. Cohen, M., and Grossberg, S. 1983. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man and Cybernetics 13, 815–826. Coultrip, R., Granger, R., and Lynch, G. 1992. A cortical model of winner-take-all competition via lateral inhibition. Neural Networks 5, 47–54.
96
Tomoki Fukai and Shigeru Tanaka
Cowan, J. D. 1968. Statistical mechanics of nervous nets. In Neural Networks, E. R. Caianiello, ed., pp. 181–188. Springer-Verlag, Berlin. Cowan J. D. 1970. A statistical mechanics of nervous activity. In Some Mathematical Questions in Biology, AMS Lectures on Mathematics in the Life Sciences 2, pp. 1–57. American Mathematical Society, Providence, RI. Ermentrout, B. 1992. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks 5, 415–431. Georgopoulos, A. P., Taira, M., and Lukashin, A. 1993. Cognitive neurophysiology of the motor cortex. Science 260, 47–52. Gilbert C. D., and Wiesel, T. N. 1990. The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res. 30, 1689–1701. Grossberg, S. 1973. Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Appl. Math. 52, 213–257. Grossberg, S., and Levine, D. 1975. Some developmental and attentional biases in the contrast enhancement and short term memory in recurrent neural networks. Studies in Appl. Math. B53, 341–380. Hikosaka, O. 1991. Basal ganglia—possible role in motor coordination and learning. Current Opinion in Neurobiology 1, 638–643. Kaski, S., and Kohonen, T. 1994. Winner-take-all networks for physiological models of competitive learning. Neural Networks 7, 973–984. Majani, E., Erlarson, R., and Abu-Mostafa, Y. 1989. On the K-winners-take-all network. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 634–642. Morgan Kaufmann, Los Altos, CA. Optican, L. M. 1994. Control of saccade trajectory by the superior colliculus. In Contemporary ocular motor and vestibular research: A tribute to David A. Robinson, A. F. Fuch, T. Brandt, U. Buttner, and D. S. Zee, eds., pp. 98–105. SpringerVerlag, Berlin. Robinson, D. A. 1972. Eye movements evoked by collicular electrical stimulation in the alert monkey. Vision Research 12, 1285–1302. Shepherd, G. M. 1990. The Synaptic Organization of the Brain. 3d ed. Oxford University Press, New York. Suzuki, M. 1978. Theory of instability, nonlinear Brownian motion and formation of macroscopic order. Phys. Lett. 67A, 339–341. Tamori, Y., and Tanaka, S. 1993. A model for functional relationships between cerebral cortex and basal ganglia in voluntary movement. Society for Neuroscience Abstracts 19, 547. Tanaka, S. 1990. Theory of self-organization of cortical maps: Mathematical framework. Neural Networks 3, 625–640. Van Essen, D. C., Newsome, W. T., and Maunsel, J. H. R. 1984. The visual field representation in striate cortex of the macaque monkey: Asymmetries, anisotropies and individual variability. Vision Res. 24, 429–448. Van Opstal, A. J., and Van Gisbergen J. A. M. 1989. A nonlinear model for collicular spatial interactions underlying the metrical properties of electrically elicited saccades. Biol. Cybern. 60, 171–183. Wilson, M. A., and McNaughton, B. 1993. Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1058.
A Simple Neural Network Exhibiting Selective Activation
97
Wolfe, W. J., Mathis, D., Anderson, C., Rothman, J., Gotter, M., Brady, G., Walker, R., Duane, G., and Alaghband, G. 1991. K-winner networks. IEEE Trans. Neural Networks 2, 310–315. Wurtz, R. H., and Munoz, D. P. 1994. Organization of saccade related neurons in monkey superior colliculus. In Contemporary Ocular Motor and Vestibular Research: A Tribute to David A. Robinson, A. F. Fuch, T. Brandt, U. Buttner, and D. S. Zee, eds., pp. 520–527. Springer-Verlag, Berlin. Wurtz, R. H., Richmond, B. J., and Judge, S. J. 1980. Vision during saccadic eye movements III: Visual interactions in monkey superior colliculus. J. Neurophysiol. 43, 1168–1181. Young, M. P., and Yamane, S. 1992. Sparse population coding of faces in the inferotemporal cortex. Science 256, 1327–1331. Yuille, A. L., and Grzywacz N. M. 1989. A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comp. 1, 334–347.
Received April 25, 1995; accepted April 11, 1996.
Communicated by Eric Baum
Playing Billiards in Version Space P´al Ruj´an Fachbereich 8 Physik and ICBM, Postfach 2503, Carl-von-Ossietzky Universit¨at, 26111 Oldenburg, Germany
A ray-tracing method inspired by ergodic billiards is used to estimate the theoretically best decision rule for a given set of linear separable examples. For randomly distributed examples, the billiard estimate of the single Perceptron with best average generalization probability agrees with known analytic results, while for real-life classification problems, the generalization probability is consistently enhanced when compared to the maximal stability Perceptron. 1 Introduction Neural networks can be used for both concept learning (classification) and for function interpolation and/or extrapolation. Two basic mathematical methods seem to be particularly adequate for studying neural networks: geometry (especially combinatorial geometry) and probability theory (statistical physics). Geometry is illuminating, and probability theory is powerful. In this article I consider what is perhaps the simplest neural network, the venerable Perceptron (Rosenblatt 1958): Given a set of examples falling in two classes, and find the linear discriminant surface separating the two sets, if one exists. In this context, my goal is twofold: (1) give the optimal (or Bayesian) decision theory a geometric interpretation and (2) use that geometric content for designing a deceptively simple method for computing the single Perceptron with the best possible generalization probability. As shown in Figure 1, a Perceptron is a network consisting of N (binary or real) inputs xi and a single binary output neuron. The output neuron sums all inputs with the corresponding synaptic weights wi , then performs a threshold operation according to à σ = sign
N X
! xi wi − θ
E − θ] = ±1. = sign[(E x, w)
(1.1)
i=1
In general, the synaptic weights and the threshold can be arbitrary real E has only binary components is called a binary numbers. A network where w Perceptron. The output unit has binary values σ = ±1 and labels the class to E = (w1 , w2 , . . . , wN ) which the input vector belongs. Both the weight vector w Neural Computation 9, 99–122 (1997)
c 1997 Massachusetts Institute of Technology °
100
P´al Ruj´an
Figure 1: The Perceptron as neural network.
and the threshold θ should be learned from a set of M examples whose class is known (the training set). The Perceptron and its variants (Hertz et al. 1991) are among the few networks for which basic properties like the maximal information capacity (how many random1 examples can be faultlessly stored in the network) or the classification error of certain learning algorithms (how many examples are needed in order to achieve a given generalization probability) can be obtained analytically. Such calculations are done by first defining a learning protocol. For example, one assumes that the examples are independently and identically sampled from some stationary probability distribution; that one has access to a very large set of such examples for both training and testing purposes; that no matter how many examples one generates; that there is always a single Perceptron network solving the problem without errors; and so forth. Almost all statistical mechanical calculations also require the thermodynamic limit N → ∞, M → ∞, α = M/N = const. These assumptions are needed for technological reasons—some integrals 1 By random vectors we mean in the following vectors sampled independently and identically from a constant distribution defined on the surface of the unit sphere (E x, xE) = 1.
Playing Billiards in Version Space
101
must be concretely evaluated—but also because otherwise the problem is mathematically not well defined. In most practical situations, however, one is confronted with situations that do not satisfy some of these assumptions: One has a relatively small set of examples; their distribution is unknown and often not stationary; the examples are not independently sampled; and so forth. In a strict sense, such problems are mathematically not quantifiable. However, even a small set of examples contains some information about the nature of the problem (Polya ´ 1968). The basic question is then, among all possible networks trained on this set, which one has possibly the best generalization probability? As explained below, understanding the geometry of the version space leads to the correct answer and to algorithms for computing it. Our protocol assumes that both the training and test examples are drawn independently and identically from the distribution P(ξE ). The simplest way to generate such a rule is to define a teacher network with the same structure as shown in Figure 1, providing the correct class label σ = ±1, for M , ξE (ν) ∈ RN and their each input vector. Given a set of M examples {ξE (ν) }ν=1 corresponding binary class label σν = ±1, the task of the learning algorithm is to find a student network that mimics the teacher by choosing the network parameters that correctly classify the training examples. Equation 1.1 E xE) = θ implies that the learning task consists of finding a hyperplane (w, M : separating the positive from the negative labeled examples xE ∈ {ξE (ν) }ν=1 E ξE (α) ) ≥ θ1 , σα = 1 (w, E ξE (β) ) ≤ θ2 , σβ = −1 (w,
(1.2)
θ1 ≥ θ ≥ θ2 . The typical situation is that either none or infinitely many such solutions E θ) is called the version exist. The vector space whose points are the vectors (w, space and its subspace satisfying equation 1.2 the solution polyhedron. Most often one wishes to minimize the average probability of class error for a new xE vector drawn from P(ξE ). In other situations another network, that in average makes more errors but avoids very “bad” ones, is preferable. Yet another problem occurs when the data provided by the teacher are noisy (Opper and Haussler 1992). In what follows we shall restrict ourselves to the standard problem of minimal average classification error. In theory, one knows that for a given training set, the optimal Bayes E θ} soludecision (Opper and Haussler 1991) implies an average over all {w, tions satisfying equation 1.2. Since each solution is perfectly consistent with the training set, in the absence of any other a priori knowledge, one must consider them as equally probable. This is not the case, for instance, in the presence of noise. The best strategy is then to associate with each point in the version space a Boltzmann weight whose energy is the squared error function and whose temperature depends on the noise strength (Opper and Haussler 1992).
102
P´al Ruj´an
For examples independently and identically drawn from a constant distribution (P(ξE ) = const) Watkin (1993) has shown that in the thermodynamic limit the single Perceptron corresponding to the center of mass of the solution polyhedron has the same generalization probability as the Bayes decision. On the practical side, known learning algorithms like Adaline (Widrow and Hoff 1960; Diedrich and Opper 1987) or the maximal stability Perceptron (MSP) (Vapnik 1982; Anlauf and Biehl 1989; Ruj´an 1993) are not Bayes optimal. In fact, if all input vectors have the same length, then the maximal stability Perceptron network corresponds to the center of the largest spherical cone inscribed into the solution polyhedron, as shown in Section 4. There have been several attempts at developing learning algorithms approximating the Bayes decision. Watkin (1993) used a randomized AdaTron algorithm to sample the version space. More recently, Bouten et al. (1995) defined a class of convex functions whose minimum lies within the solution polyhedron. By changing the function’s parameters, one can obtain a minimum, which is a good approximation of the known Bayes solution. This idea is somewhat similar to the notion of an analytic center of a convex polytope introduced by Sonnevend (1988) and used extensively for designing fast linear programming algorithms (Vaidya 1990). A very promising but not yet fully exploited approach (Opper 1993) uses the Thouless-Anderson-Palmer (TAP) equations originally developed for the spin-glass problem. In this paper I propose a quite different method, based on an analogy to classical, ergodic billiards. Considering the solution polyhedron as a dynamic system, a long trajectory is generated and used to estimate the center of mass of the billiard. The same idea can be applied also to other optimization problems. Questions related to the theory of billiards are briefly considered in Section 2. Section 3 sets the stage by presenting an elementary geometric analysis in two dimensions. The more elaborated geometry of the Perceptron version space is discussed in Section 4. An elementary algorithmic implementation for open polyhedral cones and their projective geometric closures are summarized in Section 5. Numerical results and a comparison to known analytic bounds and to other learning algorithms can be found in Section 6. Conclusions and further prospects are summarized in Section 7. 2 Billiards A billiard is usually defined as a closed space region (compact set) P ∈ RN dimensions. The boundaries of a billiard are usually piecewise smooth functions. Within these boundaries, a point mass (ball) moves freely, except for the elastic collisions with the enclosing walls. Hence, the absolute value of the momentum is preserved, and the phase space, B = P × S N−1 , is the direct product of P and S N−1 , the surface of the N-dimensional unit velocity sphere. Such a simple Hamiltonian dynamics defines a flow and its
Playing Billiards in Version Space
103
Poincar´e map an automorphism. The mathematicians have defined a finely tuned hierarchy of notions related to such dynamic systems. For instance, simple ergodicity as implied by the Birkhoff-Hincsin theorem means that the average of any integrable function defined on the phase space over a single but very long trajectory equals the spatial mean (except for a set of zero measure). Furthermore, integrable functions invariant under the dynamics must be constant. From a practical point of view, this means that almost all very long trajectories will cover the phase space uniformly. Properties like mixing (Kolmogorov mixing) are stronger than ergodicity. They require the flow to mix uniformly different subsets of B . In hyperbolic systems one can go even further and construct Markov partitions defined on symbolic dynamics and eventually prove related central limit theorems. Not all convex billiards are ergodic. Notable exceptions are ellipsoidal billiards, which can be solved by a proper separation of variables (Moser 1980). Already Jacobi knew that a trajectory started close to and along the boundaries of an ellipse cannot reach a central region bounded by the socalled caustics. In addition to billiards that can be solved by a separation of variables, there are a few other exactly soluble polyhedral billiards (Kuttler and Sigillito 1984). Such solutions are intimately related to the reflection method— the billiard tiles the entire space perfectly. A notable example is the equilateral triangle billiard, first solved by Lam´e in 1852. Other examples of such integrable billiards can be obtained by mapping an exactly soluble onedimensional, many-particle system into a one-particle, high-dimensional billiard (Krishnamurthy et al. 1982). Apart from these exceptions, small perturbations in the form of the billiard usually destroy integrability and lead to chaotic behavior. For example, the stadium billiard (two half-circles joined by two parallel lines) is ergodic in a strong sense; the metric entropy is nonvanishing (Bunimovich 1979). The dynamics induced by the billiard is hyperbolic if at any point in phase space there are both expanding (unstable) and shrinking (stable) manifolds. A famous example is Sinai’s reformulation of the Lorentz gas problem. Deep mathematical methods were needed to prove the Kolmogorov mixing property and in constructing the Markov partitions for the symbolic dynamics of such systems (Bunimovich and Sinai 1980). The question whether a particular billiard is ergodic can be decided in principle by solving the Schrodinger ¨ problem for a free particle trapped in the billiard box. If the eigenfunctions corresponding to the high-energy modes are roughly constant, then the billiard is ergodic. Only a few general results are known for such quantum problems. In fact, I am not aware of theoretical results concerning the ergodic properties of convex polyhedral billiards in high dimensions. If all angles of the polyhedra are rational, then the billiard is weakly ergodic in the sense that the velocity direction will reach only rational angles (relative to the initial direction). In general, as long as two neighboring trajectories collide with the same polyhedral faces,
104
P´al Ruj´an
their distance will grow only linearly. Once they are far enough to collide with different faces of the polyhedron, their distance will abruptly increase. Hence, except for very special cases with high symmetry, it seems unlikely that high-dimensional convex polyhedra as generated by the training examples will fail to be ergodic.

3 A Simple Geometric Problem

For the sake of simplicity, let us illustrate the approach in a two-dimensional setting. In the next section we will show how the concepts developed here generalize to the more challenging Perceptron problem. Let P be a closed convex polygon and $\vec v$ a given unit vector, defining a particular direction in $R^2$. The direction of $\vec v$ can also be described by the angle φ it makes with the x-axis. Next, construct the line perpendicular to $\vec v$ that halves the area of the polygon P, $A_1 = A_2$ (see Fig. 2a). In the next section we will show that this geometric construction is analogous to the Bayes decision in version space. Choosing a set of vectors $\vec v$ oriented at different angles leads to the set of "Bayes lines" seen in Figure 2b. It is obvious that these lines do not intersect at one single point. The same task is computationally not feasible in a high-dimensional version space. It is certainly more economical to compute and store a single point $\vec r_0$ that optimally represents the full information contained in Figure 2b. As evident from Figure 2b, the majority of lines passing through $\vec r_0$ will not partition P into equal areas but will make some mistakes, denoted by $\Delta A = A_1 - A_2 \neq 0$. $\Delta A$ depends on both the direction of $\vec v$ and the polygon P, $\Delta A = \Delta A(\vec v, P)$. Different optimality criteria can be formulated depending on the actual application. The usual approach corresponds to minimizing the squared area-difference averaged over all possible $\vec v$ directions:

$$\vec r_0 = \arg\min \langle (\Delta A)^2 \rangle = \arg\min \int (\Delta A)^2(\vec v)\, p(\vec v)\, d\vec v \qquad (3.1)$$
where $p(\vec v)$ is some a priori known distribution of vectors $\vec v$. Another possible criterion optimizes the worst-case loss over all directions:

$$\vec r_1 = \arg\inf \sup \left\{ (\Delta A)^2 \right\}. \qquad (3.2)$$
In what follows, the point $\vec r_0$ is called the Bayes point. The calculation of the Bayes point according to equation 3.1 is computationally feasible, but a lot of computer power is still needed.

Figure 2: (a) Halving the area along a given direction. (b) The resulting Bayes lines.

A good estimate of the Bayes point is given by the center of mass of the polygon P:

$$\vec S = \frac{\int_P \vec r\,\rho(\vec r)\,dA}{\int_P \rho(\vec r)\,dA} \qquad (3.3)$$

where $\rho(\vec r) = \mathrm{const}$ is the surface mass density.
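As a concrete illustration, here is a minimal numerical sketch of equation 3.3 for a polygon given by its vertex list, using the standard shoelace formulas for constant ρ. The vertex coordinates are those of the polygon in Table 1 below; the function name and layout are assumptions made for this example, not part of the original algorithm.

    import numpy as np

    def polygon_center_of_mass(vertices):
        """Center of mass of a simple polygon with constant surface density
        (equation 3.3 with rho = const), via the shoelace formulas.
        Works for either vertex orientation: the area sign cancels."""
        v = np.asarray(vertices, dtype=float)
        x, y = v[:, 0], v[:, 1]
        xn, yn = np.roll(x, -1), np.roll(y, -1)
        cross = x * yn - xn * y              # signed parallelogram areas
        area = cross.sum() / 2.0             # signed polygon area
        cx = ((x + xn) * cross).sum() / (6.0 * area)
        cy = ((y + yn) * cross).sum() / (6.0 * area)
        return np.array([cx, cy])

    # The polygon of Table 1:
    P = [(1, 0), (4, 6), (9, 4), (11, 0), (9, -2)]
    print(polygon_center_of_mass(P))         # approximately (6.1250, 1.6667)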
Table 1: Exact Coordinates (x, y) of the Bayes Point for the Polygon P = (1, 0), (4, 6), (9, 4), (11, 0), (9, −2) and Its Various Estimates: Center of Mass, Maximal Inscribed Circle, Trajectories of Different Lengths.

Method                        x        σx      y        σy      ⟨ΔA²⟩
Bayes point                   6.1048   —       1.7376   —       1.4425
Center of mass                6.1250   —       1.6667   —       1.5290
Largest inscribed circle      5.4960   —       2.0672   —       7.3551
Billiard—10 collisions        6.0012   0.701   1.6720   0.490   7.0265
Billiard—10² collisions       6.1077   0.250   1.6640   0.095   2.6207
Billiard—10³ collisions       6.1096   0.089   1.6686   0.027   1.6774
Billiard—10⁴ collisions       6.1232   0.028   1.6670   0.011   1.5459
Billiard—10⁵ collisions       6.1239   0.010   1.6663   0.004   1.5335
Billiard—10⁶ collisions       6.1247   0.003   1.6667   0.003   1.5295
Table 1 presents exact numerical results for the x and y coordinates of both the Bayes point and the center of mass. The center of mass is an excellent approximation of the Bayes point. In very high dimensions, as shown by T. Watkin (1993), $\vec r_0 \to \vec S$. A polyhedron can be described by either the list of its vertices or the set of vectors normal to its facets. The transition from one representation to the other requires exponentially many arithmetic operations as a function of dimension. Therefore, in typical classification tasks, this transformation is not practicable. For "round" polygons (polyhedra), the center of the smallest circumscribed or the largest inscribed circle (sphere) is a good choice for approximating the Bayes point in the vertex and the facet representation, respectively. Since in our case the polyhedron generalizing P is determined by a set of normal vectors, only the largest inscribed circle (sphere) is a feasible approximation (see Fig. 3). The numerical values of the (x_R, y_R) coordinates of its center point are displayed in Table 1. A better approximation would be to compute the center of the largest-volume inscribed ellipsoid (Ruján, in preparation), a problem also common in nonlinear optimization (Khachiyan and Todd 1993). The best-known algorithms require $O(M^{3.5})$ operations, where M is in our case the number of examples. Additional logarithmic factors have been neglected (for details, see Khachiyan and Todd 1993). The purpose of this paper is to show that a reasonable estimate of $\vec r_0$ can be obtained faster by following the (ergodic) trajectory of an elastic ball inside the polygon, as shown in Figure 4a for four collisions. Figure 4b shows the trajectory of Figure 4a from another perspective, obtained by performing an appropriate reflection on each collision edge. A trajectory is periodic if
Figure 3: The largest inscribed circle. The center of mass (cross) is plotted for comparison.
after a finite number of such reflections the polygon P is mapped onto itself. Fully integrable systems correspond in this respect to polygons that tile the whole space without holes (that this geometric point of view applies also to other fully integrable systems is nicely exposed in Sutherland 1985). If the dynamics is ergodic in the sense discussed in Section 2, then a long enough trajectory should cover the surface of the polygon without holes. By computing the total center of mass of the trajectory, one should then obtain a good estimate of the polygon's center of mass. The question of whether a billiard formed by a "generic" convex polytope is ergodic is, to my knowledge, not solved. Extensive numerical calculations are possible only in low dimensions. The extent to which the trajectory covers P after 1000 collisions is visualized in Figure 5. By continuing this procedure, one can convince oneself that all holes are filled up, so that the trajectory will visit every point inside P. The next question is whether the area of P is homogeneously covered by the trajectory. The numerical results summarized in Table 1 were obtained by averaging over 100 different trajectories for each given length. As the length of the trajectory is increased, these averages converge to the center of mass. Also displayed are the empirical standard deviations σ_{x,y} of the trajectory center-of-mass coordinates and the average squared area-difference, ⟨ΔA²⟩.
Figure 4: (a) A trajectory with four collisions. (b) Its “straightened” form.
Figure 5: A trajectory with 1000 collisions. To improve the resolution, a dotted line has been used.
4 The Geometry of Perceptrons

Consider a set of M training examples, consisting of N-dimensional vectors $\vec\xi^{(\nu)}$ and their class $\sigma_\nu$, ν = 1, . . . , M. Now let us introduce the (N + 1)-dimensional vectors $\vec\zeta^{(\nu)} = (\sigma_\nu \vec\xi^{(\nu)}, -\sigma_\nu)$ and $\vec W = (\vec w, \theta)$. In this representation equation 1.2 becomes equivalent to a standard set of linear inequalities

$$(\vec W, \vec\zeta^{(\nu)}) \geq \Delta > 0. \qquad (4.1)$$
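To make the representation concrete, here is a minimal Python sketch (hypothetical names) of how the example vectors of equation 4.1 can be built and tested; the minimum over ν of the left-hand sides is the stability Δ defined just below.

    import numpy as np

    def build_zeta(xi, sigma):
        """The (N+1)-dimensional example vectors of equation 4.1:
        zeta^(nu) = (sigma_nu * xi^(nu), -sigma_nu)."""
        xi = np.asarray(xi, dtype=float)         # shape (M, N)
        sigma = np.asarray(sigma, dtype=float)   # shape (M,), entries +1 or -1
        return np.hstack([sigma[:, None] * xi, -sigma[:, None]])

    def margins(W, zeta):
        """Left-hand sides (W, zeta^(nu)) of equation 4.1. Their minimum is
        the stability Delta; W = (w, theta) solves the training set iff
        all of them are positive."""
        return zeta @ W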
The parameter $\Delta = \min_{\{\nu\}} (\vec W, \vec\zeta^{(\nu)})$ is called the stability of the linear inequality system. A bigger Δ implies a solution that is more robust against small changes in the example vectors. The vector space whose points are the examples $\vec\zeta^{(\nu)}$ is the example space. In this space $\vec W$ corresponds to the normal vector of an (N + 1)-dimensional hyperplane. The version space is the space of the vectors $\vec W$. Hence, a given example vector $\vec\zeta$ corresponds here to the normal of a hyperplane. The inequalities (equation 4.1) define a convex polyhedral cone whose boundary hyperplanes are determined by the training set $\{\vec\zeta^{(\nu)}\}_{\nu=1}^{M}$. How can one use the information contained in the example set for making the best average prediction on the class label of a new vector drawn from
the same distribution? Each newly presented example $\vec\zeta^{(\mathrm{new})}$ corresponds to a hyperplane in version space. The direction of its normal is defined up to a factor σ = ±1, the corresponding class. The best possible decision for classifying the new example follows the Bayes scheme: For each new (test) example, generate the corresponding hyperplane in version space. If this hyperplane does not intersect the solution polyhedron, consider the normal to be positive when pointing to the part of the version space containing the solution polyhedron. Hence, all Perceptrons satisfying equation 4.1 will classify the new example unanimously. If the hyperplane cuts the solution polyhedron in two parts, point the normal toward the bigger half. Therefore, the decision that minimizes the average generalization error is given by weighing the average measure of pro votes against the average measure of contra votes of all Perceptrons making no errors on the training set. We see that the Bayes decision is analogous to the geometric problem described in the previous section. However, the solution polyhedral cone is either open or is defined on the unit (N + 1)-dimensional hypersphere (see also Fig. 6b for an illustration). The practical question is how to find simple approximations for the Bayes point. One possibility is to consider the Perceptron with maximal stability, defined as follows:

$$\vec W_{\mathrm{MSP}} = \arg\max_{\vec W} \Delta; \quad (\vec W, \vec W) = 1 \qquad (4.2)$$
or, equivalently, as

$$\vec w_{\mathrm{MSP}} = \arg\max_{\vec w} \{\theta_1 - \theta_2\}; \quad (\vec w, \vec w) = 1 \qquad (4.3)$$
where $\theta = (\theta_1 + \theta_2)/2$ and $\Delta = (\theta_1 - \theta_2)/2$. The quadratic conditions $(\vec W, \vec W) = 1$ [$(\vec w, \vec w) = 1$] are necessary because otherwise one could multiply $\vec w$ and $\theta_{1,2}$ by a large number, making $\theta_1 - \theta_2$ and thus Δ arbitrarily large. Equation 4.3 has a very simple geometric interpretation, shown in Figure 6. Consider the convex hulls of the vectors $\vec\xi^{(\alpha)}$ belonging to the positive examples, $\sigma_\alpha = 1$, and of those in the negative class, $\vec\xi^{(\beta)}$, $\sigma_\beta = -1$, respectively. According to equation 4.3, the Perceptron with maximal stability corresponds to the slab of maximal width one can put between the two convex hulls (the "maximal dead zone" [Lampert 1969]). Geometrically, this problem is equivalent (dual) to finding the direction of the shortest line segment connecting the two convex hulls (the minimal connector problem). Since the dual problem minimizes a quadratic function subject to linear constraints, it is a quadratic programming problem. By choosing $\theta = (\theta_1 + \theta_2)/2$ (dotted line in Fig. 6) one obtains the MSP. The direction of $\vec w$ is determined by at most N + 1 vertices taken from both convex hulls, called active constraints.
Figure 6a shows a simple two-dimensional example; the active constraints are labeled A, B, and C, respectively. The version space is three-dimensional, as shown in Figure 6b. The three planes represent the constraints imposed by the examples A (left plane), B (right plane), and C (lower plane). The bar points from the origin to the point defined by the MSP solution. The sphere corresponds to the normalization constraint. If the example vectors $\vec\xi^{(\nu)}$ all have the same length N, then equations 4.1 and 4.2 imply that the distances between $\vec W_{\mathrm{MSP}}$ and the hyperplanes corresponding to active constraints are all equal to $\Delta_{\max}/N$. All other hyperplanes participating in the polyhedral cone are farther away. Accordingly, the MSP corresponds to the center of the largest circle inscribed into the spherical triangle defined by the intersection of the unit sphere with the solution polyhedron, the point where the bar intersects the sphere in Figure 6b. A fast algorithm for computing the minimal connector, requiring on average O(N²M) operations and O(N²) storage space, can be found in Ruján (1993).

5 How to Play Billiards in Version Space

Each billiard game starts by first placing the ball(s) on the pool table. This is not always a trivial task. In our case, the MSP algorithm (Ruján 1993) does it or signals that a solution does not exist. The trajectory is initiated by generating at random a unit direction vector $\vec v$ in version space. The basic step consists of finding out where, on which hyperplane, the next collision will take place. The idea is to compute how much time the ball needs until it eventually hits each one of the M hyperplanes. Given a point $\vec W = (\vec w, \theta)$ in version space and a unit direction vector $\vec v$, let us denote the distance along the hyperplane normal $\vec\zeta$ by $d_n$ and the component of $\vec v$ perpendicular to the hyperplane by $v_n$. In this notation the flight time needed to reach this plane is given by
$$d_n = (\vec W, \vec\zeta), \qquad v_n = (\vec v, \vec\zeta), \qquad \tau = -\frac{d_n}{v_n}. \qquad (5.1)$$
After computing all M flight times, one looks for the smallest positive one, $\tau_{\min} = \min_{\{\nu\}} \tau > 0$. The collision will take place on the corresponding hyperplane. The new point $\vec W'$ and the new direction $\vec v'$ are calculated as

$$\vec W' = \vec W + \tau_{\min}\,\vec v, \qquad (5.2)$$
$$\vec v' = \vec v - 2 v_n \vec\zeta. \qquad (5.3)$$
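The basic step condenses into a short sketch. The function name and the escape handling via tau_max are assumptions; the rows of zeta are assumed to be normalized to unit length so that equation 5.3 is a proper reflection.

    import numpy as np

    def bounce(W, v, zeta, tau_max=1e6):
        """One billiard step (equations 5.1-5.3). `zeta` is the M x (N+1)
        array of unit-normalized example vectors. Returns None on escape
        (minimal flight time exceeds tau_max), in which case a new
        trajectory is started from the MSP point."""
        d_n = zeta @ W                   # distances along the normals (eq. 5.1)
        v_n = zeta @ v                   # velocity components along the normals
        with np.errstate(divide="ignore"):
            tau = -d_n / v_n             # flight times to all M hyperplanes
        tau[tau <= 0] = np.inf           # keep only forward-in-time collisions
        nu = int(np.argmin(tau))
        if tau[nu] > tau_max:
            return None
        W_new = W + tau[nu] * v              # eq. 5.2
        v_new = v - 2.0 * v_n[nu] * zeta[nu] # eq. 5.3 (reflection)
        return W_new, v_new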
This procedure is illustrated in Figure 7a. In order to estimate the center of mass of the trajectory, one must first normalize both $\vec W$ and $\vec W'$. By assuming
Figure 6: (a) The Perceptron with maximal stability in example space. (b) The solution polyhedron (only the bounding examples A, B, and C are shown). See text for details.
Figure 7: Bouncing in version space. (a) Euclidean geometry. (b) Spherical geometry.
a constant line density, one assigns to the (normalized!) center of the segment, $(\vec W' + \vec W)/2$, the length of the vector $\vec W' - \vec W$. This is then added to the actual center of mass, as when adding two parallel forces of different magnitudes. In high dimensions (N > 5), however, the difference between the mass of the full (N + 1)-dimensional solution polyhedron and the mass of the bounding N-dimensional boundaries becomes negligible. Hence, we could just as well record the collision points, assign them the same mass density, and construct their average. Note that by continuing the trajectory beyond the first collision plane, one can also sample regions of the solution space where the network $\vec W$ makes one, two, and more mistakes (the number of mistakes equals the number of crossed boundary planes). This additional information can be used for taking an optimal decision when the examples are noisy (Opper and Haussler 1992). Since the polyhedral cone is open, the implementation of this algorithm must take into account the possibility that the trajectory might escape to infinity. The minimal flight time then becomes very large: τ > τ_max. When this exception is detected, a new trajectory is started from the MSP point in yet another random direction. Hence, from a practical point of view, the polyhedral solution cone is closed by a spherical shell with radius τ_max acting as a special "scatterer." This flipper procedure is iterated until enough data are gathered.
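A minimal sketch of this weighted accumulation, under the assumptions above (segment endpoints already collected, midpoints normalized back onto the unit shell), might look as follows; the names are hypothetical.

    import numpy as np

    def trajectory_center(segments):
        """Center-of-mass estimate from trajectory segments (W, W_next):
        each normalized midpoint is weighted by the segment length,
        as when adding parallel forces of different magnitudes."""
        center, total = 0.0, 0.0
        for W, W_next in segments:
            mid = (W + W_next) / 2.0
            mid = mid / np.linalg.norm(mid)   # normalize the midpoint
            weight = np.linalg.norm(W_next - W)
            center = center + weight * mid
            total += weight
        return center / total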
If we are class conscious and want to remain in the billiard club, we must do a bit more. The solution polyhedral cone can be closed by normalizing the version space vectors. The billiard is now defined on a curved space. However, the same strategy works here as well if between subsequent collisions one follows geodesics instead of straight lines. Figure 7b illustrates the change in direction for a small time step, leading to the well-known geodesic differential equations on the unit sphere:

$$\dot{\vec W} = \vec v, \qquad (5.4)$$
$$\dot{\vec v} = -\vec W. \qquad (5.5)$$
The solution of these equations costs additional resources. Actually, solving the differential equations is strictly necessary only when there are no bounding planes on the actual horizon (assuming light travels along Euclidean straight lines). Once one or more boundaries are "visible," the choice of the shortest flight time can be evaluated directly, since in the two geometries the flight time is monotonically deformed. Even so, the flipper procedure is obviously faster. Both variants deliver interesting additional information, like the mean escape time of a trajectory or the number of times a given border plane has been bounced on. The collision frequency classifies the training examples according to their "surface" area in the solution polyhedron, a good measure of their relative "importance."

6 Results and Performance

This section contains the results of numerical experiments performed to test the billiard algorithm. First, the ball is placed inside the billiard with the MSP algorithm (as described in Ruján 1993). Next, a number of typically O(N²) collision points are generated. Since the computation of one collision point requires M scalar products of N-dimensional vectors, the total load of this algorithm is O(MN³). The choice of N² collision points is somewhat arbitrary and is based on the following considerations. By using the billiard method, one generates many collision points lying on the borders of the solution polyhedron. We could try to use this information for approximating the solution polyhedron with an ellipsoidal cone. The number of free parameters involved in the fit is of the order O(N²). Hence, at least a constant times that many points are needed. Such a fitted ellipsoid also delivers an estimate of the decision uncertainty. If one is not interested in this information, it is enough to monitor how one or more projections of the center-of-mass estimate change as the number of collisions increases. Once these projections become stable, the program can be stopped. T. Watkin (1993) argues that the number of sampling points should be of O(1). I find his arguments unconvincing. For example, Figure 8 shows how
Figure 8: Average of the generalization probability as a function of the number of collisions, N = 100, M = 1000.
the average generalization probability changes as the number of collisions increases up to N² steps. Finding a point within the solution polyhedron is algorithmically equivalent to solving a linear programming problem. Therefore, his method of sampling the version space with randomly started AdaTron gradient descent (or any other Perceptron learning method) requires at least O(N²M) operations per sampling point, compared to O(NM) per collision in the billiard method. In a first test, a known "teacher" Perceptron $\vec T$ was used to label the randomly generated examples for training. The generalization probability G(α), α = M/N, was then computed by measuring the overlap between the resulting solution ("student") Perceptron and the teacher Perceptron:

$$G(\alpha) = 1 - \frac{1}{\pi}\cos^{-1}(\vec T, \vec w); \qquad (\vec T, \vec T) = (\vec w, \vec w) = 1. \qquad (6.1)$$
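For completeness, equation 6.1 translates directly into a few lines; the clipping guards against rounding errors and is an implementation detail, not part of the original formula.

    import numpy as np

    def generalization_probability(T, w):
        """Equation 6.1. Both the teacher T and the student w are
        assumed normalized: (T, T) = (w, w) = 1."""
        overlap = np.clip(np.dot(T, w), -1.0, 1.0)
        return 1.0 - np.arccos(overlap) / np.pi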
The numerical results obtained from 10 different realizations for N = 100 are compared with the theoretical Bayes learning curve in Figure 9. Figure 10 shows a comparison between the billiard results and the MSP results. Although the differences seem small compared to the error bars, the billiard solution was in all realizations consistently superior to the MSP. Figure 11 shows how the number of constraints (examples) bordering the solution polyhedron changes with increasing α = M/N.
Figure 9: The theoretical Bayes learning curve (solid line) versus billiard results obtained in 10 independent trials. G(α) is the generalization probability, α = M/N, N = 100.
As the number of examples increases, the probability of escape from the solution polyhedron decreases, and the network reaches its storage capacity. Figure 12 shows the average number of collisions before escape as a function of the classification error, parameterized through α = M/N. Therefore, by measuring either the escape rate or the number of "active" examples, we can estimate the generalization error without using test examples. Note, however, that such calibration graphs should be calculated for each specific distribution of the input vectors. Randomly generated training examples lead to rather isotropic polyhedra, as illustrated by the small difference between the Bayes and the MSP learning curves (see Fig. 10). Therefore, we expect that the billiard approach leads to bigger improvements when applied to real-life problems with strongly anisotropic solution polyhedra. Similar improvements can be expected when using constructive methods for multilayer Perceptrons that iteratively use the Perceptron algorithm (Marchand et al. 1989). Such procedures have been used, for example, for classifying handwritten digits (Knerr et al. 1990). For all such example sets available to me, the introduction of the billiard algorithm on top of the MSP leads to consistent improvements of up to 5% in classification probability.
Figure 10: Average generalization probability, same parameters as in Figure 9. Lower curve: the maximal stability Perceptron algorithm. Upper curve: the billiard algorithm.
A publicly available data set known as the sonar problem (Gorman and Sejnowski 1988; Fahlman N.d.) considers the problem of deciding between rocks and mines from sonar data. The input space is N = 60 dimensional, and the whole set consists of 111 mine and 97 rock examples. We go down the list of rock signals putting alternate members into the training and test sets; we do the same with the set of mine signals. Since the data sets are sorted by increasing azimuth, this gives us training and test sets of equal size (104 signals each) with the population of azimuth angles matched as closely as possible. Gorman and Sejnowski (1988) consider a three-layer feedforward neural network architecture with different numbers of hidden units, trained by the backpropagation algorithm. Their results are summarized and compared to ours in Table 2. By applying the MSP algorithm, we first found that the whole data (training plus test) set is linearly separable. Second, by using the MSP on the training set we obtain a 77.5% classification rate on the test set (compared to 73.1% in Gorman and Sejnowski 1988). Playing billiards leads to an 83.4% classification rate (in both cases, the training set was faultlessly classified). This amount of improvement is also typical for other applications.
Figure 11: Number of different identified polyhedral borders versus G(α), N = 100. Diamonds: maximal stability Perceptron, saturating at 101. Crosses: billiard algorithm.
Figure 12: Mean number of collisions before escaping the solution polyhedral cone versus G(α), N = 100.
Table 2: Results for the Sonar Classification Problem from Gorman and Sejnowski (1988).

Hidden units   % right on training set   Standard deviation   % right on test set   Standard deviation
0              79.3                      3.4                  73.1                  4.8
0–MSP          100.0                     —                    77.5                  —
0–Billiard     100.0                     —                    83.4                  —
2              96.2                      2.2                  85.7                  6.3
3              98.1                      1.5                  87.6                  3.0
6              99.4                      0.9                  89.3                  2.4
12             99.8                      0.6                  90.4                  1.8
24             100.0                     0.0                  89.2                  1.4

Note: 0–MSP is the maximal stability Perceptron; 0–Billiard is the Bayes billiard estimate.
The number of active examples (those contributing to the solution) was 42 for the MSP and 55 during the billiard. By computing the MSP and the Bayes Perceptron, we did not use any information available on the test set. On the contrary, many networks trained by backpropagation are slightly adjusted to the test set by changing network parameters: output unit biases and/or activation function decision bounds. Other training protocols also allow such adjustments or generate a population of networks, from which a "best" one is chosen based on test set results. Although such adaptive behavior might be advantageous in many practical applications, it can be misleading when trying to infer the real capability of the trained network. In the sonar problem, for instance, we know that a set of Perceptrons separating faultlessly the whole data (training plus test) set is included in the version space. Hence, one could use the billiard or another method to find it. This would be an extreme example of "adapting" our solution to the test set. Such a procedure is especially dangerous when the test set is not "typical." Since in the sonar problem the data were divided into two equal sets, by exchanging the roles of the training and test sets one would expect results of similar quality. However, in this case, much weaker results (a 73.3% classification rate) are obtained. This shows that the two sets do not contain the same amount of information about the common Perceptron solution. Looking at the results of Table 2, it is hard to understand how 104 training examples could substantiate the excellent average test set performance of 90.4 ± 1.8% for networks with 12 hidden units (841 free parameters). The method of
structural risk minimization (Vapnik 1982) uses uniform bounds for the generalization error in networks with different Vapnik-Chervonenkis (1971) dimensions. To establish a classification probability of 90%, one needs about 10 times more training examples for the linearly separable class of functions. A network with 12 hidden units certainly has a much larger capacity and requires correspondingly more examples. One could argue that each of the 12 hidden units has solved the problem on its own, so that the network acts as a committee machine. However, such a majority decision should be at best comparable to the Bayes estimate.
7 Conclusions and Prospects

The study of dynamical systems has led to many interesting practical applications in time-series analysis, coding, and chaos control. The elementary application of Hamiltonian dynamics presented in this paper demonstrates that the center of mass of a long dynamic trajectory bouncing back and forth between the walls of the convex polyhedral solution cone leads to a good estimate of the Bayes decision rule for linearly separable problems. Somewhat similar ideas have recently been applied to constrained nonlinear programming (Coleman and Li 1994). Although the geometric view presented in this paper overemphasizes the role played by extremal (active) vertices of the example set and by faultlessly trained Perceptrons, it is not difficult to make these estimates more robust. For example, some of the extremal examples can be removed to test, and improve, the stability of the MSP. Similarly, the solution polyhedron can be expanded to include solutions with nonzero training errors, allowing for learning noisy examples. Since the Perceptron problem is equivalent to the linear inequalities problem (equation 4.1), it has the same algorithmic complexity as linear programming. The theory of convex polyhedra plays a central role in both mathematical programming and solving NP-hard problems such as the traveling salesman problem (Lawler and Lenstra 1984; Padberg and Rinaldi 1987). Viewed from this perspective, the ergodic theory of convex polyhedral billiards might provide new, effective tools for solving different combinatorial optimization problems (Ruján, in preparation). A big advantage of such ray-tracing algorithms is that they can be run in parallel by following several trajectories at the same time. The success of further applications depends on methods of making such simple dynamics strongly mixing. On the theoretical side, more general results, applicable to large classes of convex polyhedral billiards, are called for. In particular, a good estimate of the average escape (or typical mixing) time is needed in order to bound the average behavior of future "ergodic" algorithms.
Acknowledgments

I thank Manfred Opper for the analytic Bayes data and Bruno Eckhardt for discussions on billiards. For a LaTeX original of this article with PostScript figures and a C implementation of the billiard algorithm, point your browser at http://www.icbm.uni-oldenburg.de/~rujan/rujan.html.

References

Anlauf, J., and Biehl, M. 1989. The AdaTron: An adaptive Perceptron algorithm. Europhysics Letters 10, 687.
Bouten, M., Schietse, J., and Van den Broeck, C. 1995. Gradient descent learning in Perceptrons: A review of possibilities. Physical Review E 52, 1958.
Bunimovich, L. A. 1979. On the ergodic properties of nowhere dispersing billiards. Communications in Mathematical Physics 65, 295.
Bunimovich, L. A., and Sinai, Y. G. 1980. Markov partitions for dispersed billiards. Communications in Mathematical Physics 78, 247.
Coleman, T. F., and Li, Y. 1994. On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Mathematical Programming 67, 189.
Diederich, S., and Opper, M. 1987. Learning of correlated patterns in spin-glass networks by local learning rules. Physical Review Letters 58, 949.
Fahlman, S. E. N.d. Collection of neural network data. Included in the public domain software am-6.0.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Khachiyan, L. G., and Todd, M. J. 1993. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Mathematical Programming 61, 137.
Knerr, S., Personnaz, L., and Dreyfus, G. 1990. Single layer learning revisited: A stepwise procedure for building and training a neural network. In Neurocomputing, F. Fogelman and J. Hérault, eds. Springer-Verlag, Berlin.
Krishnamurthy, H. R., Mani, H. S., and Verma, H. C. 1982. Exact solution of the Schrödinger equation for a particle in a tetrahedral box. Journal of Physics A 15, 2131.
Kuttler, J. R., and Sigillito, V. G. 1984. Eigenvalues of the Laplacian in two dimensions. SIAM Review 26, 163.
Lampert, P. F. 1969. Designing pattern categories with extremal paradigm information. In Methodologies of Pattern Recognition, M. S. Watanabe, ed., p. 359. Academic Press, New York.
Lawler, E. L., Lenstra, J. K., Rinnoy Kan, A. H. G., and Shmoys, D. B., eds. 1984. The Traveling Salesman Problem. John Wiley, New York.
Marchand, M., Golea, M., and Ruján, P. 1989. Convergence theorem for sequential learning in two layer perceptrons. Europhysics Letters 11, 487.
Moser, J. 1980. Geometry of quadrics and spectral theory. In The Chern Symposium, Y. Hsiang et al., eds., p. 147. Springer-Verlag, Berlin.
Opper, M. 1993. Bayesian learning. Talk at Neural Networks for Physicists 3, Minneapolis, MN.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithm for learning a Perceptron. Physical Review Letters 66, 2677.
Opper, M., and Haussler, D. 1992. Calculation of the learning curve of Bayes optimal classification algorithm for learning a Perceptron with noise. Contribution to the IVth Annual Workshop on Computational Learning Theory (COLT91), Santa Cruz 1991, pp. 61–87. Morgan Kaufmann, San Mateo, CA.
Padberg, M., and Rinaldi, G. 1987. Optimization of a 532-city symmetric traveling salesman problem by branch and cut. Operations Research Letters 6, 1.
Padberg, M., and Rinaldi, G. 1988. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. IASI preprint R. 247.
Pólya, G. 1968. Mathematics and Plausible Reasoning. Vol. 2: Patterns of Plausible Inference. 2d ed. Princeton University Press, Princeton, NJ.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386.
Ruján, P. 1993. A fast method for calculating the Perceptron with maximal stability. Journal de Physique (Paris) I 3, 277.
Ruján, P. In preparation. Ergodic billiards and combinatorial optimization.
Sonnevend, Gy. 1988. New algorithms based on a notion of "centre" (for systems of analytic inequalities) and on rational extrapolation. In Trends in Mathematical Optimization: Proceedings of the 4th French-German Conference on Optimization, Irsee 1986, K. H. Hoffmann, J.-B. Hiriart-Urruty, C. Lemaréchal, and J. Zowe, eds., ISNM vol. 84. Birkhäuser, Basel.
Sutherland, B. 1985. An introduction to the Bethe Ansatz. In Exactly Solvable Problems in Condensed Matter and Relativistic Field Theory, B. S. Shastry, S. S. Jha, and V. Singh, eds., p. 1. Lecture Notes in Physics 242. Springer-Verlag, Berlin.
Vaidya, P. M. 1990. A new algorithm for minimizing a convex function over convex sets. In Proceedings of the 30th Annual FOCS Symposium, Research Triangle Park, NC, 1989, pp. 338–343. IEEE Computer Society Press, Los Alamitos, CA.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. Theory of Probability and Its Applications 16, 264.
Watkin, T. 1993. Optimal learning with a neural network. Europhysics Letters 21, 871.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. 1960 IRE WESCON Convention Record. IRE, New York.
Received August 14, 1995; accepted April 24, 1996.
Communicated by Raymond Watrous
Partial BFGS Update and Efficient Step-Length Calculation for Three-Layer Neural Networks Kazumi Saito Ryohei Nakano NTT Communication Science Laboratories, 2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan
Second-order learning algorithms based on quasi-Newton methods have two problems. First, standard quasi-Newton methods are impractical for large-scale problems because they require N² storage space to maintain an approximation to an inverse Hessian matrix (N is the number of weights). Second, a line search to calculate a reasonably accurate step length is indispensable for these algorithms. In order to provide desirable performance, an efficient and reasonably accurate line search is needed. To overcome these problems, we propose a new second-order learning algorithm. The descent direction is calculated on the basis of a partial Broyden-Fletcher-Goldfarb-Shanno (BFGS) update with 2Ns memory space (s ≪ N), and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation to the objective function with respect to the step length. Our experiments, which use a parity problem and a speech synthesis problem, have shown that the proposed algorithm outperformed major learning algorithms. Moreover, it turned out that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, and a partial BFGS update greatly saves storage space without losing convergence performance.

1 Introduction

The backpropagation (BP) algorithm (Rumelhart et al. 1986) has been applied to various classes of problems, and its usefulness has been proved. However, even with a momentum term, this algorithm often requires a large number of iterations for convergence. Moreover, the user is required to determine appropriate parameters by trial and error. To overcome these drawbacks, a learning rate maximization method (LeCun et al. 1993), learning rate adaptation rules (Jacobs 1988; Silva and Almeida 1990), smart algorithms such as QuickProp (Fahlman 1988) and RPROP (Riedmiller and Braun 1993), and second-order learning algorithms (Watrous 1987; Barnard 1992; Battiti 1992; Møller 1993b) based on nonlinear optimization techniques (Gill et al. 1981) have been proposed. Each has achieved a certain degree of
success. Among these approaches, we believe that second-order learning algorithms should be investigated further because they theoretically have excellent convergence properties (Gill et al. 1981). At present, however, they have two problems. One is scalability. Second-order algorithms based on Levenberg-Marquardt or quasi-Newton methods cannot suitably scale up to large problems. Levenberg-Marquardt algorithms generally require a large amount of computation until convergence, even for midscale problems that involve many hundreds of weights. They require O(N²m) operations to calculate the descent direction during any one iteration; N denotes the number of weights and m the number of examples. Standard quasi-Newton algorithms (Watrous 1987; Barnard 1992) are rarely applied to large-scale problems that involve more than many thousands of weights because they require N² storage space to maintain an approximation to an inverse Hessian matrix. In order to cope with this problem, the OSS algorithm (Battiti 1989) adopted the memoryless update (Gill et al. 1981); however, OSS may not provide desirable performance because the descent direction is calculated on the basis of a fast but rough approximation. The other problem is the burden of step-length computation. A line search to calculate an adequate step length is indispensable for second-order algorithms that are based on quasi-Newton or conjugate gradient methods. Since an inaccurate line search may yield undesirable performance, a reasonably accurate line search is needed. However, an exact line search based on existing methods generally requires a nonnegligible computational load, at least a few function and/or gradient evaluations. Since it is widely recognized that successful convergence of conjugate gradient methods relies heavily on the accuracy of the line search, it seems rather difficult to increase the speed of conjugate gradient algorithms. In the SCG algorithm (Møller 1993b), although the step length is estimated by using the model trust region approach (Gill et al. 1981), this method can be regarded as an efficient one-step line search based on a one-sided difference equation. Fortunately, successful convergence of quasi-Newton algorithms is theoretically guaranteed even with an inaccurate line search if certain conditions are satisfied (Powell 1976), but efficient convergence cannot always be expected. Thus, we believe that if the step length can be calculated with greater efficiency and reasonable accuracy, quasi-Newton algorithms will work better in two respects: processing efficiency and convergence. For large-scale problems that include a large number of redundant training examples, on-line algorithms (LeCun et al. 1993), which perform a weight update for each single example, will work with greater efficiency than their off-line counterparts. Although second-order algorithms are basically not suited to on-line updating, waiting for a sweep of many examples before updating is not reasonable. Toward this improvement, several attempts have been made to introduce "pseudo-on-line" updating into second-order algorithms, where updates are performed on smaller subsets of data (Møller
1993c; Kuhn and Herzberg 1990). Thus, if a better second-order algorithm is developed, its pseudo-on-line version will work more efficiently. This paper is organized as follows. Section 2 describes a new second-order learning algorithm based on a quasi-Newton method, called BPQ, where the descent direction is calculated on the basis of a partial BFGS update and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation. Section 3 evaluates BPQ's performance in comparison with other major learning algorithms.

2 BPQ Algorithm

2.1 The Problem. Let $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a set of examples, where $x_t$ denotes an n-dimensional input vector and $y_t$ a target value corresponding to $x_t$. In a three-layer neural network, let h be the number of hidden units, $w_i$ (i = 1, . . . , h) the weight vector between all the input units and the hidden unit i, and $w_0 = (w_{00}, \ldots, w_{0h})^T$ the weight vector between all the hidden units and the output unit; $w_{i0}$ means a bias term, and $x_{t0}$ is set to 1. Note that $a^T$ denotes the transposed vector of a. Hereafter, a vector consisting of all parameters, $(w_0^T, \ldots, w_h^T)^T$, is simply expressed as Φ; let N (= nh + 2h + 1) be the dimension of Φ. Then, learning in the three-layer neural network can be defined as the problem of minimizing the following objective function:

$$f(\Phi) = \frac{1}{2}\sum_{t=1}^{m} (y_t - z_t)^2, \qquad (2.1)$$
where $z_t = z(x_t; \Phi) = w_{00} + \sum_{i=1}^{h} w_{0i}\,\sigma(w_i^T x_t)$, and $\sigma(u) = 1/(1 + e^{-u})$ is a sigmoidal function. Note that we do not employ a nonlinear transformation at the output unit because it is not essential from the viewpoint of function approximation.

2.2 Quasi-Newton Method. A second-order Taylor expansion of $f(\Phi + \Delta\Phi)$ about $\Delta\Phi$ is given as $f(\Phi) + (\nabla f(\Phi))^T \Delta\Phi + \frac{1}{2}(\Delta\Phi)^T \nabla^2 f(\Phi)\,\Delta\Phi$. If $\nabla^2 f(\Phi)$ is positive definite, the minimal point of this expansion is given by $\Delta\Phi = -(\nabla^2 f(\Phi))^{-1} \nabla f(\Phi)$. Newton techniques minimize the objective function f(Φ) by iteratively calculating the descent direction ΔΦ (Gill et al. 1981). However, since O(N³) operations are required to calculate $(\nabla^2 f(\Phi))^{-1}$ directly, we cannot expect these techniques to scale suitably to large problems. Quasi-Newton techniques, on the other hand, calculate a matrix H through iterations in order to approximate $(\nabla^2 f(\Phi))^{-1}$. The basic algorithm is described as follows (Gill et al. 1981):

Step 1: Initialize $\Phi_1$, set $H_1 = I$ (I: identity matrix), and set k = 1.
Step 2: Calculate the current descent direction: $\Delta\Phi_k = -H_k g_k$, where $g_k = \nabla f(\Phi_k)$.
Step 3: Terminate the iteration if a stopping criterion is satisfied.
Step 4: Calculate the step length $\lambda_k$ that minimizes $f(\Phi_k + \lambda\Delta\Phi_k)$.
Step 5: Update the weights: $\Phi_{k+1} = \Phi_k + \lambda_k \Delta\Phi_k$.
Step 6: If $k \equiv 0 \pmod N$, set $H_{k+1} = I$; otherwise, update $H_{k+1}$.
Step 7: Set k = k + 1, and return to Step 2.
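In code, the skeleton of Steps 1-7 might look as follows; f, grad, update_h, and line_search are placeholder callables, and the stopping tolerance is an assumption made for this sketch.

    import numpy as np

    def quasi_newton(f, grad, phi, update_h, line_search,
                     tol=1e-8, max_iter=10000):
        """Steps 1-7 as a loop. `update_h` implements an inverse-Hessian
        update such as equation 2.2; `line_search` implements Step 4."""
        N = phi.size
        H = np.eye(N)                               # Step 1
        for k in range(1, max_iter + 1):
            g = grad(phi)
            if np.linalg.norm(g) < tol:             # Step 3
                break
            d = -H @ g                              # Step 2
            lam = line_search(f, grad, phi, d)      # Step 4
            phi_next = phi + lam * d                # Step 5
            if k % N == 0:                          # Step 6: periodic restart
                H = np.eye(N)
            else:
                H = update_h(H, lam * d, grad(phi_next) - g)
            phi = phi_next                          # Step 7
        return phi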
2.3 Existing Methods for Calculating Descent Directions. Several methods for updating $H_{k+1}$ have been proposed. Among them, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update (Fletcher 1980) was the most successful in a number of studies. By putting $p_k = \lambda_k \Delta\Phi_k$ and $q_k = g_{k+1} - g_k$, the BFGS formula is given as

$$H_{k+1} = H_k - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{p_k^T q_k} + \left(1 + \frac{q_k^T H_k q_k}{p_k^T q_k}\right)\frac{p_k p_k^T}{p_k^T q_k}. \qquad (2.2)$$
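Equation 2.2 translates directly into a small function; the argument names follow the definitions above, and it could serve as the update rule in the quasi-Newton skeleton sketched in Section 2.2.

    import numpy as np

    def bfgs_update(H, p, q):
        """The BFGS inverse-Hessian update of equation 2.2, with
        p = lambda_k * dPhi_k and q = g_{k+1} - g_k."""
        pq = p @ q                    # p^T q
        Hq = H @ q                    # H_k q_k (H is symmetric)
        return (H
                - (np.outer(p, Hq) + np.outer(Hq, p)) / pq
                + (1.0 + (q @ Hq) / pq) * np.outer(p, p) / pq)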
In large-scale problems that involve more than many thousands of weights, maintaining the approximation matrix H becomes impractical because it requires N² storage space. In order to cope with this problem, the OSS (one-step secant) algorithm (Battiti 1989) adopted the memoryless BFGS update (Gill et al. 1981). By always taking the previous matrix $H_k$ as the identity matrix ($H_k = I$), the descent direction of Step 2 is calculated as

$$\Delta\Phi_{k+1} = -g_{k+1} + \frac{p_k q_k^T g_{k+1} + q_k p_k^T g_{k+1}}{p_k^T q_k} - \left(1 + \frac{q_k^T q_k}{p_k^T q_k}\right)\frac{p_k p_k^T g_{k+1}}{p_k^T q_k}. \qquad (2.3)$$

Clearly, equation 2.3 can be calculated with O(N) multiplications and storage space. However, OSS may not provide desirable performance because the descent direction is calculated on the basis of the above substitution. In our early experiments, this method worked rather poorly in comparison with the partial BFGS update proposed in the next section.

2.4 New Method for Calculating Descent Directions. In this section, we propose a partial BFGS update with 2Ns memory (s ≪ N), where the search directions are exactly equivalent to those of the original BFGS update during the first s + 1 iterations. The partiality parameter s means the length of the history described below. By putting $r_k = H_k q_k$, we have

$$H_k g_{k+1} = H_k q_k + H_k g_k = r_k - \frac{p_k}{\lambda_k}.$$

Then, the descent direction based on equation 2.2 can be calculated as
$$\Delta\Phi_{k+1} = -H_{k+1} g_{k+1} = -r_k + \frac{p_k}{\lambda_k} + \frac{p_k r_k^T g_{k+1} + r_k p_k^T g_{k+1}}{p_k^T q_k} - \left(1 + \frac{q_k^T r_k}{p_k^T q_k}\right)\frac{p_k p_k^T g_{k+1}}{p_k^T q_k}. \qquad (2.4)$$
After calculating $r_k$, equation 2.4 can be evaluated using O(N) multiplications and storage space, just like the memoryless BFGS update. Now we show that it is possible to calculate $r_k$ within O(Ns) multiplications and 2Ns storage space by using the partial BFGS update. Here, we assume k ≤ s. When k = 1, $r_1$ (= $H_1 q_1 = g_2 - g_1$) can be calculated by subtractions alone. When k > 1, we assume that each of $r_1, \ldots, r_{k-1}$ has been calculated and stored. Note that for i < k,

$$\alpha_i = \frac{1}{p_i^T q_i} \quad\text{and}\quad \beta_i = \alpha_i (1 + \alpha_i q_i^T r_i)$$

have already been calculated during the iterations. Thus, by recursively applying equation 2.2, $r_k$ can be calculated with O(Nk) multiplications and 2Nk storage space, as follows:

$$r_k = H_k q_k = H_{k-1} q_k - \alpha_{k-1} p_{k-1} r_{k-1}^T q_k - \alpha_{k-1} r_{k-1} p_{k-1}^T q_k + \beta_{k-1} p_{k-1} p_{k-1}^T q_k$$
$$= q_k + \sum_{i=1}^{k-1}\left(-\alpha_i p_i r_i^T q_k - \alpha_i r_i p_i^T q_k + \beta_i p_i p_i^T q_k\right). \qquad (2.5)$$
Next, when the number of iterations exceeds s + 1, we have two alternatives: restarting the update by discarding the accumulated vectors or continuing the update by using the latest vectors. In both cases, equation 2.5 can be calculated within O(Ns) multiplications and 2Ns storage space. Therefore, it has been shown that equation 2.4 can be calculated within O(Ns) multiplications and 2Ns storage space. In our experiments, the former update was employed because the latter worked poorly when s was small. Although the idea of partial quasi-Newton methods has been briefly described before (Luenberger 1984), strictly speaking, our update is different. The earlier proposal intended to store the vectors $p_i$ and $q_i$, but our update stores the vectors $p_i$ and $r_i$. The immediate advantage of this partial update over the original BFGS update is its applicability to large-scale problems. Even if N is very large, by setting s to an adequately small value with respect to the amount of available storage space, our update will work. On the other hand, the probable advantage over the memoryless BFGS update is its superior convergence property. However, this claim should be examined through a wide range of experiments. Note that when s = 0, the partial BFGS update always gives the gradient direction; when s = 1, it corresponds to the memoryless BFGS update.
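A minimal sketch of equations 2.4 and 2.5 follows, using the restarting variant employed in the experiments. The history layout (p_i, r_i, α_i, β_i) follows the text, while the function names and argument order are assumptions.

    import numpy as np

    def r_vector(q, history):
        """Equation 2.5: r_k = H_k q_k, accumulated from the stored tuples
        (p_i, r_i, alpha_i, beta_i), i < k, starting from H_1 = I."""
        r = q.copy()
        for p_i, r_i, a_i, b_i in history:
            r += (-a_i * (r_i @ q)) * p_i \
                 + (-a_i * (p_i @ q)) * r_i \
                 + (b_i * (p_i @ q)) * p_i
        return r

    def partial_bfgs_direction(g_next, p, q, lam, history, s):
        """Equation 2.4: the next descent direction -H_{k+1} g_{k+1}
        in O(Ns) work and 2Ns storage. The history is restarted
        (cleared) once it holds s entries."""
        if len(history) >= s:
            history.clear()               # restart: discard accumulated vectors
        r = r_vector(q, history)
        a = 1.0 / (p @ q)                 # alpha_k
        b = a * (1.0 + a * (q @ r))       # beta_k
        history.append((p, r, a, b))
        return (-r + p / lam
                + a * ((r @ g_next) * p + (p @ g_next) * r)
                - b * (p @ g_next) * p)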
2.5 Existing Methods for Calculating Step Lengths. In Step 4, since λ is the only variable in f, we can express $f(\Phi + \lambda\Delta\Phi)$ simply as ζ(λ). Calculating the λ that minimizes ζ(λ) is called a line search. Among a number of possible line search methods, we considered the following three typical ones: a fast but inaccurate one, a moderately accurate one, and an exact but slow one. The first method is explained below. By using quadratic interpolation (Gill et al. 1981; Battiti 1989), we can derive a fast line search method that only guarantees ζ(λ) < ζ(0). Let $\lambda_1$ be an initial value for the step length. If $\zeta(\lambda_1) < \zeta(0)$, $\lambda_1$ becomes the resultant step length; otherwise, by considering a quadratic function h(λ) that satisfies the conditions $h(0) = \zeta(0)$, $h(\lambda_1) = \zeta(\lambda_1)$, and $h'(0) = \zeta'(0)$, we get the following approximation of ζ(λ):

$$\zeta(\lambda) \approx h(\lambda) = \zeta(0) + \zeta'(0)\lambda + \frac{\zeta(\lambda_1) - \zeta(0) - \zeta'(0)\lambda_1}{\lambda_1^2}\,\lambda^2.$$
Since $\zeta(\lambda_1) \geq \zeta(0)$ and $\zeta'(0) < 0$, the minimal point of h(λ) is given by

$$\lambda_2 = -\frac{\zeta'(0)\,\lambda_1^2}{2(\zeta(\lambda_1) - \zeta(0) - \zeta'(0)\lambda_1)}. \qquad (2.6)$$
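As a sketch, this fast line search simply iterates equation 2.6; zeta is a placeholder callable evaluating ζ(λ), and zeta_prime0 is ζ′(0) < 0, both assumed supplied by the caller.

    def fast_line_search(zeta, zeta_prime0, lam=1.0):
        """Fast but inaccurate line search: repeat the quadratic
        interpolation of equation 2.6 until zeta(lam) < zeta(0)."""
        z0 = zeta(0.0)
        zl = zeta(lam)
        while zl >= z0:
            # minimal point of the interpolating quadratic h (eq. 2.6);
            # equation 2.6 guarantees 0 < new lam < old lam
            lam = -zeta_prime0 * lam * lam / (2.0 * (zl - z0 - zeta_prime0 * lam))
            zl = zeta(lam)
        return lam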
Note that $0 < \lambda_2 < \lambda_1$ is guaranteed by equation 2.6. Thus, by iterating this process until $\zeta(\lambda_\nu) < \zeta(0)$, we can always find a $\lambda_\nu$ that satisfies $\zeta(\lambda_\nu) < \zeta(0)$, where ν denotes the number of iterations. Here, the initial value $\lambda_1$ is set to 1 because the optimal step length is near 1 when H closely approximates $(\nabla^2 f(\Phi))^{-1}$. Hereafter, a quasi-Newton algorithm based on the original or partial BFGS update in combination with this fast but inaccurate line search is called BFGS1. By using quadratic extrapolation (Fletcher 1980), we can derive a moderately accurate line search method that guarantees $\zeta'(\lambda_\nu) > \gamma_1 \zeta'(\lambda_{\nu-1})$ as well as $\zeta(\lambda_\nu) < \zeta(\lambda_{\nu-1})$, where $\gamma_1$ is a small constant (e.g., 0.1). Namely, when $\lambda_\nu$ does not satisfy the stopping criterion, if $\zeta(\lambda_\nu) \geq \zeta(\lambda_{\nu-1})$, $\lambda_{\nu+1}$ is calculated using equation 2.6; otherwise, by considering an extrapolation of the slopes $\zeta'(\lambda_{\nu-1})$ and $\zeta'(\lambda_\nu)$, the appropriate expression for the minimizing value $\lambda_{\nu+1}$ is

$$\lambda_{\nu+1} = \lambda_\nu - \zeta'(\lambda_\nu)\,\frac{\lambda_\nu - \lambda_{\nu-1}}{\zeta'(\lambda_\nu) - \zeta'(\lambda_{\nu-1})}. \qquad (2.7)$$
However, if $\zeta'(\lambda_\nu) \leq \zeta'(\lambda_{\nu-1})$, then $\lambda_{\nu+1}$ does not exist; thus, $\lambda_{\nu+1}$ is set to $\lambda_{\nu-1} + \gamma_2(\lambda_\nu - \lambda_{\nu-1})$, where $\gamma_2$ is an adequate value (e.g., 9). In this method, $\lambda_0$ is set to 0, and $\lambda_1$ is estimated as $\min(1, -2\zeta'(0)^{-1}(f(\Phi_{\mathrm{current}}) - f(\Phi_{\mathrm{previous}})))$. Note that using this estimate of $\lambda_1$ for the first method is not a good strategy, because the first method has no extrapolation process and $\lambda_1$ can be a very small value. Hereafter, a quasi-Newton algorithm based on the original or partial
BFGS update in combination with this moderately accurate line search is called BFGS2. An exact but slow line search method can be constructed by iteratively using the BFGS2 line search method until a stopping criterion is met, for example, $|\zeta'(\lambda_\nu)| < 10^{-8}$. The difference from the BFGS2 method is that equation 2.7 is used for interpolation as well as extrapolation. Hereafter, a quasi-Newton algorithm based on the original or partial BFGS update in combination with this exact but slow line search is called BFGS3.

2.6 New Method for Calculating Step Lengths. Here we propose a new method for calculating a reasonably accurate step length λ in Step 4.

2.6.1 Basic Procedure. A second-order Taylor approximation of ζ(λ) is given as

$$\zeta(\lambda) \approx \zeta(0) + \zeta'(0)\lambda + \frac{1}{2}\zeta''(0)\lambda^2.$$

When $\zeta'(0) < 0$ and $\zeta''(0) > 0$, the minimal point of this approximation is given by

$$\lambda = -\frac{\zeta'(0)}{\zeta''(0)}\;\left(= -\frac{(\nabla f(\Phi))^T \Delta\Phi}{(\Delta\Phi)^T \nabla^2 f(\Phi)\,\Delta\Phi}\right). \qquad (2.8)$$
Other cases will be considered in the next section. For the three-layer neural networks defined by equation 2.1, we can efficiently calculate ζ′(0) and ζ″(0) as follows. By differentiating ζ(λ) and substituting 0 for λ, we obtain

$$\zeta'(0) = -\sum_{t=1}^{m}(y_t - z_t)z_t' \quad\text{and}\quad \zeta''(0) = \sum_{t=1}^{m}\left((z_t')^2 - (y_t - z_t)z_t''\right).$$

Since the derivative of $z_t = z(x_t; \Phi)$ is defined as $\frac{d}{d\lambda}z(x_t; \Phi + \lambda\Delta\Phi)\big|_{\lambda=0}$, we obtain

$$z_t' = \Delta w_{00} + \sum_{i=1}^{h}(\Delta w_{0i}\sigma_{it} + w_{0i}\sigma_{it}') \quad\text{and}\quad z_t'' = \sum_{i=1}^{h}(2\Delta w_{0i}\sigma_{it}' + w_{0i}\sigma_{it}''),$$

where $\sigma_{it} = \sigma(w_i^T x_t)$, $\sigma_{it}' = \sigma_{it}(1 - \sigma_{it})(\Delta w_i)^T x_t$, $\sigma_{it}'' = \sigma_{it}'(1 - 2\sigma_{it})(\Delta w_i)^T x_t$, and $\Delta w_{ij}$ denotes the change of $w_{ij}$ calculated in Step 2. Now we consider the computational complexity of calculating the step length using equation 2.8. Clearly, $(\Delta w_i)^T x_t$ must be calculated for each pair of hidden unit i and input $x_t$; thus, since the number of hidden units is h and the number of inputs is m, at least nhm multiplications are required. Since the order of multiplications required to calculate the remainder is O(hm), and N = nh + 2h + 1, the total complexity of the calculation is Nm + O(hm).
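As a sketch of this computation (the argument layout is an assumption, not from the paper): X carries the bias column, and the single matrix product realizing $(\Delta w_i)^T x_t$ is the nhm-multiplication term discussed above. The returned value is the step length of equation 2.8, assuming ζ′(0) < 0 and ζ″(0) > 0; the undesirable cases are handled in the next subsection.

    import numpy as np

    def bpq_step_length(X, y, w0, W, dw0, dW):
        """Step length of equation 2.8 for the three-layer net of
        equation 2.1. X is m x (n+1) with a leading column of ones;
        w0 = (w00,...,w0h) are hidden-to-output weights; W is the
        h x (n+1) input-to-hidden matrix; dw0, dW are the corresponding
        components of the search direction dPhi."""
        s = 1.0 / (1.0 + np.exp(-(X @ W.T)))    # sigma_it, shape m x h
        da = X @ dW.T                           # (dw_i)^T x_t: nhm mults, once
        s1 = s * (1.0 - s) * da                 # sigma'_it
        s2 = s1 * (1.0 - 2.0 * s) * da          # sigma''_it
        z = w0[0] + s @ w0[1:]                  # network outputs z_t
        z1 = dw0[0] + s @ dw0[1:] + s1 @ w0[1:] # z'_t
        z2 = 2.0 * (s1 @ dw0[1:]) + s2 @ w0[1:] # z''_t
        res = y - z
        zp = -(res @ z1)                        # zeta'(0)
        zpp = z1 @ z1 - res @ z2                # zeta''(0)
        return -zp / zpp                        # equation 2.8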
This algorithm can be generalized to multilayered networks with other differentiable activation functions; thus, it is also applicable to recurrent networks (Rumelhart et al. 1986). Consider a hidden unit τ connected to the output layer. We assume that its output value is defined by $v = a(w^T u)$, where u is the vector of output values of the units connected to the unit τ, w equals the weights attached to these connections, and a(·) is an activation function. Then the first- and second-order derivatives of v with respect to λ are calculated as

$$v' = a'(w^T u)(\Delta w^T u + w^T u'),$$
$$v'' = a''(w^T u)(\Delta w^T u + w^T u')^2 + a'(w^T u)(2\Delta w^T u' + w^T u'').$$

Thus, by successively applying these formulas in reverse order, we can calculate the step length for multilayered networks. Note that if $u_i$ is the output value of an input unit, then $u_i' = u_i'' = 0$.

2.6.2 Coping with Undesirable Cases. In the above, we assumed ζ′(0) < 0. When ζ′(0) > 0, the value of the objective function cannot be reduced along the search direction; thus, we set $\Delta\Phi_k$ to $-\nabla f(\Phi_k)$ and restart the update by discarding the accumulated vectors (p, r). Note that ζ′(0) < 0 is guaranteed by such a setting unless $\|\nabla f(\Phi_k)\| = 0$, because $\zeta'(0) = (\nabla f(\Phi_k))^T \Delta\Phi_k = -\|\nabla f(\Phi_k)\|^2 < 0$. When ζ′(0) < 0 and ζ″(0) ≤ 0, equation 2.8 gives a negative value or infinity. To avoid this situation, we employ the Gauss-Newton technique. The first-order approximation of $z_t = z(x_t; \Phi + \lambda\Delta\Phi)$ is $z_t + z_t'\lambda$. Then, ζ(λ) of the next iteration can be approximated by

$$\zeta(\lambda) \approx \frac{1}{2}\sum_{t=1}^{m}\left(y_t - (z_t + z_t'\lambda)\right)^2 = \zeta(0) + \zeta'(0)\lambda + \frac{1}{2}\sum_{t=1}^{m}(z_t')^2\,\lambda^2.$$
The minimal point of this approximation is given by

$$\lambda = -\frac{\zeta'(0)}{\sum_{t=1}^{m}(z_t')^2}. \qquad (2.9)$$
Clearly, equation 2.9 always gives a positive value when ζ′(0) < 0. In many cases, it is useful in a practical sense to limit the maximum change in Φ during any one iteration (Gill et al. 1981). Here, if $\|\lambda\Delta\Phi\| > 1.0$, λ is set to $\|\Delta\Phi\|^{-1}$. Since λ is calculated on the basis of an approximation, we cannot always reduce the value of the objective function ζ(λ). When ζ(λ) ≥ ζ(0), we employ the fast line search given by equation 2.6.

2.6.3 Summary of Step-Length Calculation. By integrating the above procedures, we can specify Step 4 as follows.
Step 4.1: If ζ′(0) > 0, set $\Delta\Phi_k = -\nabla f(\Phi_k)$ and set k = 1.
Step 4.2: If ζ″(0) > 0, calculate λ using equation 2.8; otherwise, calculate λ using equation 2.9.
Step 4.3: If $\|\lambda\Delta\Phi_k\| > 1.0$, set $\lambda = \|\Delta\Phi_k\|^{-1}$.
Step 4.4: If ζ(λ) > ζ(0), recalculate λ using equation 2.6 until ζ(λ) < ζ(0).

Hereafter, the quasi-Newton algorithm based on our partial BFGS update in combination with our step-length calculation is called BPQ (BP based on partial Quasi-Newton). Incidentally, a modified OSS algorithm, OSS2 (a combination of the memoryless BFGS update and our step-length calculation), may be another good algorithm and will be evaluated in the experiments.

2.7 Computational Complexity. We consider the computational complexity of BPQ and other algorithms with respect to one iteration in which every training example is presented once. In off-line BP, the complexity (number of multiplications) to calculate the objective function is nhm + O(hm), and the complexity for the gradient vector is nhm + O(hm). Thus, since N = nh + 2h + 1, the complexity for off-line BP is 2Nm + O(hm). Here, the complexity for a weight update is just N (or 2N if a momentum term is used) and is safely negligible. In this article, we define one iteration of on-line BP as m updates sweeping all examples once. In addition to the above complexity, since on-line BP performs a weight update for each single example, the learning rate is multiplied into each element of the gradient vectors Nm times in one iteration; thus, the complexity for on-line BP is 3Nm + O(hm). In the case of on-line BP with a momentum term, the momentum factor is also multiplied into each element of the previous modification vectors Nm times in one iteration; thus, the complexity for on-line momentum BP is 4Nm + O(hm). In addition to the complexity for off-line BP, BPQ calculates the descent direction based on the partial BFGS update with a history of at most s iterations and also calculates the step length. The complexity of the former calculation is O(Ns) (see Section 2.4) and that of the latter is Nm + O(hm) (see Section 2.6.1). Here, note that the computational complexity to calculate the objective function can be reduced from Nm + O(hm) to O(hm) for three-layer networks. This is because in the next iteration, the output value of each hidden unit is given by $\sigma(w_i^T x_t + \lambda(\Delta w_i)^T x_t)$, but $(\Delta w_i)^T x_t$ is already calculated when the step length is calculated. Thus, the total complexity for BPQ is 2Nm + O(Ns) + O(hm). To reduce the generalization error for an unseen example, m should be larger than N. Since s is smaller than N, the complexity O(Ns) usually becomes much smaller than 2Nm, and the complexity for BPQ remains almost equivalent to that of off-line BP.
132
Kazumi Saito and Ryohei Nakano
the denominator is calculated by using an inner product. The result is mathematically the same as our step-length calculation, but the computational complexity of this method is much larger than that of our method, at least in the case of three-layer networks as shown below. By using Pearlmutter’s ∂ f (8+λ1Φ)|λ=0 (Pearlmutter operator, which is defined by <1Φ { f (Φ)} = ∂λ ∂ 1994), we can see that <1Φ { ∂wij f (Φ)} is an element of ∇ 2 f (Φ)1Φ because <1Φ {∇ f (Φ)} = ∇ 2 f (Φ)1Φ. Now, considering the case of a weight between the output unit and the hidden unit i, which is expressed as w0i , we obtain ( ) ¾ ½ m X ∂ f (Φ) = <1Φ − (yt − zt )σit <18 ∂w0i t=1 =
m X (<1Φ {zt }σit − (yt − zt )<18 {σit }), t=1
P where <1Φ {zt } = 1w00 + hi=1 (1w0i σit + w0i <1Φ {σit }) and <1Φ {σit } = σit (1 − σit )(1wi )T xt . Clearly, just like our method, (1wi )T xt must be calculated for each pair of hidden unit i and input xt ; thus, a minimum of nhm multiplications is required. Additionally, a complexity of at least O(m) is required for calculating <18 { ∂w∂ 0i f (Φ)}. Therefore, since a similar argument can be applied to other weights and the total number of the weights is N, the complexity of the existing method becomes nhm + O(Nm). On the other hand, the complexity of our method is nhm + O(hm), which is more efficient because N = nh + 2h + 1. Although the SCG (scaled conjugate gradient) algorithm (Møller 1993b) estimates the step length by using the model trust region approach, this method can be regarded as an efficient one-step line search method based on a one-sided difference equation. The step length calculated from equation 2.8 is approximated by ∇ 2 f (Φ)1Φ ≈
∇ f (Φ + δ1Φ) − ∇ f (Φ) , δ
where δ is defined as δ = δ0 k1Φk−1 and δ0 is a small constant. Clearly, the complexity of SCG is approximately double that of BP because the gradient vector ∇ f (Φ + δ1Φ) must be calculated during one-iteration. If the difference parameter δ is approximately equal to the optimal step-length λ, SCG will provide a more faithful picture of the error surface than our method; however, since our purpose is to estimate the optimal step-length λ, we generally do not know such δ. 3 Experiments This section evaluates BPQ’s performance in comparison with many other learning algorithms through experiments that use three types of problems.
Figure 1: Two-weight problem.
3.1 Two-Weight Problem. In order to evaluate learning efficiency graphically, we designed a simple learning problem in which only two weights are adjustable. Figure 1 describes the problem. In a three-layer network, each layer consists of only one unit, and the weight between the hidden unit and the output unit is fixed at 1. The weight between the input and hidden units is denoted w₁, the weight corresponding to the bias term at the output unit is denoted w₂, and the activation function of the hidden unit is σ(x) = 1/(1 + exp(−x)). Each target value y_t shown in Figure 1 was calculated from the corresponding input value x_t by setting (w₁, w₂) = (1, 0); thus, the global minimum point is given by these weight values. In this experiment, BPQ was compared with eight other learning algorithms: on-line BP with momentum term, off-line BP, off-line BP with momentum term, BP with learning rate adaptation (Silva and Almeida 1990), SCG (Møller 1993b), BFGS1, BFGS2, and BFGS3. Note that the memoryless BFGS update is equivalent to the original BFGS update when N = 2. The parameters of each algorithm were determined by trial and error or set to the values recommended by their inventors (Silva and Almeida 1990; Møller 1993b). For on-line momentum BP, the learning rate and the momentum factor were set to η = 0.05 and α = 0.9. For off-line BP, the learning rate η was set to 0.1. For off-line momentum BP, the learning rate and the momentum factor were set to η = 0.025 and α = 0.9. For adaptive BP, the increase and decrease factors were set to u = 1.1 and d = u⁻¹, and the initial update value η₀ was set to 0.01. For SCG, the constant δ₀ was set to 10⁻⁴, and the initial scaling parameter λ₁ was set to 10⁻⁶. Figures 2 and 3 show the learning trajectories on the error surface with respect to w₁ and w₂ during 100 iterations starting at (w₁, w₂) = (−0.5, 0), where MSE stands for the mean squared error:

MSE = (1/5) Σ_{t=1}^{5} (y_t − (σ(w₁ x_t) + w₂))².
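This surface is simple to reproduce. A minimal sketch follows; the five inputs x_t appear only in Figure 1, so the values used here are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # placeholder inputs
ys = sigmoid(1.0 * xs) + 0.0                  # targets from (w1, w2) = (1, 0)

def mse(w1, w2):
    """Mean squared error of the two-weight network over the examples."""
    return np.mean((ys - (sigmoid(w1 * xs) + w2)) ** 2)

print(mse(-0.5, 0.0))   # error at the trajectories' starting point
print(mse(1.0, 0.0))    # 0.0 at the global minimum
```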
(These experiments were done on a Power Macintosh 8100/100 computer.)
Figure 2: Learning trajectories of first-order algorithms.
Figure 2 shows that the search by these first-order algorithms proceeded very slowly along the curved valley floor. Within 100 iterations, none of the first-order algorithms was able to reach the minimum point. This reflects an inherent difficulty of first-order methods: since the directions of successive gradient vectors are almost opposite near the valley floor, even learning rate adaptation based on sign changes of the last two gradients cannot improve the learning efficiency. Note that the BP trajectories oscillate widely when a larger η is used. Figure 3, on the other hand, shows that SCG, BFGS1, BFGS2, BFGS3, and BPQ all found the minimum point within 100 iterations. In comparison with BPQ, SCG required a larger number of iterations before reaching the minimum point. BFGS1 required the largest number of iterations, showing that the accuracy of the step-length calculation is important for solving this problem. BFGS2 worked very well, almost identically to BPQ. The number of iterations for BFGS3 was slightly smaller than that for BPQ, but BFGS3 required much more computation time due to its exact line search. These
Figure 3: Learning trajectories of second-order algorithms.
experimental results indicate that second-order methods work well even when the error surface forms a curved valley floor.

3.2 Parity Problem. Since parity problems have been widely used as benchmarks, the eight-bit parity problem was used to compare BPQ with seven other algorithms: on-line momentum BP, adaptive BP, SCG, OSS2, BFGS1, BFGS2, and BFGS3. Since the scale of the problem was quite small, BPQ employed the original BFGS update; thus s = N. In the experiments, we used a three-layer network with eight hidden units (h = 8, N = 82).
Figure 4: Convergence for the 8-bit parity problem.
All possible input patterns were used as training examples (m = 256), and the target values were set to 1 or 0. The parameters of each algorithm were exactly the same as in the previous experiment, except that the learning rate η was set to 0.1 for on-line momentum BP. In all the algorithms, the initial values for the weights were randomly generated in the range [−1, 1]. In the experiments, the maximum CPU processing time was set to 100 seconds, and the iteration was terminated when ‖∇f(w)‖ < 10⁻⁸. (These experiments were done on an HP9000/735 computer.) Figure 4a shows the convergence property of BPQ and the first-order algorithms with respect to the average RMSE (root mean squared error, √(2f(w)/m)) and the standard deviation over 100 trials for each algorithm. Among these algorithms, BPQ converged the quickest, allowing us to stop the iteration at an early stage. Figure 4b compares the convergence property of BPQ with that of the second-order algorithms. In comparison with BPQ, the convergence of the other second-order algorithms was slow. This shows that the step-length calculation plays an important role in quasi-Newton methods.
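As a small illustrative sketch (our variable names, not the authors'), the complete training set just described can be generated as follows:

```python
import numpy as np
from itertools import product

# All 2**8 = 256 input patterns of the 8-bit parity problem,
# with target 1 for odd parity and 0 for even parity.
X = np.array(list(product([0, 1], repeat=8)), dtype=float)
y = X.sum(axis=1) % 2   # m = 256 examples
```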
3.3 Practical Problem. In order to evaluate how BPQ works on large-scale problems, we performed experiments on a speech synthesis problem, using a data set employed for parrot-like speaking (Nakano et al. 1995). This learning problem is a sort of nonlinear autoregression; the input vector consists of a feature vector (10 values) and an acoustic wave (8 values), and the target output value is the next value of the acoustic wave; the total number of input units is 18 (n = 18). In these experiments, the number of hidden units was set to 36 (h = 36); thus, the number of weights was 721 (N = 721). Using 12,800 training examples (m = 12,800), BPQ was compared with eight other algorithms: on-line BP, on-line momentum BP, adaptive BP, SCG, OSS2, BFGS1, BFGS2, and BFGS3. In all the algorithms, the initial values for the weights between the input and hidden units were randomly generated in the range [−1, 1]. The values of the weights between the hidden and output units were set to 0, except that the bias value at the output unit was set to the average output value over all training examples. In the experiments, the maximum number of iterations was set to 100, and the average RMSE was used to evaluate the convergence property. (These experiments were also done on an HP9000/735 computer.) Figure 5a compares BPQ with the first-order algorithms, using the convergence property with respect to the average RMSE and standard deviation over 10 trials for each algorithm. Note that the standard deviation for adaptive BP is not shown because it was always 0. For on-line BP, the learning rate η was set to 1.0, the best value in our experiments; even so, the learning curve oscillated widely. For on-line momentum BP, whose momentum factor α was fixed at 0.9, the learning rate η was set to 0.1, the best value in our experiments; although the learning curve did not oscillate widely, the convergence property was very similar to that of on-line BP. For adaptive BP, the learning rate η was set to 0.1 or 1.0; both values gave similar and rather poor results. For BPQ, the parameter s was set to 50; its convergence was the quickest, allowing us to stop the iteration at an early stage. Figure 5b compares BPQ with the other second-order algorithms, using the average values over 10 trials for each algorithm. In comparison to BPQ, BFGS1 worked rather poorly; BFGS3 required more processing time to reach the best RMSE level of BPQ, while BFGS2 worked better than BFGS3, coming rather close to BPQ. Figure 5c compares the average one-iteration processing time of all the algorithms. The one-iteration time of OSS2 or BPQ was almost equivalent to that of adaptive BP. On-line BP was about 1.3 times slower due to weight updates after each single example; on-line momentum BP was almost twice as slow due to the additional multiplication by the momentum factor. BFGS1 required about half again as much processing time, since it sometimes needed a few extra function evaluations in order to shorten the step lengths; BFGS2 was almost four times slower due to an additional line search calculation needed to meet its stopping criterion; BFGS3 was almost seven times slower due to its exact line search calculation. SCG was almost twice as slow due to an additional gradient calculation.
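The input-output layout described at the beginning of this subsection can be sketched as follows (a hedged illustration: the exact windowing used by Nakano et al. 1995 is not specified here, so this is only one plausible reading):

```python
import numpy as np

def make_examples(features, wave, lag=8):
    """Nonlinear autoregression layout: each input concatenates a 10-value
    feature vector with the `lag` most recent wave samples; the target is
    the next wave sample. `features` is (T, 10), `wave` is (T,)."""
    X, y = [], []
    for t in range(lag, len(wave)):
        X.append(np.concatenate([features[t], wave[t - lag:t]]))
        y.append(wave[t])
    return np.array(X), np.array(y)
```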
Figure 5: Convergence for a practical problem.
Figure 6 shows the influence of the partiality parameter s on the convergence property for each of the four line search algorithms. Each curve is drawn using the average RMSE over 10 trials. In general, this influence was small compared with the differences among the line search methods. BPQ with s = 5 worked as well as with s = 50, while BPQ with s = 2 worked poorly compared with BPQ with s ≥ 5. Consequently, it was shown that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, while the partial BFGS update with a proper s greatly saves storage space without losing convergence performance.
Figure 6: Effect of partiality parameter s.
4 Conclusion

We have proposed a small-memory, efficient second-order learning algorithm called BPQ for three-layer neural networks. In BPQ, the search direction is calculated on the basis of a partial BFGS update, and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation. In our experiments with a two-weight problem, a parity problem, and a speech synthesis problem, BPQ worked better than the other major learning algorithms. Moreover, it turned out that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, while the partial BFGS update greatly saves storage space without losing convergence performance. In the future, we plan to carry out further comparisons using a wider variety of problems, including large-scale classification problems.
Acknowledgments

We thank the anonymous referees for many helpful suggestions and comments.
References

Barnard, E. 1992. Optimization for training neural nets. IEEE Trans. Neural Networks 3(2), 232–240.
Battiti, R. 1989. Accelerating back-propagation learning: Two optimization methods. Complex Systems 3(4), 331–342.
Battiti, R. 1992. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comp. 4(2), 141–166.
Fahlman, S. 1988. Faster-learning variations on back-propagation: An empirical study. In Proceedings of the 1988 Connectionist Models Summer School, pp. 38–51. San Mateo, CA.
Fletcher, R. 1980. Practical Methods of Optimization. Vol. 1. John Wiley, New York.
Gill, P., Murray, W., and Wright, M. 1981. Practical Optimization. Academic Press, London.
Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1(4), 295–307.
Kuhn, G., and Herzberg, P. 1990. Some variations on training recurrent neural networks. In Proceedings of CAIP Neural Networks Workshop, pp. 15–17, Rutgers University, New Brunswick, NJ.
LeCun, Y., Simard, P., and Pearlmutter, B. 1993. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. Giles, eds., pp. 156–163. Morgan Kaufmann, San Mateo, CA.
Luenberger, D. 1984. Linear and Nonlinear Programming. Addison-Wesley, Reading, MA.
Møller, M. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(N) time. Tech. Rep. DAIMI PB-432, Computer Science Department, Aarhus University.
Møller, M. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533.
Møller, M. 1993c. Supervised learning on large redundant training sets. International J. Neural Systems 4(1), 15–25.
Nakano, R., Ueda, N., Saito, K., and Yamada, T. 1995. Parrot-like speaking using optimal vector quantization. In Proc. IEEE International Conference on Neural Networks, Perth, Australia.
Pearlmutter, B. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147–160.
Powell, M. 1976. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. In Nonlinear Programming, SIAM-AMS Proceedings, Vol. 9, Providence, RI.
Riedmiller, M., and Braun, H. 1993. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on Neural Networks, San Francisco.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. Rumelhart and J. McClelland, eds., pp. 318–362. MIT Press, Cambridge, MA.
Silva, F., and Almeida, L. 1990. Speeding up backpropagation. In Advanced Neural Computers, R. Eckmiller, ed., pp. 151–160. North-Holland, Amsterdam.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. In Proc. IEEE International Conference on Neural Networks, pp. II-619–627, San Diego, CA.
Received May 22, 1995; accepted April 8, 1996.
Communicated by Frederico Girosi
Neural Networks for Functional Approximation and System Identification

H. N. Mhaskar
Department of Mathematics, California State University, Los Angeles, CA 90032 USA
Nahmwoo Hahm
Department of Mathematics, University of Texas at Austin, Austin, TX 78712 USA
We construct generalized translation networks to approximate uniformly a class of nonlinear, continuous functionals defined on L^p([−1, 1]^s) for integer s ≥ 1, 1 ≤ p < ∞, or on C([−1, 1]^s). We obtain lower bounds on the possible order of approximation for such functionals in terms of any approximation process depending continuously on a given number of parameters. Our networks almost achieve this order of approximation in terms of the number of parameters (neurons) involved in the network. The training is simple and noniterative; in particular, we avoid any optimization such as that involved in the usual backpropagation.

1 Introduction

One of the important applications of neural networks is to approximate functions defined on a compact subset of a Euclidean space in a highly parallel manner. (We give precise definitions in Section 2.) Following the well-known initial results of Cybenko (1989), Hornik et al. (1989), Park and Sandberg (1991), and Barron (1993), among others, there has been a great deal of activity in this area in recent years. Motivated by the work of Girosi et al. (1995), Mhaskar and Micchelli (1995) introduced the notion of generalized translation networks, unifying the themes of neural networks, radial basis function networks, and the generalized regularization networks of Girosi et al. (1995). They studied in detail how the choice of the activation function affects the degree of approximation of the target function. Subsequently, Mhaskar (1996) constructed generalized translation networks utilizing specific activation functions to obtain an optimal order of approximation when the only a priori information about the target function is that it belongs to some Sobolev class. In this article, we obtain analogous results when a generalized translation network is used to approximate functionals on compact subsets of an L^p space rather than functions of finitely many real variables. The problem of approximating a functional arises naturally in the study of identification and control of nonlinear systems. In an identification problem, the hidden
states of a nonlinear system are not known, and one wishes to develop a model merely by observing the input-output relationships of such a system (cf. Levin and Narendra 1995 for a recent discussion). The output at any given time is a functional of the input signal, which is a function of time. Similarly, a control problem can be thought of as a functional, which at any given time maps the input control signals to an output signal. The actual system itself is often unknown, and a simple model is desirable (cf. Sontag N.d. for a recent review). Narendra and Parthasarathy (1990) have proposed different models that can utilize the approximation capabilities of neural networks for system identification and control. Levin and Narendra (1995) have studied neural networks for the identification of systems using only observations on the input-output relationships of the system. In both of these papers, the output signal at any time is considered as a function of finitely many samples of the input and output signals, typically taken prior to the moment in question. A different approach has been suggested by Sandberg (1991b), where the system is thought of directly as a functional acting on the input signal, and an NL system is used as a model for this functional. Here, N is a static neural network, L has a simple representation in terms of bounded linear functionals, and the model is the composite map N ◦ L. Sandberg has shown that any real continuous functional on any compact subset of a real normed linear space can be uniformly approximated in this way and has given several related results concerning, for example, the choice of the linear functional L that can be used. He has also studied the case when radial (or elliptic) basis function networks are used instead of the conventional neural networks. Further research in this direction has also been done by Chen and Chen (1993, 1995a, 1995b), Dingankar (1995), and Modha and Hecht-Nielsen (1993). In this article, we investigate the question of obtaining bounds on the size of the networks involved in terms of the desired accuracy of approximation. As expected, the bound will depend on the properties of the compact subset of the function space on which the functional acts and also on the structural properties of the functional itself. Typically, both of these are likely to be unknown. Therefore, one is led to the notion of universal approximation— approximating the functional under minimal a priori assumptions on the functional and the input signals on which it acts. We prove that there are some inherent lower bounds for the accuracy of such universal approximation. We will also construct generalized translation networks that almost achieve this lower bound. The networks to be constructed are very general and include as special cases gaussian networks, thin-plate spline networks, neural networks with the hyperbolic tangent activation function, and a variety of other commonly used networks. Although our focus is to obtain a theoretical insight into the problem, our proof suggests an explicit formula for the networks. There is no training involved in the usual sense. In particular, we do not use any optimization technique, such as minimization of a
backpropagation error surface. Thus, the large number of neurons required for our networks is offset by the extremely simple "training," and we do not encounter any of the problems commonly associated with optimization, such as local minima. In the next section, we state and discuss our results. The proofs are given in Section 3.

2 Main Results

Following the approach in the papers of Sandberg (1991a, 1991b), we think of an arbitrary nonlinear system as a continuous functional F acting on a compact subset of some L^p space. Let s ≥ 1 be an integer, which will be fixed throughout the rest of this paper. If A ⊂ R^s is a (Lebesgue-) measurable set and f: A → R is a measurable function, we write

‖f‖_{p,A} := (∫_A |f(t)|^p dt)^{1/p} if 1 ≤ p < ∞, and ‖f‖_{∞,A} := ess sup_{t∈A} |f(t)|.  (2.1)
The space L^p(A) is defined to be the class of all functions f: A → R for which ‖f‖_{p,A} < ∞. As usual, we consider two functions to be equal if they are equal almost everywhere in the measure-theoretic sense. In this article, if A = [−1, 1]^s, then we will omit the mention of this set from the notation; thus, we write ‖f‖_p to denote ‖f‖_{p,[−1,1]^s} and so forth. Further, we will have no occasion to consider functions in L^∞(A) that are not continuous. Therefore, we will simplify our notation by using L^∞(A) to denote also the class C(A), consisting of bounded, uniformly continuous functions on A. We also recall a few facts about compact subsets of L^p. There are several known characterizations of compact sets in L^p(A) (Sobolev 1963); for illustrating our theorems, we find the following characterization useful, where we restrict our attention to the case A = [−1, 1]^s. For any real number λ > 0, we denote the class of all algebraic polynomials in s variables of coordinatewise degree not exceeding λ by Π_{λ,s}. If f ∈ L^p and λ ≥ 0, we write

E_{λ,p,s}(f) := inf_{P∈Π_{λ,s}} ‖f − P‖_p.  (2.2)
It is not too difficult to verify (cf. Lorentz 1953, p. 33) that a closed set K ⊂ L^p is compact if and only if there is a constant M_K > 0 and a sequence of positive numbers {ε_{m,K}}, converging to 0 as m → ∞, such that for every f ∈ K,

‖f‖_p ≤ M_K,   E_{m,p,s}(f) ≤ ε_{m,K},   m = 0, 1, . . . .  (2.3)

For example, for the set of functions satisfying a Hölder condition of order α, one may choose ε_{m,K} = c m^{−α} for a suitable constant c.
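This decay can be checked numerically. A minimal sketch (our own illustration, using a least-squares Chebyshev fit on a dense grid as a crude upper-bound proxy for the best approximation error with s = 1, p = ∞):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def poly_approx_error(f, m, grid=2001):
    """Uniform error of a degree-m least-squares polynomial fit on [-1, 1];
    an upper-bound proxy for the best approximation error E_m."""
    x = np.linspace(-1.0, 1.0, grid)
    coeffs = C.chebfit(x, f(x), m)
    return np.max(np.abs(f(x) - C.chebval(x, coeffs)))

# f(x) = |x|**0.5 satisfies a Hoelder condition of order 1/2;
# the errors below decay roughly like m**(-1/2).
for m in (4, 16, 64, 256):
    print(m, poly_approx_error(lambda x: np.abs(x) ** 0.5, m))
```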
Given a system defined by a functional F on a compact set K ⊂ L^p, the objective of system identification is to build another functional G on this compact set that will approximate F well and also serve as a model with known parameters. Although the problem arises because F is typically unknown, it is reasonable to make some a priori assumptions on a class of functionals to which F may belong, as is done commonly in the theory of approximation of functions. Moreover, if we wish to consider the system not as a functional but as an operator T producing an output signal Tf on receiving an input signal f, this operator can be thought of as a set of functionals: each t in the domain of the output signal defines a functional f → Tf(t). A model for this operator will achieve uniform approximation of all these functionals. Thus, in this paper, we are interested in approximating every functional belonging to a class 𝓕 satisfying some minimal conditions, which we now describe. We recall that a function Ω: (0, ∞) → (0, ∞) is said to be a modulus of continuity (Lorentz 1966) if each of the following properties holds: (a) Ω(h) → 0 as h → 0; (b) Ω is a positive and increasing function; and (c) Ω is subadditive, that is,

Ω(h₁ + h₂) ≤ Ω(h₁) + Ω(h₂),   h₁, h₂ > 0.  (2.4)
For example, for any α ∈ (0, 1), the function Ω(h) = h^α is a modulus of continuity. For any real functional F uniformly continuous on L^p, the quantity sup{|F(f) − F(g)|: ‖f − g‖_p ≤ δ}, considered as a function of δ, is a modulus of continuity (of F). More generally, if X is a topological space and T: L^p → C(X) is a bounded, uniformly continuous function, then the quantity sup{|Tf(x) − Tg(x)|: x ∈ X, ‖f − g‖_p ≤ δ}, considered as a function of δ, is a modulus of continuity (of T). For each x ∈ X, the modulus of continuity of T is a majorant for the modulus of continuity of the functional f → Tf(x). Motivated by this observation and the Jackson theorem in the theory of trigonometric approximation, the only assumption we wish to make about the functional is that there be a known majorant for its modulus of continuity. Accordingly, we will be interested here in the universal approximation of the class

𝓕 := 𝓕_{Ω,p,s} := {F: L^p → R : |F(f) − F(g)| ≤ Ω(‖f − g‖_p), f, g ∈ L^p},  (2.5)
where Ω is a given modulus of continuity. Our main objective in this article is to construct generalized translation networks to model every functional in 𝓕. We now proceed to describe these networks. Let 1 ≤ d ≤ D, N ≥ 1 be integers, and φ: R^d → R. A generalized translation network with N neurons evaluates a function of the form

Σ_{k=1}^{N} a_k φ(A_k(·) + b_k),

where the weights A_k are d × D real matrices, the thresholds b_k ∈ R^d, and the coefficients a_k ∈ R (1 ≤ k ≤ N). The set of all such functions (with a fixed N) will be denoted by Π_{φ;N,D}. When d = 1, Π_{φ;N,D} is the set of all outputs of a neural network with N neurons, each evaluating the activation function φ and receiving D inputs. When d = D and φ is a radially symmetric function, Π_{φ;N,D} is the set of all outputs of a radial basis function network. Girosi et al. (1995) have pointed out the importance of studying the more general case considered here; they have demonstrated how such general networks arise naturally, as solutions of certain extremal problems, in applications such as image processing and graphics. Our first theorem in this section is Theorem 2.1. For multiintegers k = (k₁, . . . , k_d), m = (m₁, . . . , m_d) ∈ Z^d, we write k ≥ m if k_j ≥ m_j, 1 ≤ j ≤ d. If t ∈ R, we write k ≥ t if k_j ≥ t, 1 ≤ j ≤ d. The set of all k ∈ Z^d with k ≥ 0 will be denoted by Z^d₊. Throughout the rest of this article, we adopt the following convention regarding constants: the letters c, c₁, c₂, . . . denote positive constants depending only on p, d, s, φ, K, and 𝓕 (equivalently Ω), but independent of any other variables not explicitly indicated. Their values may be different at different occurrences, even within the same formula.
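A sketch of how such a network evaluates its output (an illustration only; the function phi and the parameter lists are placeholders):

```python
import numpy as np

def gtn_output(x, A_list, b_list, a_list, phi):
    """Generalized translation network: sum_k a_k * phi(A_k @ x + b_k),
    where each A_k is d x D, each b_k lies in R^d, each a_k is real,
    and phi maps R^d to R."""
    return sum(a * phi(A @ x + b)
               for A, b, a in zip(A_list, b_list, a_list))

# d = 1 with a sigmoidal phi gives an ordinary neural network with D inputs;
# d = D with a radially symmetric phi gives a radial basis function network.
```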
Theorem 2.1. Let d ≥ 1 be an integer, and let φ: R^d → R be infinitely many times continuously differentiable in some open sphere in R^d. We further assume that there exists b in this sphere such that

D^k φ(b) ≠ 0,   k ∈ Z^d₊.  (2.6)

Let s ≥ 1, N ≥ 1, m ≥ (log d)/s be integers, 1 ≤ p ≤ ∞, K be a compact subset of L^p, Ω be a modulus of continuity, and 𝓕 be the set as in equation 2.5. We write D := (2m + 1)^s. There exist continuous linear functionals γ_k: C(K) → R, k = 1, . . . , N, and a continuous linear operator L_m: L^p → R^D, with the following property: for every F ∈ 𝓕, there exist d × D matrices A_k := A_k(F) with rank d, k = 1, . . . , N, such that

sup_{f∈K} |F(f) − Σ_{k=1}^{N} γ_k(F) φ(A_k(L_m(f)) + b)| ≤ c {Ω(ε_{m,K}) + Ω(m^β N^{−1/D})},  (2.7)

where β := 2s max(1/p, 1 − 1/p).

It may seem a bit awkward to define the matrices A_k and the operator L_m as we did in the above theorem, rather than just defining operators from L^p into R^d. The reason for this will be explained during our discussion of the lower bounds. It is proved in Mhaskar (1996) that condition 2.6 is satisfied for
a large class of activation functions, including each of the following, where for x ∈ R^d we write |x| := (Σ_{j=1}^{d} x_j²)^{1/2}:

• d = 1, φ(x) := (1 + e^{−x})^{−1} (the squashing function);
• d ≥ 1, φ(x) := (1 + |x|²)^α, α ∉ Z (generalized multiquadrics);
• d ≥ 1, q ∈ Z, q > d/2, φ(x) := |x|^{2q−d} log |x| for d even and φ(x) := |x|^{2q−d} for d odd (thin-plate splines);
• d ≥ 1, φ(x) := exp(−|x|²) (the gaussian function).
An important special case of the compact set K in Theorem 2.1 is the case when ε_{m,K} = O(m^{−α}) for some α > 0. In this case, the order of magnitude of the quantity on the right-hand side of estimate 2.7 is minimized if we choose m to be a solution of the equation

(2m + 1)^s log m = log N / (α + β);  (2.8)

that is (cf. Olver 1974, Ex. 5.7, p. 14),

m ∼ (log N / log log N)^{1/s}.  (2.9)

(Our notation here is different from that in Olver 1974. By A ∼ B, we mean c₁B ≤ A ≤ c₂B.) The right-hand side of estimate 2.7 is then estimated by

cΩ((log log N / log N)^{α/s}).

We observe that if N_P denotes the number of real parameters in the network of estimate 2.7, then this order of magnitude is also

cΩ((log log N_P / log N_P)^{α/s}).

This looks like a very weak rate of convergence indeed, but we will prove that, under the weak a priori assumptions we have made, this result cannot be improved in general, except possibly for the factor of log log N_P in the numerator above. Before describing these lower bounds, we take this opportunity to make a few remarks about the construction of the networks. We observe that the operator L_m is independent of the functionals F and the input signals f. During the proof, we will give explicit expressions for the quantities L_m, A_k,
and the coefficients γ_k. It will then be seen that the matrices A_k (and clearly the threshold b) are all uniformly bounded, independent of the accuracy desired. It will also be seen that the matrices A_k depend not so much on F itself as on the entire class 𝓕 and on sup{|F(P)|: P ∈ Π_{2m,s}, ‖P‖_p ≤ c(N, m, p, s)} for some constant c depending on the indicated parameters only. In particular, if we restrict the set 𝓕 to consist of functionals satisfying an a priori bound, then the matrices A_k will also be independent of F. In function approximation, a bound on the Sobolev norm already implies such a bound on the norms of the target functions; there are technical reasons for not imposing such a bound at the outset in this context. Nevertheless, there is no training involved in finding these and other parameters of the network, and we have avoided all the shortcomings of the usual optimization-based techniques, such as backpropagation. Necessarily, the coefficients γ_k will be unbounded as the accuracy tends to 0; this is inevitable even in the case of ordinary function approximation (Mhaskar forthcoming). This situation is akin to the case of the familiar polynomial approximation on the interval: if one writes the approximating polynomials in terms of the monomials, the coefficients are far from being well behaved, and a change of basis to some system of orthogonal polynomials cures this problem. It will be seen from our proof that this is exactly the situation here as well. Thus, to implement the networks in practice, one should first implement certain auxiliary networks. These are independent of the actual systems being implemented, and the extra effort will pay off in terms of more stable coefficients of the resulting networks in the actual approximation of the systems. In the case of radial basis function networks, our matrices are not all equal to the identity matrix; our networks are not pure translation networks. The ideas in Mhaskar (1995) can be used to construct gaussian networks achieving the same rate of approximation, where only pure translations are required. We will not pursue these ideas here; an examination of our proofs and those in Mhaskar (1995) will show clearly how to do this. No new ideas are involved in this construction. Now we turn our attention to the lower bounds. These will be obtained for any models, not just for generalized translation networks. A model can be described mathematically in terms of two functions: a function π: 𝓕 → R^N, which selects the parameters of the model, and a function M_N: R^N → C(K), which reconstructs the model using these parameters. For any system F ∈ 𝓕, M_N(π(F)) is the model simulating F. For example, in the case of a neural network model, N denotes the total number of its real parameters, including the coefficients, components of weights, and thresholds. The training of the network consists of defining the function π, which determines these parameters given a target function. The mapping M_N then defines the output of the network itself, given these parameters. In our formulation of Theorem 2.1,
the operator L_m, being independent of F and f, is part of the definition of M_N. The lower bounds in the discussion below will apply even if the weights A_k were to depend on F in a much more complicated manner. The split formulation helps us allow this dependence without introducing new definitions of "widths" (see below) and also underlines the possible interdependence of π and M_N. It is naturally desirable that the function π be continuous on 𝓕, so as to achieve a robust model. As pointed out by Helmicki et al. (1991), it is desirable to measure the error of this simulation in the worst-case scenario. Mathematically, this amounts to estimating the quantity

sup_{f∈K} |F(f) − M_N(π(F), f)|.

Since F itself is unknown as well, we are led to consider

sup_{F∈𝓕, f∈K} |F(f) − M_N(π(F), f)|.

The inherent error in modeling is then measured by the nonlinear N-width (DeVore et al. 1989), defined as

Δ_N(𝓕) := inf_{M_N,π} sup_{F∈𝓕, f∈K} |F(f) − M_N(π(F), f)|,  (2.10)
where the infimum is taken over all continuous functions π: 𝓕 → R^N and all operators M_N: R^N → C(K). The following theorem gives a lower bound on the nonlinear N-width of 𝓕 in the case of certain compact sets K ⊂ L². To describe this compact set, we write |k|_∞ = max_{1≤j≤s} |k_j|, k = (k₁, . . . , k_s) ∈ Z^s. For f ∈ L², let Σ a_k(f)P_k denote the Fourier-Legendre expansion of f (cf. formulas 3.2 and 3.5).

Theorem 2.2. Let s ≥ 1 be an integer, α > 0, and

K := { f ∈ L² : Σ_k |k|_∞^{2α} |a_k(f)|² ≤ 1 }.  (2.11)

Let Ω be a modulus of continuity. Then for integer N ≥ 1,

Δ_N(𝓕_{Ω,2,s}) ≥ cΩ((log N)^{−α/s}).  (2.12)
3 Proofs

In this section, we prove Theorems 2.1 and 2.2. The idea behind the proof of Theorem 2.1 is the following: from 𝓕 and K, we construct a family of functions on a ball in R^D. Each member of this family will be approximated
by a polynomial of a certain degree n, where the rate of approximation is given by Newman and Shapiro (1964). Neural networks to approximate such polynomials are developed in Mhaskar (1996). We start this program by recalling the definition and a few facts about Legendre polynomials. A standard way to define the univariate, orthonormalized Legendre polynomial is by Rodrigues's formula (Szegő 1975):

P_n(x) := ((−1)^n √(n + 1/2) / (2^n n!)) (d/dx)^n {(1 − x²)^n},   x ∈ R, n = 0, 1, . . . .  (3.1)

If ℓ ≥ 2, k = (k₁, . . . , k_ℓ) ∈ Z^ℓ₊, and x = (x₁, . . . , x_ℓ) ∈ R^ℓ, then we write

P_k(x) := Π_{j=1}^{ℓ} P_{k_j}(x_j).  (3.2)
It is well known that

∫_{[−1,1]^ℓ} P_k(x) P_m(x) dx = 1 if k = m, and 0 otherwise.  (3.3)

If g ∈ L¹([−1, 1]^ℓ), its Fourier-Legendre coefficients are defined by

a_k(g) := ∫_{[−1,1]^ℓ} g(t) P_k(t) dt,  (3.4)

and the expression

Σ_{k∈Z^ℓ, k≥0} a_k(g) P_k,  (3.5)
convergent or not, is called its Fourier-Legendre expansion. Even if the expansion does not converge, the multisequence {a_k(g)} uniquely determines g. We develop some further notation. If D is as in Theorem 2.1 and a ∈ R^D, then we may index the coordinates of a by elements of {0, 1, . . . , 2m}^s instead of the usual indexing by elements of {1, 2, . . . , (2m + 1)^s}. The operator Π: R^D → Π_{2m,s} is then defined by

Π(a) := Σ_{0≤k≤2m} a_k P_k.  (3.6)
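A small numerical sketch of these objects (our own helper names; tensor Gauss-Legendre quadrature is one convenient way to evaluate equation 3.4 for small ℓ):

```python
import numpy as np
from numpy.polynomial import legendre as L

def orthonormal_legendre(n, x):
    """Orthonormalized univariate Legendre polynomial (equation 3.1)."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return np.sqrt(n + 0.5) * L.legval(x, c)

def fourier_legendre_coeff(g, k, quad_pts=40):
    """a_k(g) on [-1, 1]^len(k) via tensor Gauss-Legendre quadrature."""
    nodes, weights = L.leggauss(quad_pts)
    grids = np.meshgrid(*([nodes] * len(k)), indexing="ij")
    w = np.ones_like(grids[0])
    for W in np.meshgrid(*([weights] * len(k)), indexing="ij"):
        w = w * W
    Pk = np.ones_like(grids[0])
    for kj, xj in zip(k, grids):
        Pk = Pk * orthonormal_legendre(kj, xj)
    return float(np.sum(w * g(*grids) * Pk))
```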
The quantity |a| will denote the Euclidean norm of a. The following lemma establishes an important connection between |a| and ‖Π(a)‖_p.
Lemma 3.1. Let m ≥ 0 be an integer, D = (2m + 1)^s. Then, with

A_{m,p,s} := m^{2s max(1/p−1/2, 0)},   B_{m,p,s} := m^{2s max(1/2−1/p, 0)},  (3.7)

we have

|a| ≤ c A_{m,p,s} ‖Π(a)‖_p ≤ c₁ A_{m,p,s} B_{m,p,s} |a| = c₁ m^{β−s} |a|,  (3.8)

where β := 2s max(1/p, 1 − 1/p).

Proof. Let 0 < q, r ≤ ∞, let ℓ ≥ 1 be an integer, and let P ∈ Π_{ℓm,1} be a univariate polynomial. Then it is known (Nevai 1979, p. 114) that

‖P‖_{q,[−1,1]} ≤ c m^{2(1/r−1/q)} ‖P‖_{r,[−1,1]} if r < q, and ‖P‖_{q,[−1,1]} ≤ c ‖P‖_{r,[−1,1]} if r ≥ q.  (3.9)

The case q = ∞ is not included on p. 114 of Nevai (1979); however, it forms the basis for the proof there, and it is proved separately in the form of Theorem 13 (p. 113) and Lemma 5 (p. 108) of Nevai (1979). The second inequality above is, of course, just a consequence of Hölder's inequality. If P ∈ Π_{ℓm,s}, we apply the above estimates one by one to each of the coordinates of its argument to obtain

‖P‖_q ≤ c m^{2s(1/r−1/q)} ‖P‖_r if r < q, and ‖P‖_q ≤ c ‖P‖_r if r ≥ q.  (3.10)

The estimate 3.8 is now easy to deduce using the Parseval identity.

The following lemma gives the first step of our proof: the construction of a family of functions on a ball of R^D. We assume the notation and conditions of Theorem 2.1.

Lemma 3.2.
There exists a continuous linear operator L_m: L^p → R^D such that

|F(f) − F(Π(L_m(f)))| ≤ cΩ(ε_{m,K}),   F ∈ 𝓕, f ∈ K.  (3.11)

Moreover,

|L_m(f)| ≤ C A_{m,p,s} ‖f‖_p,   f ∈ L^p,  (3.12)

where C > 0 is a constant depending only on p and s, but not on m.

Proof. We take any continuous linear operator V_m: L^p → Π_{2m,s} such that

‖f − V_m(f)‖_p ≤ c E_{m,p,s}(f)  (3.13)
for all f ∈ L^p, where c is a positive constant depending only on p and s, and not on f or m. There are many such operators known in the literature (cf. Timan 1963; Lorentz 1966; Mhaskar 1996), and the actual choice is immaterial for this proof. We write

L_m(f) := (a_k(V_m(f)))_{0≤k≤2m, k∈Z^s}.  (3.14)

Then L_m is a continuous linear operator on L^p, and

‖f − Π(L_m(f))‖_p = ‖f − V_m(f)‖_p ≤ c ε_{m,K},   f ∈ K.  (3.15)

The estimate 3.11 is now clear. From Lemma 3.1, we obtain for f ∈ L^p that

|L_m(f)| ≤ c A_{m,p,s} ‖V_m(f)‖_p ≤ c A_{m,p,s} ‖f‖_p.  (3.16)

This proves 3.12.

Until the end of the proof of Theorem 2.1, we retain the value of the constant C in estimate 3.12. In order to describe the next step of our proof, we define, for any F ∈ 𝓕, a function µ_m(F) of D variables by the formula

µ_m(F, a) := F(Π(a)),   a ∈ R^D.  (3.17)
The next lemma gives a bound on the degree of polynomial approximation of µ_m(F), thus completing the second step of our proof.

Lemma 3.3. With the notation as in Theorem 2.1, for any F ∈ 𝓕 and integer n ≥ 1, there exists a polynomial Q(F) ∈ Π_{n,D} such that

|µ_m(F, a) − Q(F, a)| ≤ cΩ(m^β/n),   |a| ≤ C A_{m,p,s} M_K.  (3.18)

Proof. Using Lemma 3.1 and the definitions of the functions µ_m and Π, we obtain for any a, d ∈ R^D,

|µ_m(F, a) − µ_m(F, d)| = |F(Π(a)) − F(Π(d))| ≤ Ω(‖Π(a) − Π(d)‖_p) ≤ cΩ(B_{m,p,s} |a − d|).  (3.19)

The assertions of Lemma 3.3 now follow from Theorem 3 in Newman and Shapiro (1964).

Finally, we prove another lemma, which will help us replace the polynomials Q(F) by generalized translation networks.
Lemma 3.4. Let φ satisfy the conditions of Theorem 2.1, let n ≥ 1 be an integer, and let k ∈ Z^D₊ with k ≤ n. Then for every ε > 0, there exists G_{k,n,ε} ∈ Π_{φ;(6n+1)^D,D} such that ‖P_k − G_{k,n,ε}‖_{∞,[−1,1]^D} ≤ ε. The matrices involved in G_{k,n,ε} may be chosen to be matrices of rank d. Further, the matrices in each G_{k,n,ε} may be chosen from a fixed set with cardinality not exceeding (6n + 1)^D. The threshold of each neuron in each G_{k,n,ε} is the b defined in condition 2.6.

Proof. The proof of this lemma is almost verbatim the same as that of Lemma 3.2 of Mhaskar (1996). We omit the details, pointing out only the necessary changes. For w ∈ R^D, we define the d × D matrix

A_w := (diag(w₁, . . . , w_d) | E_w),  (3.20)

where E_w is the d × (D − d) block whose first row is (w_{d+1}, . . . , w_D) and whose remaining entries are 0, and write

Φ(w; x) := φ(A_w x + b),   x ∈ R^D.

Then, as in Lemma 3.2 of Mhaskar (1996), we may express every monomial x^p (p ∈ Z^D₊) as a multiple of a derivative of Φ with respect to w. A divided difference then gives a network Φ_{p,h} ∈ Π_{φ;(p₁+1)···(p_D+1),D} that approximates x^p uniformly on [−1, 1]^D to any desired degree of accuracy. The network G_{k,n,ε} is obtained by expressing P_k in terms of monomials and then replacing each monomial by the approximating network. The matrices involved are of the form of equation 3.20. By slight perturbations, one may also ensure that the matrices are of rank d and yet that the set of matrices is fixed, independent of k.
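For concreteness, a minimal sketch of the matrices A_w (our own helper, with 0-based indexing; the block layout follows equation 3.20):

```python
import numpy as np

def translation_matrix(w, d, D):
    """d x D matrix A_w: diag(w_1,...,w_d) in the left block, the
    remaining weights w_{d+1},...,w_D in the first row of the right
    block, zeros elsewhere."""
    A = np.zeros((d, D))
    A[:d, :d] = np.diag(w[:d])
    A[0, d:] = w[d:D]
    return A
```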
(3.21)
β
≤ cÄ(m /n). From estimate 3.11, we now obtain for all f ∈ K, F ∈ F , © ª |F( f ) − Q(F, Lm ( f ))| ≤ c Ä(²m,K ) + Ä(mβ /n) .
(3.22)
We write

Q(F) =: Σ_{k∈Z^D₊, k≤n} d_k(F) P_k.

Then

Σ_{k∈Z^D₊, k≤n} |d_k(F)| ≤ c(n, m, p, s) max_{|a|≤C A_{m,p,s} M_K} |Q(F, a)|
 ≤ c(n, m, p, s) max_{|a|≤C A_{m,p,s} M_K} |µ_m(F, a)|
 ≤ c(n, m, p, s) max_{P∈Π_{2m,s}, ‖P‖_p ≤ c₂(n,m,p,s)} |F(P)|.
We may then use the networks constructed in Lemma 3.4 to arrive (after an elementary rescaling) at a network approximating Q(F) to an arbitrary degree of accuracy on the ball in R^D of radius C A_{m,p,s} M_K. Moreover, these networks have the form required in Theorem 2.1. In view of estimate 3.22, these networks evaluated at L_m(f) give the required degree of approximation to F(f).

Next, we turn to the proof of Theorem 2.2. As expected, the proof of this theorem relies on the notion of Bernstein widths. If X is a normed linear space with norm ‖·‖, A ⊆ X, and N ≥ 1 is an integer, the Bernstein N-width of A (with respect to X) is defined as follows. Let X_{N+1} be the set of all linear subspaces of X having dimension not exceeding N + 1. The Bernstein N-width is given by the expression

b_N(A) := sup_{Y∈X_{N+1}} {ρ: ρf/‖f‖ ∈ A for all f ∈ Y}.  (3.23)
In the context of this article, we take X to be the space C(K) with the natural supremum norm. According to a result of DeVore et al. (1989),

Δ_N(𝓕) ≥ b_N(𝓕),   N = 1, 2, . . . .  (3.24)

Therefore, we will obtain a lower bound for b_N(𝓕). The idea is the same as in Theorem C of Newman and Shapiro (1964). We provide some details, since we are unable to find a precise reference in the context of Bernstein N-widths.

Proof of Theorem 2.2. In this proof, we write 𝓕 := 𝓕_{Ω,2,s}. Let m be an integer such that 2m + 1 ∼ (log N)^{1/s}, and let D := (2m + 1)^s. Following Newman and Shapiro (1964, Lemma 1), we obtain N + 1 points a₁, . . . , a_{N+1} in R^D such that

|a_i| ≤ 1/D^{α/s},   i = 1, . . . , N + 1,  (3.25)
and

|a_i − a_j| ≥ (2D^{α/s} N^{1/D})^{−1},   i ≠ j, i, j = 1, . . . , N + 1.  (3.26)

Then the functions

f_i := Π(a_i)  (3.27)

are in K, and

‖f_i − f_j‖₂ = |a_i − a_j| ≥ (2D^{α/s} N^{1/D})^{−1} =: 2δ,   i ≠ j, i, j = 1, . . . , N + 1.  (3.28)

Next, we define, for f ∈ L² and i = 1, . . . , N + 1,

F_i(f) := Ω(δ − ‖f − f_i‖₂) if ‖f − f_i‖₂ < δ, and F_i(f) := 0 otherwise.  (3.29)
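These "bump" functionals are straightforward to realize in code (a sketch under our own naming; functions are represented by their Fourier-Legendre coefficient vectors, so the L² norm becomes the Euclidean norm of the coefficients):

```python
import numpy as np

def bump_functional(f_i, delta, omega):
    """F_i of equation 3.29: omega(delta - ||f - f_i||) inside the ball of
    radius delta around f_i, and 0 outside; `f_i` is a coefficient vector,
    `omega` a modulus of continuity such as lambda h: h ** 0.5."""
    def F(a):
        dist = np.linalg.norm(np.asarray(a) - f_i)
        return omega(delta - dist) if dist < delta else 0.0
    return F
```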
Then each F_i is a bounded, continuous, real functional on L². If Y is the linear span of the F_i, then Y has dimension N + 1. Let F = Σ d_i F_i be an arbitrary element of Y. Then

‖F‖ := sup_{f∈K} |F(f)| = Ω(δ) max_{1≤i≤N+1} |d_i|.  (3.30)

Using the monotonicity and subadditivity of Ω, if ‖f − f_i‖₂, ‖g − f_i‖₂ ≤ δ, then

|F(f) − F(g)| = |d_i| |Ω(δ − ‖f − f_i‖₂) − Ω(δ − ‖g − f_i‖₂)| ≤ |d_i| Ω(|‖g − f_i‖₂ − ‖f − f_i‖₂|) ≤ (‖F‖/Ω(δ)) Ω(‖f − g‖₂).  (3.31)

If ‖f − f_i‖₂, ‖g − f_j‖₂ ≤ δ and i ≠ j, then we pick h, h̃ with ‖h − f_i‖₂ = δ, ‖h̃ − f_j‖₂ = δ, and ‖f − h‖₂, ‖g − h̃‖₂ ≤ ‖f − g‖₂. (For example, we may take the "line" joining f and g, and let h and h̃ be the "points" where this "line" intersects the balls around f_i and f_j, respectively.) Then F(h) = F(h̃) = 0 and, using the estimate just proved,

|F(f) − F(g)| ≤ |F(f) − F(h)| + |F(h̃) − F(g)| ≤ (‖F‖/Ω(δ)) Ω(‖f − h‖₂) + (‖F‖/Ω(δ)) Ω(‖h̃ − g‖₂) ≤ (2‖F‖/Ω(δ)) Ω(‖f − g‖₂).  (3.32)
The case when at most one of f and g is in a ball of radius δ around one of the f_i's is simpler. We have therefore proved that

|F(f) − F(g)| ≤ (2‖F‖/Ω(δ)) Ω(‖f − g‖₂)  (3.33)

for all f, g ∈ L². Consequently, any F ∈ Y with ‖F‖ ≤ Ω(δ)/2 is in 𝓕; that is,

b_N(𝓕) ≥ (1/2)Ω(δ).  (3.34)

From the definitions of D and δ (equation 3.28), we see that N^{1/D} ∼ 1 and δ ∼ (log N)^{−α/s}. Therefore, the proof of Theorem 2.2 is complete in view of the estimate 3.24.

4 Conclusions

Given a continuous, not necessarily linear, functional F on an L^p space, we have constructed generalized translation networks that provide a near-optimal approximation to F in terms of the number of neurons involved. Our networks evaluate a functional of the form Σ_{k=1}^{N} γ_k φ(A_k(L_m(f)) + b), where f is the input signal in the L^p space and φ: R^d → R is the activation function. This setup covers conventional neural networks (d = 1), as well as radial (elliptical) basis function networks, where d = D and φ is a radially symmetric function. In our construction, L_m is a linear operator mapping f to a judiciously chosen number of coefficients in a polynomial approximation of f. In particular, this operator is independent of both the input signals and the system to be approximated. The remaining parameters are obtained as in Mhaskar (1996) so as to approximate a function of finitely many real variables, constructed from F and L_m. The parameters A_k are full-rank matrices depending only on a norm of F, and they are the same for all F having the same norm bound. The parameter b depends only on φ. The parameters γ_k are continuous linear functionals of F. Explicit formulas are given for all of these parameters; one does not need to train the network in the usual sense. In particular, our constructions are explicit and deterministic and do not involve any iterative and/or optimization techniques, such as backpropagation. Our main objective is to obtain estimates on the number of neurons in terms of the desired accuracy of approximation and the properties of the functional F and the compact set on which the approximation takes place. Our estimates are also valid when a uniform approximation of a uniformly continuous operator on L^p, taking values in a class of bounded, continuous functions, is desired. Lower bounds for the degree of approximation are also obtained. Although this article is mainly of a theoretical nature, the proofs suggest actual constructions based on classical polynomial approximation theory.
We have not conducted any numerical experiments, but the approximation theory tools used here have been well tested over the last several decades.

Acknowledgments

The research of H. N. Mhaskar was supported, in part, by National Science Foundation Grant DMS 9404513 and Air Force Office of Scientific Research Grant F49620-93-1-0150. We are grateful to Professor Dr. I. Sandberg and one of the referees for clarifying the history of this problem, as well as to both referees for their valuable suggestions for improving the presentation of this article.

References

Barron, A. R. 1993. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. Information Theory 39, 930–945.
Chen, T., and Chen, H. 1993. Approximation of continuous functionals by neural networks with application to dynamical systems. IEEE Trans. Neural Networks 4, 910–918.
Chen, T., and Chen, H. 1995a. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural Networks 6, 911–917.
Chen, T., and Chen, H. 1995b. Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Trans. Neural Networks 6, 904–910.
Cybenko, G. 1989. Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems 2, 303–314.
DeVore, R., Howard, R., and Micchelli, C. A. 1989. Optimal nonlinear approximation. Manuscripta Mathematica 63, 469–478.
Dingankar, A. T. 1995. On applications of approximation theory to identification, control, and classification. Ph.D. dissertation, University of Texas at Austin.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation 7, 219–269.
Goffman, C., and Pedrick, G. 1965. First Course in Functional Analysis. Prentice Hall, Englewood Cliffs, NJ.
Helmicki, A. J., Jacobsen, C. A., and Nett, C. N. 1991. Control oriented system identification: A worst case deterministic approach in H∞. IEEE Trans. Auto. Control 3, 1163–1176.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366.
Levin, A. U., and Narendra, K. S. 1995. Identification using feedforward networks. Neural Computation 7, 349–357.
Lorentz, G. G. 1953. Bernstein Polynomials. University of Toronto Press, Toronto.
Lorentz, G. G. 1966. Approximation of Functions. Holt, Rinehart and Winston, New York.
Mhaskar, H. N. 1995. Versatile gaussian networks. In Proceedings of IEEE Workshop on Nonlinear Image and Signal Processing, I. Pitas, ed., pp. 70–73. Halkidiki, Greece, IEEE.
Mhaskar, H. N. 1996. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation 8, 164–177.
Mhaskar, H. N. Forthcoming. On smooth activation functions. Annals of Mathematics in Artificial Intelligence.
Mhaskar, H. N., and Micchelli, C. A. 1995. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics 16, 151–183.
Modha, D. S., and Hecht-Nielsen, R. 1993. Multilayer functionals. In Mathematical Approaches to Neural Networks, J. G. Taylor, ed., pp. 235–260. Elsevier Science, North Holland.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1, 4–27.
Nevai, P. 1979. Orthogonal polynomials. Memoirs of Amer. Math. Soc. 18. American Mathematical Society, Providence, RI.
Newman, D. J., and Shapiro, H. S. 1964. Jackson's theorem in higher dimensions. In Approximation Theory, Proc. Conf. Math. Res. Inst., Oberwolfach, 1963, P. L. Butzer and J. Korevaar, eds., ISNM 5, pp. 208–219. Birkhäuser, Basel.
Olver, F. W. J. 1974. Asymptotics and Special Functions. Academic Press, New York.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial basis function networks. Neural Computation 3, 246–257.
Sandberg, I. W. 1991a. Approximation theorems for discrete time systems. IEEE Trans. Circuits and Systems 38, 564–566.
Sandberg, I. W. 1991b. Gaussian basis functions and approximations for nonlinear systems. In Digest of the Ninth Kobe International Symposium on Electronics and Information Sciences, pp. 3-1–3-6. Kobe, Japan.
Sobolev, S. L. 1963. Applications of Functional Analysis in Mathematical Physics. Trans. Math. Monographs 7. American Mathematical Society, Providence, RI.
Sontag, E. D. N.d. Some topics in neural networks and control. Rep. LS93-02, Siemens Corporate Research.
Szegő, G. 1975. Orthogonal Polynomials. Amer. Math. Soc. Coll. Publ. 23. American Mathematical Society, Providence, RI.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan Co., New York.
Received January 8, 1996; accepted April 1, 1996.
Communicated by David Cohn and Mark Plutowski
Selecting Optimal Experiments for Multiple Output Multilayer Perceptrons

Lisa M. Belue
Kenneth W. Bauer, Jr.
Department of Operational Sciences, Department of the Air Force, Air Force Institute of Technology, 2950 P Street, Wright-Patterson AFB, OH 45433-7765 USA
Dennis W. Ruck
Department of Electrical and Computer Engineering, Department of the Air Force, Air Force Institute of Technology, 2950 P Street, Wright-Patterson AFB, OH 45433-7765 USA
Where should a researcher conduct experiments to provide training data for a multilayer perceptron? This question is investigated, and a statistical method for selecting optimal experimental design points for multiple output multilayer perceptrons is introduced. Multiple class discrimination problems are examined using a framework in which the multilayer perceptron is viewed as a multivariate nonlinear regression model. Following a Bayesian formulation for the case where the variance-covariance matrix of the responses is unknown, a selection criterion is developed. This criterion is based on the volume of the joint confidence ellipsoid for the weights in a multilayer perceptron. An example is used to demonstrate the superiority of optimally selected design points over randomly chosen points, as well as points chosen in a grid pattern. Simplification of the basic criterion is offered through the use of Hadamard matrices to produce uncorrelated outputs.
1 Introduction

In this article a cohesive strategy is developed for selecting a set of design points in experimentation so as to develop accurate multilayer perceptron classifiers in a multiclass discrimination setting. Neural network researchers and practitioners often assume that training data are "provided by a source that we do not control" (MacKay 1992). The efficient selection of data is
necessary, however, if experiments to gather data are time-consuming or expensive. (See Rogers and Kabrisky 1991 for a general background and history of multilayer perceptrons.) In experimental design, one seeks to select points from some region of operability such that the design defined by these points is in some sense optimal. Most often, this selection process consists of developing some sensible criterion based on an assumed model and using this criterion to obtain a design. In 1943, Wald proposed a determinant criterion, the determinant being used as a measure of the "smallness" of a matrix. In 1959, Box and Lucas demonstrated how such a criterion could be applied to univariate nonlinear models (St. John and Draper 1975). Their criterion is constructed to minimize the volume of the joint confidence ellipsoid for the parameters in a model. This article explores the synthesis of the neural network and experimental design topic areas. The following example illustrates the setting: a researcher (chemist, engineer, biotechnician, etc., whichever is most familiar) believes that the performance of a system he or she is investigating would be best modeled by a multilayer perceptron with some set of physical characteristics of the system as inputs. A small amount of screening data is available. The researcher wishes to perform additional experiments to develop the most accurate multilayer perceptron classifier possible. Which experiments should be performed? Research has been conducted on selecting data that are expected to improve the generalization performance of neural networks (Cohn et al. 1990; Baum and Lang 1991; Hwang et al. 1991; Plutowski and White 1993). Results based on optimal experimental design have been described by Kinzel and Ruján (1990) and Watkin and Rau (1992) for simple perceptrons, and by Sollich (1994) for linear perceptrons. Cohn (1994) uses an inverse Hessian method to choose a trajectory through the input space, choosing a single point to be added to the training set at each step. The approach of maximizing the "information matrix" of a multilayer perceptron was described by MacKay (1992), using a sampling-theoretic perspective similar to that found in Fedorov (1972). MacKay derives a criterion that measures how useful a data point is expected to be; his strategy centers on maximizing the expected information gain of a new data point. In this article, we develop and illustrate the use of a practical procedure for selecting a new set of data points for multiple output multilayer perceptrons. Specifically, we adopt a Bayesian framework based on the works of Box and Lucas (1959), Draper and Hunter (1966), M. J. Box and Draper (1972), and others. Our procedure involves the calculation of a statistical criterion that, in the single output case, is equivalent to that proposed by MacKay (1992). The criterion can be challenging to compute in the multiple output case, and we offer a practical expedient to this difficulty by using desired outputs based on Hadamard matrices. Finally, we demonstrate the procedure on two example problems.
2 A Design Criterion for Multiple Output Multilayer Perceptrons 2.1 The Criterion for Multiple Response Nonlinear Models. Box and Hunter (1965) assert the need for carefully selecting design points: “If experiments are not carefully planned, the experimental points may be so situated in the space of the variables that the estimates which can be obtained for the parameters are not only imprecise but also highly correlated. Once the data are collected, a statistical analysis, no matter how elaborate, can do nothing to remedy this unfortunate situation. However, by the selection of suitable experimental design in advance, these shortcomings can often be overcome.” We follow a Bayesian formulation developed by Draper and Hunter (1966). First, we establish a design of experiments criteria for the case when the variance-covariance matrix of the responses is known. Next, we establish certain approximations necessary for practical implementation of the criterion. Finally, we show extensions to these arguments when the variancecovariance matrix of the responses is unknown. Let yis (i = 1, . . . , r; s = 1, . . . , N) be the ith response of the sth observation and xs = (xs1 , xs2 , . . . , xsn ), s = 1, . . . , N represent the levels of the independent variables. Then consider the model: yis = fi (xs ; θ ) + εis
(2.1)
where fi are r response functions, θ is a vector of p unknown parameters, and εis denote independent and identically distributed random errors with εis ∼ (0, σ 2 ) and ε = (ε1s , . . . , εrs )T ∼ (0, 6). Define the quantity vij =
N0 X [yis − fi (xs ; θ )][yjs − fj (xs ; θ )]
i, j = 1, . . . , r .
(2.2)
s=1
Suppose N0 observations of the form ys = (y1s , y2s , . . . , yrs )T
s = 1, 2, . . . , N0
(2.3)
have already been obtained, and let y = (y1 , y2 , . . . , yN0 )T . Assume y1 , y2 , . . . , yr follow a multivariate normal distribution. Then, given that the N0 observations are independent, the likelihood is − 12 N0 r
p(y | θ , σ ij ) = (2π)
|A|
1 2 N0
r r X 1X exp − σ ij vij 2 i=1 j=1
(2.4)
where A = 6 −1 = {σ ij } = {σij }−1 and the σ ij are assumed known (Box and Draper 1965).
164
Lisa M. Belue, Kenneth W. Bauer, Jr., and Dennis W. Ruck
Since little is known a priori about the values of the parameters, a locally uniform prior distribution (Box and Draper 1965) is used so that p(θ )dθ ∝ dθ .
(2.5)
A prior distribution such as this is intended to approximate information about the model parameters that is very vague when compared with the information supplied by the data (Seber and Wild 1989). Now, the posterior distribution for θ after N0 observations is proportional to pN0 (θ | σ ij , y)dθ .
(2.6)
According to Draper and Hunter (1966), this posterior distribution can be used as the prior distribution to determine the posterior distribution after N further observations are made. The new posterior distribution for θ after N0 + N observations is proportional to − 12 (N0 +N)r
pN0 +N (θ | σ ij , y)dθ ∝ (2π)
|A|
1 2 (N0 +N)
r r X 1X exp − σ ij vij dθ , 2 i=1 j=1 (2.7)
with the upper summation limits of $v_{ij}$ in equation 2.2 now extending to $N_0 + N$. The goal is to select the N observations in such a way that the posterior density (see equation 2.7) obtained after $N_0 + N$ observations is maximized with respect to $\theta$ and with respect to the N observations to be chosen. From a Bayesian point of view, all the information concerning the parameters $\theta$ is contained in the posterior distribution after the observations are obtained. Therefore, the goal is altered to achieve the best possible posterior distribution by proper choice of design before the observations have actually been obtained (Draper and Hunter 1966).

To apply the posterior distribution in equation 2.7, approximations are necessary. Assume that for a region of $\theta$-space sufficiently close to the maximum likelihood estimate $\hat{\theta}$, the following holds:

$$f_i(x_s; \theta) = f_i(x_s; \hat{\theta}) + \sum_{t=1}^{p} (\theta_t - \hat{\theta}_t)\, f_{is}^{(t)}, \quad i = 1, \ldots, r;\; s = 1, \ldots, N \tag{2.8}$$

where

$$f_{is}^{(t)} = \left[\frac{\partial f_i(x_s; \theta)}{\partial \theta_t}\right]_{\theta = \hat{\theta}}, \quad t = 1, \ldots, p. \tag{2.9}$$
Letting

$$\delta_{is} = y_{is} - f_i(x_s; \hat{\theta}), \tag{2.10}$$

one can write

$$\sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} v_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} \sum_{s=1}^{N_0+N} \delta_{is}\delta_{js} + \sum_{i=1}^{r}\sum_{j=1}^{r} (\theta - \hat{\theta})^T \left\{ \sigma^{ij} F_i^T F_j \right\} (\theta - \hat{\theta}) \tag{2.11}$$

with

$$F_i(\hat{\theta}) = \begin{bmatrix} f_{i1}^{(1)} & f_{i1}^{(2)} & \cdots & f_{i1}^{(p)} \\ f_{i2}^{(1)} & f_{i2}^{(2)} & \cdots & f_{i2}^{(p)} \\ \vdots & \vdots & & \vdots \\ f_{i(N_0+N)}^{(1)} & f_{i(N_0+N)}^{(2)} & \cdots & f_{i(N_0+N)}^{(p)} \end{bmatrix}. \tag{2.12}$$
There are no terms linear in $(\theta - \hat{\theta})$ due to the definition of $\hat{\theta}$ as the maximum likelihood estimate. Draper and Hunter (1966) state in their development that using equation 2.11 in equation 2.7 and performing the appropriate normalization yields

$$p_{N_0+N}(\theta \mid \sigma^{ij}, y) = (2\pi)^{-\frac{1}{2}p}\, |D|^{\frac{1}{2}}\, \exp\left\{-\frac{1}{2}(\theta - \hat{\theta})^T D\, (\theta - \hat{\theta})\right\}, \tag{2.13}$$

where

$$D = \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} F_i^T F_j. \tag{2.14}$$
By definition, maximization with respect to $\theta$ occurs when $\theta = \hat{\theta}$, evaluated after $N_0 + N$ observations. It thus remains to choose the design points so that the determinant

$$\left| \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} F_i^T F_j \right| \tag{2.15}$$

is maximized with respect to $\theta$ and the N design points (Draper and Hunter 1966). For practical implementation, one must calculate the values of $F_i$
using parameter estimates obtained from the $N_0$ observations rather than all $N_0 + N$ observations.

M. J. Box and Draper (1972) rely on results from Box and Draper (1965) and Draper and Hunter (1966) to derive a design point criterion for multiresponse design of experiments when $\Sigma$ is unknown. They begin with the assumption that little is known about the values of the parameters and that their prior distribution will change little over a region in which the likelihood is substantial. This reasoning leads to using independent noninformative priors for $\theta$ and $\Sigma$. A noninformative prior can be thought of as a guess that one hopes is unbiased. Box and Draper derive the marginal posterior distribution of $\theta$ as

$$p(\theta \mid y) = C\, |v_{ij}|^{-\frac{1}{2}N} \tag{2.16}$$

where

$$C = \left[\int |v_{ij}|^{-\frac{1}{2}N}\, d\theta\right]^{-1} \tag{2.17}$$

is the normalizing constant and $v_{ij}$ denotes the entire matrix whose ijth element is $v_{ij}$. Here the $\sigma^{ij}$ have been integrated out by comparing the joint posterior distribution with the Wishart distribution (Box and Draper 1965). Parameter estimation in a nonlinear regression setting can be accomplished by maximizing $|v_{ij}|^{-\frac{1}{2}N}$. The criterion chooses "that value $\hat{\theta}$ of $\theta$ for which the posterior density is a maximum. This maximum posterior density is a function of the experimental settings adopted, so that if some or all of the experimental settings have not yet been selected, they can be chosen so as to maximize the maximum of the posterior density with respect to $\theta$" (M. J. Box and Draper 1972). Using the result from equation 2.11 in equation 2.16 and letting $F_i(\hat{\theta}) = F_i$,

$$p(\theta \mid y) = C\, \left|\left\{\sum_{s=1}^{N} \delta_{is}\delta_{js} + (\theta - \hat{\theta})^T F_i^T F_j\, (\theta - \hat{\theta})\right\}\right|^{-\frac{1}{2}N} \tag{2.18}$$

where the curly brackets denote the ijth element of the matrix. Letting $\hat{v}_{ij} = \sum_{s=1}^{N} \delta_{is}\delta_{js}$, $V = \{\hat{v}_{ij}\}$, and $V^{-1} = \{\hat{v}^{ij}\}$, the maximization of

$$\left| V + \left\{(\theta - \hat{\theta})^T F_i^T F_j\, (\theta - \hat{\theta})\right\} \right|^{-\frac{1}{2}N} \tag{2.19}$$

is required. Equivalently, one can maximize the approximation (M. J. Box and Draper 1972)

$$1 - \frac{1}{2}N(\theta - \hat{\theta})^T \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j\, (\theta - \hat{\theta}). \tag{2.20}$$
It then remains to choose design points so that the determinant

$$\left| N \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j \right| \tag{2.21}$$
is maximized. According to Seber and Wild (1989), $N V^{-1}$ is the maximum-likelihood estimator of $\Sigma^{-1}$. Further, it is appealing that the unknown variance-covariance elements are estimated by their natural estimators: 1/N times the sums of squares and cross products of the differences between observed and modeled responses (M. J. Box and Draper 1972). Since, in the design stage, the residuals of the unperformed experiments are unknown, all data vectors from completed experiments should be used to estimate the elements of the inverse variance-covariance matrix.

The multiresponse criterion can be viewed as a weighted sum of a single-response criterion taken over all possible response pairs. The criterion is defined as

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T(X; \hat{\theta})\, F_j(X; \hat{\theta}) \right| \tag{2.22}$$

where $X = [x_1, x_2, \ldots, x_N]^T$ is a set of N data points. Note that when r = 1, the criterion in equation 2.22 is identical to the univariate criterion given by Box and Lucas (1959).

2.2 The Multilayer Perceptron Criterion. The multilayer perceptrons addressed in this paper consist of an input layer, a single hidden layer, and an output layer. The multilayer perceptrons are trained using instantaneous backpropagation with momentum. Sigmoids are applied at the hidden and output layers. We allow any number of output nodes and assume initially that the number of output nodes is equal to the number of classes in the current discrimination problem. Designs for single-output multilayer perceptrons used for two-class discrimination problems are addressed in a companion paper (Belue and Bauer 1995).

Typically, the parameters in a multilayer perceptron are called weights, and the set of all weights is denoted w. Since there is no natural ordering for these weights, an ordering is established here. The weights below the hidden layer are listed first, with all the weights emanating from an input node listed together. Then the weights above the hidden layer are listed, with the weights emanating from a hidden node listed together. So,

$$w = (w^1_{11}, w^1_{12}, \ldots, w^1_{1m}, w^1_{21}, w^1_{22}, \ldots, w^1_{2m}, \ldots, w^1_{nm}, \xi^1_1, \xi^1_2, \ldots, \xi^1_m, w^2_{11}, w^2_{12}, \ldots, w^2_{1r}, w^2_{21}, w^2_{22}, \ldots, w^2_{2r}, \ldots, w^2_{mr}, \xi^2_1, \xi^2_2, \ldots, \xi^2_r)^T \tag{2.23}$$
where n is the number of inputs (the dimension of the input vector), m is the number of hidden nodes, r is the number of outputs, and $\xi^l_j$ (j = 1, ..., r in the output layer and j = 1, ..., m in the hidden layer) are the weights connected to the bias elements for the lth layer. The dimension of this weight vector is given by

$$p = (n+1)m + (m+1)r. \tag{2.24}$$

The input vectors will be denoted $x_s = (x_{s1}, x_{s2}, \ldots, x_{sn})$, s = 1, ..., N. The nonlinear regression responses are called outputs in the multilayer perceptron arena and will be denoted $z^s_j$ for the jth output node's value given the sth input vector. The N × p matrix of first partials for output j in multilayer perceptron notation is

$$F_j(w) = \left\{\frac{\partial z^s_j}{\partial w_t}\right\}, \quad s = 1, \ldots, N;\; j = 1, \ldots, r;\; t = 1, \ldots, p \tag{2.25}$$

where

$$w_t = \begin{cases} w^1_{\lceil t/m \rceil,\; t - m(\lceil t/m \rceil - 1)} & \text{for } t \le (n+1)m \\ w^2_{\lceil (t-(n+1)m)/r \rceil,\; t - r(\lceil (t-(n+1)m)/r \rceil - 1) - (n+1)m} & \text{for } (n+1)m < t \le (n+1)m + (m+1)r \end{cases} \tag{2.26}$$

and $\lceil \alpha \rceil$ denotes the smallest integer greater than or equal to $\alpha$. Then, for lower-layer weights,

$$\frac{\partial z^s_j}{\partial w^1_{ki}} = z^s_j (1 - z^s_j)\, w^2_{ij}\, x^{1s}_i (1 - x^{1s}_i)\, x^s_k \tag{2.27}$$

where $x^{1s}_i$ is the activation of the ith hidden node given the input vector s. Similarly, for upper-layer weights,

$$\frac{\partial z^s_j}{\partial w^2_{ij}} = z^s_j (1 - z^s_j)\, x^{1s}_i. \tag{2.28}$$
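To make the bookkeeping concrete, the following sketch (our illustration, not code from the paper) evaluates equations 2.27 and 2.28 with NumPy and assembles the N × p matrix $F_j(w)$ of equation 2.25, flattening the weights in the order of equation 2.23; all function and variable names are our own.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def jacobian_Fj(X, W1, b1, W2, b2, j):
    """Assemble the N x p matrix F_j(w) of equation 2.25.

    X  : (N, n) input vectors
    W1 : (n, m) lower-layer weights, W1[k, i] = w^1_{ki}
    b1 : (m,)   hidden biases xi^1
    W2 : (m, r) upper-layer weights, W2[i, j] = w^2_{ij}
    b2 : (r,)   output biases xi^2
    j  : index of the output node
    """
    N, n = X.shape
    m, r = W2.shape
    X1 = sigmoid(X @ W1 + b1)                 # (N, m) hidden activations x^{1s}_i
    Z = sigmoid(X1 @ W2 + b2)                 # (N, r) outputs z^s_j
    g = Z[:, j] * (1.0 - Z[:, j])             # common factor z_j(1 - z_j)
    # Equation 2.27: dz_j/dw^1_{ki} = z_j(1-z_j) w^2_{ij} x^1_i(1-x^1_i) x_k
    hid = X1 * (1.0 - X1) * W2[:, j]          # (N, m)
    F_lower = g[:, None, None] * X[:, :, None] * hid[:, None, :]   # (N, n, m)
    F_b1 = g[:, None] * hid                   # bias input is the constant 1
    # Equation 2.28: dz_j/dw^2_{ij} = z_j(1-z_j) x^1_i (zero for other outputs)
    F_upper = np.zeros((N, m, r))
    F_upper[:, :, j] = g[:, None] * X1
    F_b2 = np.zeros((N, r))
    F_b2[:, j] = g
    # Flatten in the ordering of equation 2.23: lower weights, xi^1, upper, xi^2.
    return np.concatenate([F_lower.reshape(N, n * m), F_b1,
                           F_upper.reshape(N, m * r), F_b2], axis=1)

The returned matrix has $(n+1)m + (m+1)r$ columns, matching the weight-vector dimension p of equation 2.24.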
The elements of the estimated variance-covariance matrix for the outputs based on N exemplars are

$$\hat{\Sigma}_{ij} = \frac{1}{N}\hat{v}_{ij} = \frac{1}{N}\sum_{s=1}^{N} [d^s_i - z^s_i][d^s_j - z^s_j] \tag{2.29}$$
where $d^s_i$ is the desired output for the ith output node given the sth exemplar and $z^s_i$ is the actual output of the ith output node for the sth exemplar.

The design point criterion for a multiple-output multilayer perceptron follows directly from the theoretical development that begins with equation 2.1. This development assumes the presence of gaussian noise. Such an assumption is appropriate for regression problems but may not be appropriate for classification problems (Rumelhart 1995). In the example classification problems that follow, we have used gaussian noise as an approximation.

Note that in order to calculate $F_j(w)$, the weight vector w, or some estimate of it, must be known. An initial estimate $\hat{w}$ is obtained by training a multilayer perceptron on existing data vectors. Once $\hat{w}$ is determined, $F_j(\hat{w})$ can be formed. A second use of this initial data set is to determine the appropriate network architecture; Steppe et al. (1995) develop methods for feature input and model selection. The design point criterion in multilayer perceptron notation is

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T(X; \hat{w})\, F_j(X; \hat{w}) \right|. \tag{2.30}$$

Our proposed algorithm selects and queries several design points at a time, in contrast to selecting and querying a single design point. The obvious advantage of choosing several points is that multilayer perceptron training time is reduced; however, this advantage is tempered by the costs inherent in obtaining the extra data points.

3 Method of Maximization

Powell's maximization method was employed for finding the optimal design points (Powell 1973; Reklaitis et al. 1983; Press et al. 1989; Vanderplaats 1984). Powell's method is a zero-order algorithm, in that only evaluations of the original function are used to determine the maximum. According to Vanderplaats (1984), Powell's method, including its subsequent modifications, is one of the most efficient and reliable of the zero-order methods. The algorithm is based on the concept of conjugate directions. Given an n × n symmetric matrix H, the directions $S^{(1)}, S^{(2)}, \ldots, S^{(r)}$, r ≤ n, are said to be H-conjugate if the directions are linearly independent and

$$(S^{(i)})^T H S^{(j)} = 0, \quad i \ne j. \tag{3.1}$$

The significance of conjugacy is that a quadratic function will be maximized in n or fewer conjugate search directions. (See Belue 1995 and Reklaitis et al. 1983 for more details on the method.) As a practical matter, the maximization of $\left|\sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j\right|$ requires some region of operability over which to search.
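A minimal sketch (ours, not the authors' code) of how such a search might be organized: the determinant of equation 2.30 becomes a scalar objective over the N candidate points, which is handed to a bounded Powell routine (here SciPy's implementation, an assumption of this sketch; the paper cites Powell 1973 and the optimization texts above). The [0, 1] bounds anticipate the feature normalization described next, and jacobian_Fj is the helper sketched after equation 2.28.

import numpy as np
from scipy.optimize import minimize

def design_criterion(xflat, net, Vinv, N, n):
    """Negative log-determinant of equation 2.30 for a candidate design X.

    net  : tuple (W1, b1, W2, b2) of trained network parameters
    Vinv : (r, r) array of estimated inverse variance-covariance weights
    """
    X = xflat.reshape(N, n)
    r = Vinv.shape[0]
    F = [jacobian_Fj(X, *net, j) for j in range(r)]          # r matrices, N x p
    D = sum(Vinv[i, j] * F[i].T @ F[j]
            for i in range(r) for j in range(r))             # p x p
    sign, logdet = np.linalg.slogdet(D)
    return 1e12 if sign <= 0 else -logdet                    # minimize the negative

def choose_design_points(net, Vinv, N, n, restarts=10, seed=None):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):                                # multiple runs; keep best
        x0 = rng.uniform(0.0, 1.0, size=N * n)               # features in [0, 1]
        res = minimize(design_criterion, x0, args=(net, Vinv, N, n),
                       method="Powell", bounds=[(0.0, 1.0)] * (N * n))
        if best is None or res.fun < best.fun:
            best = res
    return best.x.reshape(N, n)

Because Powell's method converges only to a local maximum, the restart loop mirrors the multiple-run strategy described below.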
We assumed that all data sets were normalized so that each feature was in the range [0, 1]. This is standard practice for multilayer perceptron inputs (Rogers and Kabrisky 1991). Powell's method is an unconstrained maximization algorithm; hence, it was modified so that after a locally good search direction is established, a constrained line maximization is performed to find the optimum point along the search direction. Powell's algorithm will converge in n search directions if the function is quadratic. Reklaitis et al. (1983) state that "it can be proven rigorously under reasonable hypothesis that the method will converge to a local maximum and will do this at a superlinear rate." Since convergence is to a local (not global) maximum, different starting points will result in different solutions. Therefore, multiple runs were performed, and the run producing the maximum value of the determinant was used. Finally, since Powell's method does not guarantee global optimality, it is possible that the generated design points will be suboptimal.

4 Example Discrimination Problem

As a first demonstration of the multiresponse criterion, a two-dimensional, four-class discrimination problem was used. We trained the multilayer perceptron 30 times using a subset of 100 of the training vectors. Each of the resulting weight vectors was tested on a remaining (disjoint) set of 500 training vectors; we chose the weight vector with the lowest classification error as the starting point for calculating the criterion. Classification error is defined as the percentage of vectors incorrectly classified in a data set. Figure 1 shows the discrimination problem and the multilayer perceptron boundary resulting from this initial weight vector.

Powell's method was then used to maximize the determinant criterion and choose 74 (letting N = p) design points. The chosen design points were classified according to the truth model and added to the training set. The multilayer perceptron was then retrained with the entire set of 174 exemplars (100 initial + 74 design points). Figure 2 shows the resulting design points and the boundary formed by the trained multilayer perceptron. For purposes of comparison, the multilayer perceptron was also trained with 74 randomly chosen points added to the training set. Table 1 shows the average test set classification error (after 1000 epochs) for 30 independently trained multilayer perceptrons for the initial training data and for the design point method training data, as compared to randomly chosen design points.

5 Reducing the Complexity of Design Point Determination

The design point criterion (see equation 2.30) is relatively simple in form. However, it requires a large number of calculations. To calculate the design point criterion for a single candidate set of design points requires $r^2$ evaluations of the form $\hat{v}^{ij} F_i^T F_j$.
Figure 1: “True” boundary and initial learned boundary.
Figure 2: Design points and resulting boundary.
Table 1: Summary of Test Set Classification Error.

Method         Training set                                       Classification error (%) (30 networks), mean ± std. error
Initial        100 initial training vectors                       26.85 ± .93
Random         100 initial training vectors + 74 random points    20.16 ± .49
Design points  100 initial training vectors + 74 chosen points    18.27 ± .49
Each matrix $F_j$ requires Np evaluations of the derivative $\partial z^s_j / \partial w_t$. In turn, these derivatives require evaluation of the multilayer perceptron at each of the N points. All of these calculations are required for a single evaluation of a single term in the design point criterion, and Powell's algorithm may require thousands of evaluations of the criterion. Needless to say, simplification of the criterion would be a desirable step toward practical implementation.

5.1 A Reduced Criterion. The multiresponse design point criterion (equation 2.30) could be simplified if the responses were uncorrelated, that is, if $N V^{-1}$ (the maximum likelihood estimator of $\Sigma^{-1}$) were a diagonal matrix. Then, rather than $r^2$ terms of the form $\hat{v}^{ij} F_i^T F_j$, there would be r terms of that form. One way to accomplish this goal is to "preset" the desired outputs in a manner that will produce uncorrelated actual outputs, resulting in a nearly diagonal estimated variance-covariance matrix. We attempt to produce such a situation by training the multilayer perceptron to desired outputs based on the rows of modified Hadamard matrices (Roberts 1989).

The Hadamard matrix is a square array whose elements are only +1 and −1 and whose rows and columns are orthogonal to one another. A symmetric Hadamard matrix whose first column contains only +1's is known as the normal form of the Hadamard matrix. The lowest-order Hadamard matrix is of order two:

$$H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \tag{5.1}$$
Higher-order matrices, with orders restricted to powers of two, can be obtained from the recursive relationship

$$H_N = H_{N/2} \otimes H_2 \tag{5.2}$$
Table 2: Traditional Desired Outputs for Four-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4
Output 1     1        0        0        0
Output 2     0        1        0        0
Output 3     0        0        1        0
Output 4     0        0        0        1
Table 3: Desired Outputs for Four-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4
Output 1     1        1        1        1
Output 2     1        0        1        0
Output 3     1        1        0        0
Output 4     1        0        0        1
where N is a power of 2 and $\otimes$ denotes the Kronecker product, defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mm}B \end{bmatrix} \tag{5.3}$$

(Beauchamp 1984).

For classification problems, multilayer perceptron desired outputs are typically binary words denoting class membership. A "traditional" coding scheme is shown in Table 2. Note that if we estimate the correlation between output 1 and output 2 based on these four vectors, there is a nonzero correlation. Similarly, the correlation is nonzero for the other outputs. Adjusting the Hadamard matrix (see Table 3) to include 0's rather than −1's, we find that the estimated correlation is 0 for all outputs. Notice that output 1 is the same (one) for each class and makes no contribution to classification. Therefore, this output will be removed, leaving three outputs for the classification of four classes. Assuming an equal number of training exemplars from each class, the idealized form of the variance-covariance matrix for the three desired outputs is
$$\tilde{\Sigma} = \begin{bmatrix} 0.3333 & 0 & 0 \\ 0 & 0.3333 & 0 \\ 0 & 0 & 0.3333 \end{bmatrix}. \tag{5.4}$$
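The coding in Table 3 is mechanical to generate. A small sketch (ours), using the recursion in equation 5.2: build $H_N$ by repeated Kronecker products, replace the −1 entries with 0, and drop the constant first output.

import numpy as np

def hadamard_targets(num_classes):
    """Modified Hadamard desired outputs, one column per class (cf. Table 3).

    Assumes num_classes is a power of two; replaces -1 entries with 0 and
    drops the constant first row, which contributes nothing to classification.
    """
    H = np.array([[1.0]])
    while H.shape[0] < num_classes:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))  # equation 5.2
    H[H < 0] = 0.0
    return H[1:, :]        # (num_classes - 1) outputs x num_classes classes

# For four classes this reproduces the last three rows of Table 3:
print(hadamard_targets(4))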
Table 4: Desired Outputs for Seven-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4  Class 5  Class 6  Class 7
Output 1     1        1        1        1        1        1        1
Output 2     1        0        1        0        0.5      0.5      0.5
Output 3     1        1        0        0        0.5      0.5      0.5
Output 4     1        0        0        1        0.5      0.5      0.5
Output 5     1        1        1        1        0        0.8      0.8
Output 6     1        1        1        1        1        0        0.83333
Output 7     1        1        1        1        1        1        0
Belue (1995) offers an algorithm for forming uncorrelated outputs in the case where the number of classes is not a power of two. The desired outputs for seven-class discrimination (required in Section 5.3) are shown in Table 4.

Actual outputs will not equal desired outputs but should approach them as training progresses and network accuracy improves. Then, in calculating the estimated variance-covariance matrix for use with the design point criterion, one would expect to find nearly zero off-diagonal elements. If the form is approximately diagonal, the design point criterion can be reduced to

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r} \frac{N}{\tilde{\Sigma}_{ii}}\, F_i^T F_i \right|, \tag{5.5}$$

significantly reducing the number of calculations. Design points are then chosen based on this reduced criterion.

5.2 Example Problem Using Reduced Criterion. The four-class problem from Section 4 is used to demonstrate the simplified criterion. The desired outputs for only 50 input vectors were coded as in Table 3, and a multilayer perceptron with 10 hidden nodes was trained. The inverse variance-covariance matrix, $\hat{\Sigma}^{-1}$, estimated using the outputs from the trained network is
$$\hat{\Sigma}^{-1} = \begin{bmatrix} 12.4911 & 1.7141 & -6.2505 \\ 1.7141 & 34.2288 & -2.5606 \\ -6.2505 & -2.5606 & 12.9611 \end{bmatrix}. \tag{5.6}$$
This matrix is not approximately diagonal. However, at least four of the elements could be considered 0, reducing the matrix to
$$\hat{\Sigma}^{-1} = \begin{bmatrix} 12.4911 & 0.0000 & -6.2505 \\ 0.0000 & 34.2288 & 0.0000 \\ -6.2505 & 0.0000 & 12.9611 \end{bmatrix}. \tag{5.7}$$
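The reduction just performed amounts to zeroing the off-diagonal entries of $\hat{\Sigma}^{-1}$ that are small in magnitude and keeping only the surviving terms of the criterion. A small sketch (ours, with an assumed threshold of 3.0) reproduces equation 5.7:

import numpy as np

def reduce_terms(Sigma_inv, threshold):
    """Zero small entries of the inverse variance-covariance matrix and list
    the (i, j, weight) terms that survive in the design criterion."""
    A = np.where(np.abs(Sigma_inv) < threshold, 0.0, Sigma_inv)
    terms = [(i, j, A[i, j]) for i in range(A.shape[0])
             for j in range(A.shape[1]) if A[i, j] != 0.0]
    return A, terms

Sigma_inv = np.array([[12.4911, 1.7141, -6.2505],
                      [1.7141, 34.2288, -2.5606],
                      [-6.2505, -2.5606, 12.9611]])
A, terms = reduce_terms(Sigma_inv, threshold=3.0)   # yields equation 5.7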
The estimated variance-covariance matrix will be different for every multilayer perceptron trained, yet one would expect an approximately equivalent structure. The relatively large value of the (2, 2) element of $\hat{\Sigma}^{-1}$ stems from the class coding and the multilayer perceptron training. The second desired output is coded 1 for both class 1 and class 2 and 0 for class 3 and class 4. Since the multilayer perceptron can accurately discriminate between a superclass containing classes 1 and 2 and a superclass containing classes 3 and 4 (Fig. 1 shows an example of this phenomenon), the estimated variance of the second output is small. This small variance results in a large value in the inverse matrix. The design criterion reduces to approximately

$$|\hat{D}| = \left| 12.4911\, F_1^T F_1 - 6.2505\, F_1^T F_3 + 34.2288\, F_2^T F_2 - 6.2505\, F_3^T F_1 + 12.9611\, F_3^T F_3 \right|. \tag{5.8}$$
The original criterion with 16 terms has been reduced by 69%. The reduced criterion was used to determine 74 design points. Using the truth model, these 74 design points were classified and added to the training set. Figure 3 shows the design points and the resulting multilayer perceptron boundary. Table 5 compares the average test set classification error (after 1000 epochs) for the simplified criterion method, the "full" criterion method, and a randomly selected set of points.

5.3 Armor-Piercing Incendiary Projectile Application. In this section, we demonstrate the use of the design point criterion on a more realistic and challenging problem. An armor-piercing incendiary (API) projectile is a bullet that can perforate light armor. The projectile contains a flammable mixture encased in the nose of the projectile body. The design of the projectile allows the jacket over the nose to deform or peel off on impact with the target skin. Consequently, the incendiary mixture flashes as the steel core of the bullet continues its flight. The intense flash following jacket failure of an API is known as incendiary functioning (IF). There are five classifications for IF: complete, partial, slow burn, delayed, and nonfunctioning.

Prediction equations for API penetration mechanics are an essential part of aircraft vulnerability analysis. U.S. government agencies perform test shots in order to gather the data used in the development of prediction models. These test shots are time-consuming due to the necessary
Figure 3: Design points and resulting boundary for reduced criterion.
Table 5: Summary of Test Set Classification Error (30 Runs).

Method                Training set                                                             Classification error (%) (30 networks), mean ± std. error
Initial               50 initial training vectors                                              28.17 ± .68
Full criterion        50 initial training vectors + 74 chosen points (full criterion)          20.36 ± .49
Simplified criterion  50 initial training vectors + 74 chosen points (simplified criterion)    23.25 ± .52
Random                50 initial training vectors + 74 random points                           25.37 ± .55
preparations before each shot, and expensive due to the high costs inherent in the testing itself. This example highlights the real-world benefits of selecting a set of points prior to testing rather than setting up a sequence of test shots, each of which depends on the previous shots. Methods for predicting API functioning have been developed based on the physical characteristics of the system. Force (the pressure opposing the projectile’s penetration) and impulse (the time force is applied) have been
identified as the defining criteria for classifying IF types. Stress (the force per unit area) is given as $\sigma$ at normal obliquity. The term $\Delta\sigma$ is an additive correction given for other obliquity angles. The sum of the two stresses, $\sigma$ and $\Delta\sigma$, equals the total stress effect. A time parameter $t^*$ is used in place of impulse. With $\sigma + \Delta\sigma$ and $t^*$ calculated, stress-time plots are created. These plots delineate zones where the five IF classes occur. Figure 4 provides an example of one of these plots.

The time-stress plot in Figure 4 was rescaled and normalized to produce the plot shown in Figure 5. Each of the zones shown is numbered class 1 through class 7. This plot will be used as the truth model, and data will be generated from it. In this fashion, we can generate data with a definite real-world flavor but avoid testing costs. Further, the problem has the added appeal of having a low-dimensional feature space, so that graphical presentation of the analysis is feasible. The data were generated by simply drawing two independent random deviates from a uniform distribution bounded between 0.0 and 1.0. These pairs become coordinates on the scaled time-stress plot, and class membership was determined accordingly.

We started with 100 randomly drawn data points input to a multilayer perceptron with 10 middle nodes. To study the performance of the reduced criterion, the desired outputs for the 100 input vectors were coded as in Table 4, and a multilayer perceptron with 10 hidden nodes was trained. The inverse variance-covariance matrix, $\hat{\Sigma}^{-1}$, was estimated using outputs from the trained network. The matrix was not completely diagonal; however, at least 18 of the elements could be considered 0, reducing the matrix to

$$\hat{\Sigma}^{-1} = \begin{bmatrix} 6.0117 & 0.0000 & -3.8187 & 0.0000 & 0.0000 & 0.0000 \\ 0.0000 & 12.5071 & 0.0000 & 0.0000 & 0.0000 & 4.6405 \\ -3.8187 & 0.0000 & 7.6334 & -3.1881 & 0.0000 & -3.5436 \\ 0.0000 & 0.0000 & -3.1881 & 9.3183 & 0.0000 & 3.6686 \\ 0.0000 & 0.0000 & 0.0000 & 0.0000 & 10.7487 & 11.5373 \\ 0.0000 & 4.6405 & -3.5436 & 3.6686 & 11.5373 & 82.8441 \end{bmatrix}. \tag{5.9}$$

The resulting reduced design point criterion contains 63% fewer terms than the full criterion. The reduced criterion was used to determine 30 design points. Using the truth model, these 30 design points were classified and added to the training set. Table 6 displays the average test set classification error for the initial data, the initial data plus 30 points selected by the simplified criterion, and the initial data plus 30 randomly selected points. Here, the error when using the simplified criterion method is significantly less (in a statistical sense) than the error for the randomly selected points ($\alpha = 0.05$).

6 Summary

In this article, a Bayesian technique for selecting multilayer perceptron data is developed. The technique selects these training exemplars to best update
Figure 4: Stress-time plot.
the multilayer perceptron weights. At the core of the methodology is a statistical criterion based on minimizing the size of the joint confidence ellipsoid for the weights in the multilayer perceptron. The methodology is illustrated with a simple four-class example. A more substantive seven-class discrimination problem is then discussed, dealing with the classification of the performance of armor-piercing incendiary projectiles.

The work presented here extends previous research in several areas. First, a Bayesian multivariate criterion is developed for use with multiple-output neural networks. Second, the implementation of Powell's algorithm illustrates a practical method of optimizing the design point criterion. Next,
Figure 5: Truth model for seven-class stress-time plot discrimination problem.
Table 6: Stress-Time Plot Average Test Set Classification Error Comparisons (30 Runs).

Method                Training set                                                              Classification error (%) (30 networks), mean ± std. error
Initial               100 initial training vectors                                              27.38 ± .75
Simplified criterion  100 initial training vectors + 30 chosen points (simplified criterion)    19.50 ± .85
Random                100 initial training vectors + 30 random points                           22.43 ± 1.05
multiple design points are chosen in a single iteration rather than greedily selecting a single next design point. Having developed a multioutput criterion, we then show how this criterion can be simplified with an elementary modification to the values of the desired outputs. In summary, we show that by choosing data according to a Bayesian criterion, it is possible to achieve significantly better performance for a given amount of data.
Figure 6: Stress-time plot problem: Design points and resulting multilayer perceptron boundary.
Figure 7: Stress-time plot average test set classification error comparisons (30 runs).
References

Baum, E. B., and Lang, K. J. 1991. Constructing hidden units using examples and queries. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 904–910. Morgan Kaufmann, San Mateo, CA.
Beauchamp, K. G. 1984. Applications of Walsh and Related Functions. Academic Press, New York.
Belue, L. M. 1995. Selecting Optimal Experiments for Feedforward Multilayer Perceptrons. Ph.D. dissertation, Air Force Institute of Technology, Wright-Patterson AFB, OH.
Belue, L. M., and Bauer, K. W., Jr. 1995. Selecting and ranking optimal experiments for single-output multilayer perceptrons. Submitted for publication to Journal of Statistical Computation and Simulation.
Box, G. E. P., and Draper, N. R. 1965. The Bayesian estimation of common parameters from several responses. Biometrika 52, 355–365.
Box, M. J., and Draper, N. R. 1972. Estimation and design criteria for multiresponse non-linear models with non-homogeneous variance. Applied Statistics 21, 13–24.
Box, G. E. P., and Hunter, W. G. 1965. Sequential design of experiments for nonlinear models. Proceedings of IBM Scientific Computing Symposium in Statistics, 113–137.
Box, G. E. P., and Lucas, H. L. 1959. Design of experiments in non-linear situations. Biometrika 46, 77–90.
Cohn, D. A. 1994. Neural network exploration using optimal experimental design. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 679–686. Morgan Kaufmann, San Mateo, CA.
Cohn, D. A., Atlas, L., and Ladner, R. 1990. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 566–573. Morgan Kaufmann, San Mateo, CA.
Draper, N. R., and Hunter, W. G. 1966. Design of experiments for parameter estimation in multi-response situations. Biometrika 53, 525–533.
Fedorov, V. V. 1972. Theory of Optimal Experiments. Academic Press, New York.
Hecht-Nielsen, R. 1989. Neurocomputing. Addison-Wesley, Reading, MA.
Hwang, J. N., Choi, J. J., Oh, S., and Marks, R. 1991. Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks 2, 131–136.
Kinzel, W., and Ruján, P. 1990. Improving a network generalization ability by selecting examples. Europhysics Letters 13, 473–477.
MacKay, D. J. C. 1992. Information-based objective functions for active data selection. Neural Computation 4, 590–604.
Plutowski, M., and White, H. 1993. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks 4, 305–318.
Powell, M. J. D. 1973. On search directions for minimization algorithms. Mathematical Programming 4, 193–201.
Press, W. H., et al. 1989. Numerical Recipes. Cambridge University Press, Cambridge.
Reklaitis, G. V., et al. 1983. Engineering Optimization: Methods and Applications. John Wiley, New York.
Roberts, J. E. 1989. Influence of target-vector code selection on the performance of a neural-network word recognizer. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, II-628.
Rogers, S. K., and Kabrisky, M. 1991. An Introduction to Biological and Artificial Neural Networks for Pattern Recognition. SPIE Optical Engineering Press, Bellingham, WA.
Rumelhart, D. 1995. Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ.
St. John, R. C., and Draper, N. R. 1975. D-optimality for regression designs: A review. Technometrics 17, 15–23.
Seber, G. A. F., and Wild, C. J. 1989. Nonlinear Regression. John Wiley, New York.
Sollich, P. 1994. Query construction, entropy and generalization in neural network models. Physical Review E 49, 4637–4651.
Steppe, J. M., Bauer, K. W., Jr., and Rogers, S. K. 1995. Integrated feature and architecture selection. Submitted to IEEE Transactions on Neural Networks.
Vanderplaats, G. N. 1984. Numerical Optimization Techniques for Engineering Design with Applications. McGraw-Hill, New York.
Watkin, T. L. H., and Rau, A. 1992. Selecting examples for perceptrons. J. Phys. A: Math. Gen. 25, 113–121.
Received April 7, 1995; accepted April 18, 1996.
Communicated by Jude Shavlik
A Penalty-Function Approach for Pruning Feedforward Neural Networks

Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, Republic of Singapore
This article proposes the use of a penalty function for pruning feedforward neural networks by weight elimination. The proposed penalty function consists of two terms: the first discourages the use of unnecessary connections, and the second prevents the weights of the connections from taking excessively large values. Simple criteria for eliminating weights from the network are also given. The effectiveness of this penalty function is tested on three well-known problems: the contiguity problem, the parity problems, and the monks problems. The resulting pruned networks obtained for many of these problems have fewer connections than previously reported in the literature.
1 Introduction

We are concerned in this article with finding a minimal feedforward backpropagation neural network for solving the problem of distinguishing patterns from two or more sets in N-dimensional space. Backpropagation feedforward neural networks have been gaining acceptance as the method of choice for this pattern classification problem. One of the difficulties with using feedforward neural networks is the need to determine the optimal number of hidden units before the training process can begin. Too many hidden units may lead to overfitting of the data and poor generalization; too few hidden units may not give a network that learns the data.

Two different approaches have been proposed to overcome the problem of determining the optimal number of hidden units required by an artificial neural network to solve a given problem. The first approach begins with a minimal network and adds more hidden units only when they are needed to improve the learning capability of the network. The second approach begins with an oversized network and then prunes redundant hidden units.

Much research has been done on algorithms that start with a minimal network and dynamically construct the final network. These algorithms include dynamic node creation (Ash 1989), the cascade correlation
algorithm (Fahlman and Lebiere 1990), the tiling algorithm (Mezard and Nadal 1989), the self-organizing neural network (Tenorio and Lee 1990), and the upstart algorithm (Frean 1990). The aim of these algorithms is to find a network with the simplest architecture possible that is capable of correctly classifying all its input patterns.

Interest in algorithms that remove hidden units from an oversized network has also been growing (Chauvin 1989; Chung and Lee 1992; Mozer and Smolensky 1989; Hanson and Pratt 1989). Instead of removing unnecessary hidden units, individual weights (connections) in the network may also be removed to increase the generalization capability of a neural network (Le Cun et al. 1990; Hassibi and Stork 1993; Thodberg 1991). Typically, methods for removing weights from the network involve adding a penalty term to the error function (Hertz et al. 1991). It is hoped that by adding a penalty term to the error function, unnecessary connections will have small weights, and therefore the complexity of the network can be reduced significantly by pruning. The simplest and most commonly used penalty term is the sum of the squared weights.

After the training process has been completed, that is, it has converged to a local minimum point, a measure of the saliency of each connection in the network is computed. The saliency measure gives an indication of the expected increase in the error function after the corresponding connection is eliminated from the network. In the pruning methods optimal brain damage (Le Cun et al. 1990) and optimal brain surgeon (Hassibi and Stork 1993), the saliency of each connection is computed using a second-order approximation of the error function near a local minimum. If the saliency of a connection is below a certain threshold, the connection is removed from the network. The resulting network needs to be retrained if the increase in the error function is larger than a predetermined acceptable increase.

In this article, a simple scheme is proposed for removing redundant weights from a network that has been trained to minimize a penalty function. The effectiveness of the proposed pruning scheme is tested on several well-known problems: the contiguity problem, the monks problems, and the parity problems. In many cases, the resulting networks contain fewer connections than previously reported in the literature.

The organization of this article is as follows. In Section 2, we describe the penalty function and network pruning scheme. Experimental results are reported in Section 3. Finally, in Section 4, a brief discussion and final remarks are given.
2 Neural Network Training and Pruning

2.1 Neural Network Training. Let us consider the fully connected three-layer network depicted in Figure 1. The error function associated with this
Figure 1: Fully connected feedforward neural network with three hidden units and three output units.
network is normally defined as the sum of the squared errors:

$$E(w, v) = \frac{1}{2}\sum_{i=1}^{k}\sum_{p=1}^{C} (S_{pi} - t_{pi})^2, \tag{2.1}$$

where:

• k is the number of patterns.
• C is the number of output units.
• $t_{pi} = 0$ or 1 is the target value for pattern $x^i$ at output unit p, p = 1, 2, ..., C.
• $S_{pi}$ is the output of the network at output unit p,

$$S_{pi} = f\left(\sum_{j=1}^{h} f\!\left((x^i)^T w^j\right) v_{pj}\right) \tag{2.2}$$
• h is the number of hidden units in the network.
• $x^i$ is an N-dimensional input pattern, i = 1, 2, ..., k.
• $w^m$ is an N-dimensional vector of weights for the arcs connecting the input layer and the mth hidden unit, m = 1, 2, ..., h. The weight of the connection from the ℓth input unit to the mth hidden unit is denoted by $w_{m\ell}$.
• $v^m$ is a C-dimensional vector of weights for the arcs connecting the mth hidden unit and the output layer. The weight of the connection from the mth hidden unit to the pth output unit is denoted by $v_{pm}$.

The activation function f(y) is usually the sigmoid function, $\sigma(y) = 1/(1 + e^{-y})$, or the hyperbolic tangent function, $\delta(y) = (e^y - e^{-y})/(e^y + e^{-y})$. The main difference between these two functions is their range: the range of $\sigma(y)$ is [0, 1], while the range of $\delta(y)$ is [−1, 1]. Since we will be using the cross-entropy error function, which involves taking the logarithm of the predicted network outputs, the sigmoid function is applied at the output units. Karnin (1990) used the hyperbolic tangent function as the activation function at the hidden units; we have also used this function to obtain all the results reported in this paper.

It has been suggested by several authors (Lang and Witbrock 1988; van Ooyen and Nienhuis 1992) that the cross-entropy error function

$$F(w, v) = -\sum_{i=1}^{k}\sum_{p=1}^{C} \left[ t_{pi}\log S_{pi} + (1 - t_{pi})\log(1 - S_{pi}) \right] \tag{2.3}$$

improves the convergence of the training process. The components of the gradient of this function are

$$\frac{\partial F(w, v)}{\partial w_{m\ell}} = \frac{\partial F(w, v)}{\partial S_{pi}} \times \frac{\partial S_{pi}}{\partial w_{m\ell}} = \sum_{i=1}^{k}\sum_{p=1}^{C} \left[ e_{pi} \times v_{pm} \times \left(1 - \delta(x^i w^m)^2\right) \times x^i_\ell \right]$$

$$\frac{\partial F(w, v)}{\partial v_{pm}} = \frac{\partial F(w, v)}{\partial S_{pi}} \times \frac{\partial S_{pi}}{\partial v_{pm}} = \sum_{i=1}^{k} \left[ e_{pi} \times \delta(x^i w^m) \right]$$
for all m = 1, 2, ..., h, ℓ = 1, 2, ..., n, and p = 1, 2, ..., C, with the error for each pattern at output unit p defined as $e_{pi} = S_{pi} - t_{pi}$, i = 1, 2, ..., k.

The number of output units C depends on the number of different classes in the data. When there are only two classes, a single output unit is sufficient: each pattern can be given a target value equal to 0 or 1 depending on its class label. When there are more than two classes, C is set to the number of classes, and each pattern $x^i$ in class $C_c$, c = 1, 2, ..., C, is given a C-dimensional target vector $t^i$ such that $t_{pi} = 1$ if p = c and $t_{pi} = 0$ otherwise.

The difference between the gradients of the functions E(w, v) and F(w, v) is the presence of the product $S_{pi}(1 - S_{pi})$ in the gradient of the sum of the squared errors E(w, v). As pointed out in van Ooyen and Nienhuis (1992), when the values of $S_{pi}$ converge to either 0 or 1, the magnitude of the gradient of E(w, v) becomes small. This leads to a slowdown in the convergence of any minimization algorithm that utilizes gradient information to move from one iterate to the next.

2.2 Choice of Penalty Function and Criteria for Weight Elimination. When a network is to be pruned, it is common practice to add a penalty term to the error function during training. Usually the penalty term is the sum of the squared weights,

$$P_1(w, v) = \frac{\epsilon}{2}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} w_{m\ell}^2 + \sum_{m=1}^{h}\sum_{p=1}^{C} v_{pm}^2\right), \tag{2.4}$$
where $\epsilon$ is a small positive weight decay constant (Hinton 1989). This quadratic penalty term is used to discourage the weights from taking large values. If second-order approximations of the error function are employed to find a local minimum of the error function, the addition of this quadratic term also contributes to the stability of the training process: the penalty term 2.4 adds $\epsilon$ to the diagonal of the second derivative matrix of the error function. With this perturbation, it is more likely for the matrix to be positive definite, and hence a descent direction can be obtained.

There are some drawbacks to using this penalty term. When the backpropagation method is used to train the neural network, the addition of this quadratic penalty term will cause all the weights to decay exponentially to zero at the same rate (Hanson and Pratt 1989). The quadratic penalty also disproportionately penalizes large weights. To remedy this problem, the following penalty function has been suggested (Weigend et al. 1990):

$$P_2(w, v) = \frac{\epsilon}{2}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} \frac{w_{m\ell}^2}{1 + w_{m\ell}^2} + \sum_{m=1}^{h}\sum_{p=1}^{C} \frac{v_{pm}^2}{1 + v_{pm}^2}\right). \tag{2.5}$$
Since the value of the function $f(w) = w^2/(1 + w^2)$ is small when w is close to zero and approaches 1 as w becomes large, the function 2.5 can be considered a measure of the total number of nonzero weights in the network. The derivative $f'(w) = 2w/(1 + w^2)^2$ indicates that, for weights having large values, backpropagation training will be affected very little by the addition of the penalty function 2.5. Selecting suitable values for the backpropagation learning rate and $\epsilon$ will cause small weights to decay at a higher rate than large weights. A disadvantage of this penalty function is that it does not make any distinction between large and very large weights; that is, there exists a sufficiently large $w_1$ such that for all $|w| \ge w_1$, $1 - \nu \le f(w) \le 1$ for any small positive scalar $\nu$.

The reason we would like to have weights that are not too large is linked to the criteria for weight elimination developed below. Recall that given an input pattern $x^i$, the output of the network at unit p is given according to the equation

$$S_{pi} = \sigma\left(\sum_{j=1}^{h} \delta\!\left((x^i)^T w^j\right) v_{pj}\right). \tag{2.6}$$
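For concreteness, a minimal sketch (ours, not code from the article) of the forward pass in equation 2.6, with tanh hidden units and a sigmoid output layer; the array shapes are assumptions of this illustration.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def network_output(x, W, V):
    """Equation 2.6: S_p = sigma( sum_j delta(x^T w^j) v_pj ).

    x : (N,) input pattern
    W : (h, N) rows are the hidden-unit weight vectors w^j
    V : (C, h) V[p, j] = v_pj
    """
    hidden = np.tanh(W @ x)        # delta(.) is the hyperbolic tangent
    return sigmoid(V @ hidden)     # sigmoid applied at the output units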
The derivatives of $S_{pi}$ with respect to the weights of the network are as follows:

$$\frac{\partial S_{pi}}{\partial w_{m\ell}} = S_{pi}(1 - S_{pi})\, v_{pm}\, x^i_\ell\, \left(1 - \delta(x^i w^m)^2\right) \tag{2.7}$$

$$\frac{\partial S_{pi}}{\partial v_{qm}} = \begin{cases} S_{pi}(1 - S_{pi})\, \delta(x^i w^m) & \text{if } p = q \\ 0 & \text{otherwise.} \end{cases} \tag{2.8}$$
For the moment, let us consider $S_{pi}$ as a function of a single variable corresponding to the connection between the ℓth input unit and the mth hidden unit. From the mean value theorem (Grossman 1995, p. 173), we have that

$$S_{pi}(w) = S_{pi}(w_{m\ell}) + \frac{\partial S_{pi}\!\left(w_{m\ell} + \lambda(w - w_{m\ell})\right)}{\partial w_{m\ell}}\,(w - w_{m\ell}), \tag{2.9}$$
where $0 < \lambda < 1$. The maximum value of the product $S_{pi}(1 - S_{pi})$ is 0.25, and the maximum value of $(1 - \delta(x^i w^m)^2)$ is 1. Assuming that $x^i_\ell \in [0, 1]$, from equation 2.7 we obtain the bound

$$\left|\frac{\partial S_{pi}(w)}{\partial w_{m\ell}}\right| \le |v_{pm}|/4, \quad \forall w \in \mathbb{R}. \tag{2.10}$$

Combining equations 2.9 and 2.10 and setting w equal to zero, we obtain

$$\left|S_{pi}(0) - S_{pi}(w_{m\ell})\right| \le |v_{pm} w_{m\ell}|/4. \tag{2.11}$$
Equation 2.11 gives an upper bound on the change in the output of the network when the weight $w_{m\ell}$ is eliminated. Similarly, by considering $S_{pi}$ as a function of the single variable $v_{pm}$ that corresponds to the connection between the mth hidden unit and the pth output unit, from equation 2.8 we have the bound

$$\left|\frac{\partial S_{pi}(v)}{\partial v_{pm}}\right| \le 1/4, \quad \forall v \in \mathbb{R}. \tag{2.12}$$

Hence, the change in the pth output of the network after the weight $v_{pm}$ has been eliminated is bounded by

$$\left|S_{pi}(0) - S_{pi}(v_{pm})\right| \le |v_{pm}|/4. \tag{2.13}$$
A pattern is correctly classified if the following condition is satisfied:

$$|e_{pi}| = |S_{pi} - t_{pi}| \le \eta_1, \quad p = 1, 2, \ldots, C, \tag{2.14}$$
where $\eta_1 \in [0, 0.5)$. Suppose now that with the original fully connected network, we are able to meet a prespecified accuracy requirement. The bound in equation 2.11 shows that we can set $w_{m\ell}$ to zero without deteriorating the classification rate of the network if the product $|v_{pm} w_{m\ell}|$ is sufficiently small. Similarly, the bound in equation 2.13 suggests that $v_{pm}$ can be set to zero if its magnitude is sufficiently small. We have the following two propositions.

Proposition 1. Let (w, v) be a set of weights of a network with an accuracy rate of R, with the condition in equation 2.14 satisfied by each correctly classified input pattern. Let $\max_p |v_{pm} w_{m\ell}| \le 4\eta_2$, and let $\tilde{S}_{pi}$ be the output of the network with $w_{m\ell}$ set to zero; then for each pattern $x^i$ that is correctly classified by the original network, the following holds: $|\tilde{e}_{pi}| = |\tilde{S}_{pi} - t_{pi}| \le \eta_1 + \eta_2$, p = 1, 2, ..., C.

Proposition 2. Let (w, v) be a set of weights of a network with an accuracy rate of R, with the condition in equation 2.14 satisfied by each correctly classified input pattern. Let $|v_{pm}| \le 4\eta_2$, and let $\tilde{S}_{pi}$ be the output of the network with $v_{pm}$ set to zero. Then for each pattern $x^i$ that is correctly classified by the original network, the following holds: $|\tilde{e}_{pi}| = |\tilde{S}_{pi} - t_{pi}| \le \eta_1 + \eta_2$, p = 1, 2, ..., C.

As long as the sum $\eta_1 + \eta_2$ is less than 0.5, Propositions 1 and 2 ensure that the pruned network still achieves an accuracy rate of R. After $w_{m\ell}$ or $v_{pm}$ is set to zero, the absolute errors of each correctly classified pattern can
be made arbitrarily small by increasing the weights from the hidden units to the output units, as Proposition 3 suggests.

Proposition 3. Let (w, v) be a set of weights of a network with an accuracy rate of R such that for each correctly classified pattern $x^i$, the following condition holds: $|e_{pi}| = |S_{pi} - t_{pi}| \le \eta$, p = 1, 2, ..., C, for some $\eta \in [0, 0.5)$. For any $\rho \in (0, 1)$, there exists a sufficiently large scalar $\lambda > 0$ such that $|\hat{e}_{pi}| = |\hat{S}_{pi} - t_{pi}| \le \rho\eta$, where $\hat{S}_{pi}$ is the output of a new network with weights $(w, \lambda v)$.

Proposition 3 guarantees the existence of a new set of weights that still gives an accuracy rate of R with the same error tolerance in equation 2.14. It will be prudent to retrain the network to find this new set, so that the weights v of the connections between the hidden units and output units do not become too large: simply multiplying v by some large $\lambda$ would hinder the process of removing subsequent weights from the network. We summarize our pruning algorithm below.

Algorithm N2P2F: Neural Network Pruning with Penalty Function

1. Let $\eta_1$ and $\eta_2$ be positive scalars such that $\eta_1 + \eta_2 < 0.5$. ($\eta_1$ is the error tolerance; $\eta_2$ is a threshold that determines if a weight can be removed.)

2. Pick a fully connected network, and train this network to meet a prespecified accuracy level with the condition in equation 2.14 satisfied by all correctly classified input patterns. Let (w, v) be the weights of this network.

3. For each $w_{m\ell}$ in the network, if

$$\max_p |v_{pm} w_{m\ell}| \le 4\eta_2, \tag{2.15}$$

then remove $w_{m\ell}$ from the network.

4. For each $v_{pm}$ in the network, if

$$|v_{pm}| \le 4\eta_2, \tag{2.16}$$

then remove $v_{pm}$ from the network.
5. If no weight satisfies the condition in equation 2.15 or 2.16, then for each $w_{m\ell}$ in the network, compute $\omega_{m\ell} = \max_p |v_{pm} w_{m\ell}|$, and remove the $w_{m\ell}$ with the smallest $\omega_{m\ell}$.

6. Retrain the network. If the classification rate of the network falls below the specified level, then stop and use the previous setting of the network weights. Otherwise, go to step 3.

To reduce the retraining time, we remove all the weights that satisfy the conditions in equations 2.15 or 2.16 in steps 3 and 4 of the algorithm. Although it can no longer be guaranteed that the accuracy of the retrained network will not drop, our experiments show that many weights can be eliminated simultaneously without deteriorating the performance of the network. This is especially true when the starting fully connected network has an excessive number of redundant hidden units.

Because the two conditions in equations 2.15 and 2.16 for pruning depend on the magnitude of the weights for connections between the input units and the hidden units (see equation 2.15) and between the hidden units and the output units (see equation 2.16), it is imperative that during training these weights be prevented from getting too large. At the same time, small weights should be encouraged to decay rapidly to zero. To achieve these goals, we make use of both penalty functions in equations 2.4 and 2.5 to obtain the following function:
$$P(w, v) = \epsilon_1 \left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} \frac{\beta w_{m\ell}^2}{1 + \beta w_{m\ell}^2} + \sum_{m=1}^{h}\sum_{p=1}^{C} \frac{\beta v_{pm}^2}{1 + \beta v_{pm}^2}\right) + \epsilon_2 \left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} w_{m\ell}^2 + \sum_{m=1}^{h}\sum_{p=1}^{C} v_{pm}^2\right). \tag{2.17}$$
Hence, the complete error function to be minimized during the training process is

$$\theta(w, v) = P(w, v) - \sum_{i=1}^{k}\sum_{p=1}^{C} \left(t_{pi}\log S_{pi} + (1 - t_{pi})\log(1 - S_{pi})\right). \tag{2.18}$$
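To tie equations 2.15 through 2.17 together, here is a compact sketch (ours, not the author's implementation) of the penalty term and of steps 3 through 5 of algorithm N2P2F; the retraining in step 6 is left to whatever routine minimizes $\theta(w, v)$, and the default parameter values follow the experiments reported below.

import numpy as np

def penalty(W, V, eps1=0.1, eps2=1e-5, beta=10.0):
    """Equation 2.17 for the weight matrices W (input-to-hidden) and V."""
    def term(M):
        return (eps1 * np.sum(beta * M**2 / (1.0 + beta * M**2))
                + eps2 * np.sum(M**2))
    return term(W) + term(V)

def prune_step(W, V, eta2=0.10):
    """Steps 3-5 of N2P2F: zero prunable weights in place, return count removed.

    W : (h, n) with W[m, l] = w_{ml};  V : (C, h) with V[p, m] = v_{pm}
    """
    omega = np.max(np.abs(V), axis=0)[:, None] * np.abs(W)   # max_p |v_pm w_ml|
    removable_w = (omega <= 4.0 * eta2) & (W != 0.0)         # equation 2.15
    removable_v = (np.abs(V) <= 4.0 * eta2) & (V != 0.0)     # equation 2.16
    removed = int(removable_w.sum() + removable_v.sum())
    if removed == 0:                                         # step 5: force one removal
        alive = np.where(W != 0.0, omega, np.inf)
        W[np.unravel_index(np.argmin(alive), W.shape)] = 0.0
        return 1
    W[removable_w] = 0.0
    V[removable_v] = 0.0
    return removed

In use, one would alternate prune_step with retraining until the classification rate falls below the required level, keeping the last network that met it.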
The values of the weight decay parameters $\epsilon_1, \epsilon_2 > 0$ must be chosen to reflect the relative importance of the accuracy of the network versus its complexity. These parameters also determine the range of values where the penalty for each weight in the network is approximately equal to $\epsilon_1$.
Figure 2: Plots of the function $f(w) = \epsilon_1 \beta w^2/(1 + \beta w^2) + \epsilon_2 w^2$ and its derivative. (a, b): $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$. (c, d): $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-6}$, and $\beta = 10$.
Consider the plot of the function $f(w) = \epsilon_1 \beta w^2/(1 + \beta w^2) + \epsilon_2 w^2$, where $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$, in Figure 2a. This function intersects the horizontal line $f = \epsilon_1$ at $w^* \approx \pm 5.62$. If $\nu$ is a small positive scalar, then we can find values $w_L > 0$ and $w_U > 0$ such that for all w with $|w| \in (w_L, w_U)$, $\epsilon_1 - \nu \le f(w) \le \epsilon_1 + \nu$. This is the penalty corresponding to a nonzero connection in the network. In this interval, almost no distinction is made between the penalty for smaller or larger weights. For example, when $\nu = 10^{-2}$, the penalty for any weight with magnitude in the interval $(w_L, w_U) = (0.95, 31.64)$ is within 10% of $\epsilon_1$. By decreasing the value of $\epsilon_2$, the interval over which the penalty value is approximately equal to $\epsilon_1$ can be made wider, as shown in Figure 2c, where $\epsilon_2 = 10^{-6}$. A weight is prevented from taking a value that is too large, since the quadratic term becomes dominant for values of w greater than $w_U$. The value of
$w_L$ can be made arbitrarily small by increasing the value of the parameter $\beta$. The derivative of the function f(w) near zero is relatively large (Figs. 2b and 2d). This gives a small weight w a stronger tendency to decay to zero.

3 Experimental Results

We selected well-known problems to test the pruning algorithm described in the previous section: the contiguity problem, the 4-bit and 5-bit parity problems, and the monks problems. Since we were interested in finding the simplest network architecture that could solve these problems, all available input data were used for training and pruning. An exception was the monks problems: for the three monks problems, the division of the data sets into training and testing sets has been fixed (Thrun et al. 1991). There was one output unit in the networks.

For all problems, thresholds or biases in the hidden units were incorporated into the network by appending a 1 to each input pattern. The threshold value of the output unit was set to 0 to simplify the implementation of the training and pruning algorithms. Only two different sets of parameters were involved in the error function; they corresponded to the weights of the connections between the input layer and the hidden layer, and between the hidden layer and the output layer. During training, the gradient of the error function was computed by taking partial derivatives of the function with respect to these parameters only. During pruning, conditions were checked to determine whether a connection could be removed. There was no special condition for checking whether a hidden unit bias could be removed. Note that if, after pruning, there exists a hidden unit connected only to the output unit and to the input unit with a constant input value of 1 for all patterns, then the weights of the two arcs connected to this hidden unit determine a nonzero threshold value at the output unit.

The function $\theta(w, v)$ (cf. equation 2.18) was minimized by a variant of the quasi-Newton algorithm, the BFGS method. It has been shown that quasi-Newton algorithms, such as the BFGS method, can speed up the training process significantly (Watrous 1987). At each iteration of the algorithm, a positive definite matrix that is an approximation of the inverse of the Hessian of the function to be minimized is computed. The positive definiteness of this matrix ensures that a descent direction can be generated. Given a descent direction, a step size is computed via an inexact line search algorithm. Details of this algorithm can be found in Dennis and Schnabel (1983).

For all the experiments reported here, we used the same values for the parameters involved in the function $\theta(w, v)$: $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$. During the training process, the BFGS algorithm was terminated when the following condition was satisfied: $\|\nabla\theta(w, v)\| \le 10^{-8} \max\{1, \|(w, v)\|\}$, where $\|\nabla\theta(w, v)\|$ is the 2-norm of the gradient of $\theta(w, v)$. At this point the
accuracy of the network was checked. The value of $\eta_1$ (cf. equation 2.14) was 0.35. If the classification rate of the network met the prespecified required accuracy, then the pruning process was initiated. The required accuracy of the original fully connected network was 100% for all problems, except for the monks 3 problem, for which the acceptable accuracy rate was 95%. We did not attempt to get a 100% accuracy rate for this problem due to the presence of six incorrect classifications (out of 122 patterns) in the training set.

The parameter $\eta_2$ used during the pruning process was set at 0.10. Typically, at the start of the pruning process, many weights would be eliminated because they satisfied the condition in equation 2.15. Subsequently, weights were removed one at a time based on the magnitude of the product $|v_{pm} w_{m\ell}|$. Before the actual pruning and retraining was done, the current network was saved. Should pruning additional weight(s) and retraining the network fail to give a new set of weights that met the prespecified accuracy requirement, the saved network would be considered the smallest network for this particular run of the experiment.

3.1 The Contiguity Problem. This problem has been the subject of several studies on network pruning (Thodberg 1991; Gorodkin et al. 1993). The input patterns were binary strings of length 10. Of all possible 1024 patterns, only those with two or three clumps were used. A clump was defined as a block of 1's in the string separated by at least one 0. The target value $t^i$ for each of the 360 patterns with two clumps was 0, and the target value for each of the 432 patterns with three clumps was 1. A sparse symmetric network architecture with nine hidden units and a total of 37 weights has been suggested for this problem (Gorodkin et al. 1993).

Two experiments were carried out using this data set as input. In the first experiment, 50 neural networks having six hidden units were used as the starting networks. In the second experiment, the same number of networks, each with nine hidden units, were used. The accuracy rate of all these networks, which were trained starting from different random initial weights, was 100%. The number of connections and hidden units left in the networks after pruning are tabulated in Table 1.

Thodberg (1991) trained 38 fully connected networks with 15 hidden units using 100 patterns as training data. The trained networks were then pruned by a brute-force scheme, and the average number of connections left in the pruned networks was 32; this number, however, did not include the biases in the networks. Employing the optimal brain damage pruning scheme, Gorodkin et al. (1993) obtained pruned networks with numbers of weights ranging from 36 to more than 90. These networks were trained on 140 examples. In contrast, the maximum number of connections left in our networks was 33, and only one of the pruned networks had this many connections. The results of the experiments with different initial numbers of hidden units in the networks show the robustness of our pruning algorithm.
Table 1: Average Number of Connections and Hidden Units in 50 Networks Trained to Solve the Contiguity Problem after Pruning.

Number of starting hidden units                6             9
Number of connections before pruning          72            108
Average number of connections after pruning   28.44 (1.33)  29.18 (1.32)
Average number of hidden units after pruning  5.98 (0.14)   7.62 (0.75)

Note: Figures in parentheses indicate standard deviations.
Starting with six or nine hidden units, the networks were pruned until there were on average 29 connections left.

3.2 The Parity Problem. The parity problem is a well-known difficult problem that has often been used for testing the performance of a neural network training algorithm. The input set consists of $2^N$ patterns in N-dimensional space, and each pattern is an N-bit binary vector. The target value $t^i$ is equal to 1 if the number of 1's in the pattern is odd, and it is 0 otherwise. We used the 4-bit and 5-bit parity problems to test our pruning algorithm.

For the 4-bit parity problem, three sets of experiments were conducted. In the first set, 50 networks, each with four hidden units, were used as the starting networks. In the second and third sets of experiments, the initial numbers of hidden units were five and six. Similarly, three sets of experiments were conducted for the 5-bit parity problem; the initial number of hidden units in the 50 starting networks for each set of experiments was five, six, and seven, respectively. The average numbers of weights and hidden units of the networks after pruning are shown in Table 2.

Many experiments using backpropagation networks assume that N hidden units are needed to solve the N-bit parity problem. Previously published neural network pruning algorithms have also failed to obtain networks with fewer than N hidden units (Chung and Lee 1992; Hanson and Pratt 1989). In fact, three hidden units are sufficient for both the 4-bit and 5-bit parity problems.

3.3 The Monks Problems. The monks problems (Thrun et al. 1991) are an artificial robot domain, in which robots are described by six different attributes:

A1: head shape ∈ {round, square, octagon};
A2: body shape ∈ {round, square, octagon};
Table 2: Average Number of Connections and Hidden Units Obtained from 50 Networks for the 4-bit and 5-bit Parity Problems after Pruning.

Parity 4
Number of starting hidden units                      4             5             6
Number of connections before pruning                24            30            36
Average number of connections after pruning     17.40 (1.55)  18.12 (1.60)  18.76 (2.30)
Average number of hidden units after pruning     3.42 (0.50)   3.64 (0.66)   3.98 (0.77)

Parity 5
Number of starting hidden units                      5             6             7
Number of connections before pruning                35            42            49
Average number of connections after pruning     21.80 (3.14)  21.84 (3.02)  22.34 (3.17)
Average number of hidden units after pruning     3.58 (0.81)   3.68 (0.84)   3.82 (0.83)
3.3 The Monks Problems. The monks problems (Thrun et al. 1991) are set in an artificial robot domain, in which robots are described by six attributes:

A1: head shape ∈ {round, square, octagon};
A2: body shape ∈ {round, square, octagon};
A3: is smiling ∈ {yes, no};
A4: holding ∈ {sword, balloon, flag};
A5: jacket color ∈ {red, yellow, green, blue};
A6: has tie ∈ {yes, no}.

The learning tasks of the three monks problems are binary classifications, each given by the following logical description of a class:

• Monks 1 problem: (head shape = body shape) or (jacket color = red). From the 432 possible samples, 124 were randomly selected for the training set.

• Monks 2 problem: Exactly two of the six attributes have their first value. From the 432 samples, 169 were selected randomly for the training set.

• Monks 3 problem: (Jacket color is green and holding a sword) or (jacket color is not blue and body shape is not octagon). From the 432 samples, 122 were selected randomly for training, and among them there were 5% misclassifications (i.e., noise in the training set).
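The three target concepts can be transcribed directly as predicates over the six attributes. The sketch below (our encoding, not from the paper) also shows how the 3 × 3 × 2 × 3 × 4 × 2 = 432 samples arise:

```python
from itertools import product

HEAD_SHAPE = BODY_SHAPE = ("round", "square", "octagon")
SMILING = HAS_TIE = ("yes", "no")
HOLDING = ("sword", "balloon", "flag")
JACKET = ("red", "yellow", "green", "blue")
DOMAINS = (HEAD_SHAPE, BODY_SHAPE, SMILING, HOLDING, JACKET, HAS_TIE)

def monks1(a):
    return a[0] == a[1] or a[4] == "red"

def monks2(a):
    # exactly two of the six attributes take their first value
    return sum(x == d[0] for x, d in zip(a, DOMAINS)) == 2

def monks3(a):
    return (a[4] == "green" and a[3] == "sword") or \
           (a[4] != "blue" and a[1] != "octagon")

samples = list(product(*DOMAINS))
assert len(samples) == 432
```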
The testing set for the three problems consisted of all 432 samples. Two sets of experiments were conducted on the monks 1 and monks 2 problems: in the first set, 50 networks, each with three hidden units, were used as the starting networks for pruning; in the second set, an equal number of networks with five hidden units were used. These networks correctly classified all patterns in the training set. For the monks 3 problem, in addition to these two sets of experiments, 50 networks with just one hidden unit were also trained as starting networks. All starting networks for monks 3 had an accuracy rate of at least 95% on the training data. The required accuracy of the pruned networks on the training patterns was 100% for the monks 1 and monks 2 problems and 95% for the monks 3 problem.

The results of the experiments for the monks 1 and monks 2 problems are summarized in Table 3; those for the monks 3 problem are tabulated in Table 4. The P-values in these tables were computed to test whether the accuracy of the networks on the testing set increased significantly after pruning. With the exception of the networks with one hidden unit for the monks 3 problem, the P-values clearly show that pruning did increase the predictive accuracy of the networks.

There was not much difference in the average number of connections left after pruning between networks that originally had three hidden units and those with five hidden units. For the monks 3 problem, when the starting networks had only one hidden unit, the average number of connections left was 9.92, significantly fewer than the averages obtained from networks with three and five starting hidden units. Part of this difference can be accounted for by the number of hidden units still left in the pruned networks: most of the networks with three or five starting hidden units still had three hidden units after pruning, and the connections from these hidden units to the output unit added on average two more connections to the total.

The smallest pruned networks with 100% accuracy on both the training and the testing sets for the monks 1 and monks 2 problems had 11 and 15 connections, respectively. For the monks 3 problem, the pruning algorithm found a network with only 6 connections. This network was able to identify the six noise cases in the training set: its predicted outputs for these noise patterns were the opposite of the target outputs, and hence an accuracy rate of 95.08% was obtained on the training set. Since there was no noise in the testing data, the predictive accuracy rate was 100%. The network is depicted in Figure 3. For comparison, the backpropagation with weight decay and the cascade-correlation algorithms both achieved 97.2% accuracy on the testing set (Thrun et al. 1991).

Some interesting observations on the relationship between the number of hidden units and the generalization capability of a network were made by Weigend (1993). One of these observations is that large networks perform better than smaller networks, provided that the training of the networks is stopped early. To determine when training should be terminated, a second data set, the cross-validation set, is used.
Table 3: Average Number of Connections and Predictive Accuracy Obtained from 50 Networks for the Monks 1 and Monks 2 Problems after Pruning.

Monks 1
Number of starting hidden units                      3              5
Number of connections before pruning                57             95
Average accuracy on testing set (%)             99.10 (2.66)   99.52 (1.44)
Average number of connections after pruning     12.96 (1.63)   12.54 (1.05)
Average accuracy on testing set (%)             99.82 (0.74)  100.00 (0.00)
P-value                                             0.033          0.009

Monks 2
Number of starting hidden units                      3              5
Number of connections before pruning                57             95
Average accuracy on testing set (%)             98.91 (1.47)   98.13 (2.32)
Average number of connections after pruning     16.76 (1.81)   16.86 (1.40)
Average accuracy on testing set (%)             99.44 (0.78)   99.39 (0.70)
P-value                                             0.012          0.001

Note: P-values are computed for testing if the predictive accuracy rates of the networks increase significantly after pruning.
Figure 3: A network with six weights for the monks 3 problem. The number next to a connection shows the weight for that connection. The accuracy rates on the training set and the testing set are 95.08% and 100%, respectively.
Table 4: Average Number of Connections and Predictive Accuracy Obtained from 50 Networks for the Monks 3 Problem after Pruning.

Monks 3
Number of starting hidden units                      1             3             5
Number of connections before pruning                19            57            95
Average accuracy on training set (%)            95.85 (0.69)  99.02 (1.09)  99.92 (0.38)
Average accuracy on testing set (%)             96.07 (2.43)  93.43 (1.91)  92.82 (1.53)
Average number of connections after pruning      9.92 (2.06)  14.64 (2.79)  14.86 (2.19)
Average accuracy on training set (%)            95.82 (0.69)  96.00 (1.07)  96.46 (1.12)
Average accuracy on testing set (%)             96.11 (2.55)  94.12 (2.95)  93.85 (2.21)
P-value                                             0.468         0.082         0.003

Note: P-values are computed for testing if the predictive accuracy rates of the networks increase significantly after pruning.
Our results on the monks problems are mixed. For the monks 1 problem, networks with five hidden units have higher predictive accuracy than those with three hidden units. For the monks 2 problem, the reverse is true. For the monks 3 problem, where there is noise in the data, there is a trend for larger networks to have poorer predictive accuracy rates than smaller ones, even after the networks have been pruned. Note that these statistics were obtained from networks that had been trained to reach a local minimum of the augmented error function. One clear conclusion that can be drawn from the data in Tables 3 and 4 is that pruned networks have better generalization capability than fully connected networks.

4 Discussion and Final Remarks

We presented a penalty-function approach for feedforward neural network pruning. The penalty function consists of two components: the first discourages the use of unnecessary connections, and the second prevents the weights of these connections from taking excessively large values. Simple criteria for eliminating network connections are also given. The two components of the penalty function have been used individually in the past. However, by combining the penalty function with the magnitude-based weight-elimination criteria, we are able to obtain smaller networks than those reported in the literature.
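As an illustration of such a two-component penalty, consider the sketch below. This is ours, not the paper's exact formula (which is given in Section 2): the saturating first term and the quadratic second term are a common choice for this kind of penalty, and beta is a positive shape parameter introduced here for illustration only.

```python
import numpy as np

def penalty(weights, eps1, eps2, beta=10.0):
    """Generic two-component penalty: a saturating term that drives
    unnecessary weights toward zero plus a quadratic term that keeps
    the remaining weights from growing excessively large.
    Illustrative sketch; see Section 2 for the exact form used."""
    w2 = np.square(weights)
    return eps1 * np.sum(beta * w2 / (1.0 + beta * w2)) + eps2 * np.sum(w2)
```

During training, a penalty of this kind is added to the error function, so the gradient of the augmented error pushes small weights toward zero, where they can be removed by the elimination criteria.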
We have also experimented with pruning while one of the two penalty parameters (ε1 and ε2) was set to zero; the results were not as good as those reported in Section 3.

The approach described in this article works for networks with one hidden layer. However, it is possible to extend the algorithm to a network with N > 1 hidden layers. Label the input layer L0, the output layer LN+1, and the hidden layers L1, L2, . . . , LN; connections are present only between units in layers Li and Li+1. A criterion for removing a connection between unit H1 in layer Li and unit H2 in layer Li+1 can be developed to ensure that the changes in the predicted outputs are not more than η2. The effect of such a removal propagates layer by layer from H2 to the output units in LN+1; hence, the criterion will involve the weights from unit H2 to layer Li+2 and all the weights from layer Li+2 onward.

The effectiveness of the pruning algorithm using our proposed penalty function has been demonstrated by its performance on three well-known problems. Using the same set of penalty parameters, networks with few connections that solve the three test problems (the contiguity problem, the parity problems, and the monks problems) have been obtained. The small number of connections in the pruned networks allows us to develop an algorithm to extract compact rules from networks that have been trained to solve real-world classification problems (Setiono 1997).
References

Ash, T. 1989. Dynamic node creation in backpropagation networks. Connection Science 1(4), 365–375.
Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems, Vol. 1, pp. 519–526. Morgan Kaufmann, San Mateo, CA.
Chung, F. L., and Lee, T. 1992. A node pruning algorithm for backpropagation networks. Int. J. Neural Systems 3(3), 301–314.
Dennis, J. E., and Schnabel, R. B. 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, NJ.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, Vol. 2, pp. 524–532. Morgan Kaufmann, San Mateo, CA.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation 2(2), 198–209.
Gorodkin, J., Hansen, L. K., Krogh, A., and Winther, O. 1993. A quantitative study of pruning by optimal brain damage. Int. J. Neural Systems 4(2), 159–169.
Grossman, S. I. 1995. Multivariable Calculus, Linear Algebra, and Differential Equations. Saunders College Publishing, Orlando, FL.
Hanson, S. J., and Pratt, L. Y. 1989. Comparing biases for minimal network construction with back-propagation. In Advances in Neural Information Processing Systems, Vol. 1, pp. 177–185. Morgan Kaufmann, San Mateo, CA.
Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, Vol. 5, pp. 164–171. Morgan Kaufmann, San Mateo, CA.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185–234.
Karnin, E. D. 1990. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks 1(2), 239–242.
Lang, K. J., and Witbrock, M. J. 1988. Learning to tell two spirals apart. In Proc. of the 1988 Connectionist Summer School, pp. 52–59. Morgan Kaufmann, San Mateo, CA.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems, Vol. 2, pp. 598–605. Morgan Kaufmann, San Mateo, CA.
Mezard, M., and Nadal, J. P. 1989. Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A 22(12), 2191–2203.
Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, Vol. 1, pp. 107–115. Morgan Kaufmann, San Mateo, CA.
Setiono, R. 1997. Extracting rules from neural networks by pruning and hidden unit splitting. Neural Computation 9, 205–225.
Tenorio, M. F., and Lee, W. 1990. Self-organizing network for optimum supervised learning. IEEE Trans. Neural Networks 1(1), 100–110.
Thodberg, H. H. 1991. Improving generalization of neural networks through pruning. Int. J. Neural Systems 1(4), 317–326.
Thrun, S. B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Džeroski, S., Fahlman, S. E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R. S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., Van de Welde, W., Wenzel, W., Wnek, J., and Zhang, J. 1991. The MONK's problems: A performance comparison of different learning algorithms. Preprint CMU-CS-91-197, Carnegie Mellon University, Pittsburgh, PA.
van Ooyen, A., and Nienhuis, B. 1992. Improving the convergence of the backpropagation algorithm. Neural Networks 5, 465–471.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. In Proc. IEEE First International Conference on Neural Networks, pp. 619–627. IEEE Press, New York.
Weigend, A. S. 1993. On overfitting and the effective number of hidden units. In Proc. of the 1993 Connectionist Models Summer School, pp. 335–342. Lawrence Erlbaum, Hillsdale, NJ.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1990. Back-propagation, weight-elimination and time series prediction. In Proc. of the 1990 Connectionist Models Summer School, pp. 105–116. Morgan Kaufmann, San Mateo, CA.
Received November 23, 1994; accepted March 13, 1996.
Communicated by Jude Shavlik
Extracting Rules from Neural Networks by Pruning and Hidden-Unit Splitting

Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, Republic of Singapore
An algorithm for extracting rules from a standard three-layer feedforward neural network is proposed. The trained network is first pruned, not only to remove redundant connections but, more important, to detect the relevant inputs. The algorithm generates rules from the pruned network by considering only a small number of activation values at the hidden units. If the number of inputs connected to a hidden unit is sufficiently small, then rules that describe how each of its activation values is obtained can be readily generated. Otherwise the hidden unit is split and treated as a set of output units, with each output unit corresponding to an activation value; a hidden layer is inserted, and a new subnetwork is formed, trained, and pruned. This process is repeated until every hidden unit in the network has a relatively small number of input units connected to it. Examples of how the proposed algorithm works are shown using real-world data arising from molecular biology and signal processing. Our results show that for these complex problems, the algorithm can extract reasonably compact rule sets that have high predictive accuracy rates.
1 Introduction

One of the most popular applications of feedforward neural networks is distinguishing patterns in two or more disjoint sets. The neural network approach has been applied to pattern classification problems in areas as diverse as finance, engineering, and medicine. While the predictive accuracy obtained by neural networks is usually satisfactory, it is often said that a neural network is practically a black box: even for a network with only a single hidden layer, it is generally impossible to explain why a certain pattern is classified as a member of one set and another pattern as a member of another set, because of the complexity of the network.

In many applications, however, it is desirable to have rules that explicitly state under what conditions a pattern is classified as a member of one set or another. Several approaches have been developed for extracting rules
from a trained neural network. Saito and Nakano (1988) proposed a medical diagnostic expert system based on a multilayer neural network; they treated the network as a black box and used it only to observe the effects on the network output caused by changes in the inputs. Two methods for extracting rules from neural networks are described in Towell and Shavlik (1993). The first is the subset algorithm (Fu 1991), which searches for subsets of connections to a unit whose summed weight exceeds the bias of that unit. The second, the MofN algorithm, clusters the weights of a trained network into equivalence classes; the complexity of the network is reduced by eliminating unneeded clusters and by setting all weights in each remaining cluster to the average of the cluster's weights, and rules with weighted antecedents are obtained from the simplified network by translation of the hidden units and output units. Towell and Shavlik applied these two methods to knowledge-based neural networks that had been trained to recognize genes in DNA sequences; the topology and the initial weights of a knowledge-based network are determined using problem-specific a priori information.

More recently, a method that uses sampling and queries was proposed (Craven and Shavlik 1994). Instead of searching for rules in the network, the problem of rule extraction is viewed as a learning task: the target concept is the function computed by the network, and the network input features are the inputs for the learning task. Conjunctive rules are extracted from the neural network with the help of two oracles. Thrun (1995) describes a rule-extraction algorithm that analyzes the input-output behavior of a network using validity interval (VI) analysis. VI analysis divides the activation range of each network unit into intervals such that all of the network's activation values must lie within the intervals; the boundaries of these intervals are obtained by solving linear programs. Two approaches to generating rule conjectures, specific-to-general and general-to-specific, are described, and the validity of these conjectures is checked with VI analysis.

In this article, we propose an algorithm to extract rules from a pruned network. We assume the network to be a standard feedforward backpropagation network with a single hidden layer that has been trained to meet a prespecified accuracy requirement. The assumption that the network has only a single hidden layer is not restrictive: theoretical studies have shown that networks with a single hidden layer can approximate arbitrary decision boundaries (Hornik 1991), and experimental studies have also demonstrated the effectiveness of such networks (de Villiers and Barnard 1992).

The process of extracting rules from a trained network can be made much easier if the complexity of the network has first been reduced. The pruning process attempts to eliminate as many connections as possible from the network while trying to maintain the prespecified accuracy rate; it is expected that fewer connections will result in more concise rules. No initial knowledge of the problem domain is required. Relevant and irrelevant attributes
of the data are distinguished during the pruning process: those that are relevant will be kept, and the others will be discarded automatically.

A distinguishing feature of our algorithm is that the activation values at the hidden units are clustered into discrete values. When only a small number of inputs feed a hidden unit, it is not difficult to extract rules that describe how each of the discrete activation values is obtained. When many inputs are connected to a hidden unit, its discrete activation values are used as the target outputs of a new subnetwork: a new hidden layer is inserted between the inputs and this new output layer, and the new network is then trained and pruned by the same pruning technique as the original network. The process of splitting hidden units and creating new networks is repeated until each hidden unit in the network has only a small number of inputs connected to it, say, no more than five.

We describe our rule-extraction algorithm in Section 2. A network that has been trained and pruned to solve the splice-junction problem (Lapedes et al. 1989) is used to illustrate in detail how the algorithm works. Each pattern in the data set for this problem is described by a 60-nucleotide-long DNA sequence; the learning task is to recognize the type of boundary at the center of the sequence, which can be an exon-intron boundary, an intron-exon boundary, or neither. In Section 3 we present the results of applying the proposed algorithm to a second problem, the sonar target classification problem (Gorman and Sejnowski 1988). The learning task there is to distinguish between sonar returns from metal cylinders and those from cylindrically shaped rocks; each pattern in this data set is described by a set of 60 real numbers between 0 and 1. The data sets for both the splice-junction and the sonar target classification problems were obtained via ftp from the University of California–Irvine repository (Murphy and Aha 1992). Discussion of the proposed algorithm and comparison with related work are presented in Section 4. Finally, a brief conclusion is given in Section 5.
2 Extracting Rules from a Pruned Network

The algorithm for rule extraction from a pruned network consists of two main steps. The first step clusters the activation values of the hidden units into a small number of discrete values. If the number of inputs connected to a hidden unit is relatively large, the second step splits the hidden unit into a number of units, with the number of clusters found in the first step determining the number of new units. We form a new network by treating each new unit as an output unit and adding a new hidden layer; we train and prune this network and repeat the process if necessary. This process of forming a new network by treating a hidden unit as a new set of output units and adding a new hidden layer terminates when, after pruning, each hidden unit in the network has a small number of input units connected to it. The outline of the algorithm is as follows.
Rule-extraction (RX) algorithm.

1. Train and prune the neural network.
2. Discretize the activation values of the hidden units by clustering.
3. Using the discretized activation values, generate rules that describe the network outputs.
4. For each hidden unit:
   • If the number of input connections is fewer than an upper bound UC, then extract rules that describe its activation values in terms of the inputs.
   • Else form a subnetwork: (a) set the number of output units equal to the number of discrete activation values, treating each discrete activation value as a target output; (b) set the number of input units equal to the inputs connected to the hidden unit; (c) introduce a new hidden layer and apply RX to this subnetwork.
5. Generate rules that relate the inputs and the outputs by merging the rules generated in Steps 3 and 4.

In Step 3 of algorithm RX, we assume that the number of hidden units and the number of clusters in each hidden unit are relatively small. If there are H hidden units in the pruned network, and Step 2 of RX finds Ci clusters in hidden unit i, i = 1, 2, . . . , H, then there are C1 × C2 × · · · × CH possible combinations of clustered hidden-unit activation values. For each of these combinations, the network output is computed using the weights of the connections from the hidden units to the output units. Rules that describe the network outputs in terms of the clustered activation values are generated using the X2R algorithm (Liu and Tan 1995). This algorithm is designed to generate rules from small data sets with discrete attribute values. It chooses the pattern with the highest frequency, generates the shortest rule for this pattern by checking the information provided by one data attribute at a time, and repeats the process until rules that cover all the patterns are generated. The rules are then grouped according to the class labels, and redundant rules are eliminated by pruning. When there is no noise in the data, X2R generates rules with perfect accuracy on the training patterns.

If the number of inputs of a hidden unit is fewer than UC, Step 4 of the algorithm also applies X2R to generate rules that describe the discrete activation values of that hidden unit in terms of the input units. Step 5 of RX merges the two sets of rules generated by X2R: the conditions of the rules involving clustered activation values generated in Step 3 are substituted by the rules generated in Step 4.
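The recursion in Step 4 is the heart of RX. A skeleton in code (ours; train_and_prune, cluster_activations, x2r, one_hot_targets, and merge_rules are hypothetical stand-ins for the procedures described above):

```python
UC = 5  # upper bound on the number of inputs a hidden unit may have
        # before it is split (an illustrative choice)

def rx(inputs, targets):
    """Skeleton of the RX algorithm; all helpers are hypothetical."""
    net = train_and_prune(inputs, targets)                # Step 1
    clusters = {h: cluster_activations(h) for h in net.hidden_units}  # Step 2
    output_rules = x2r(clusters, net.output_weights)      # Step 3
    hidden_rules = {}
    for h in net.hidden_units:                            # Step 4
        if len(h.inputs) < UC:
            hidden_rules[h] = x2r(h.inputs, clusters[h])
        else:
            # Split: each discrete activation value becomes a target
            # output of a new subnetwork, and RX recurses on it.
            hidden_rules[h] = rx(h.inputs, one_hot_targets(clusters[h]))
    return merge_rules(output_rules, hidden_rules)        # Step 5
```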
We shall illustrate how algorithm RX works using the splice-junction domain in the next section.

2.1 The Splice-Junction Problem. We trained a network to solve the splice-junction problem, a real-world problem arising in molecular biology that has been widely used as a test data set for knowledge-based neural network training and rule-extraction algorithms (Towell and Shavlik 1993). The total number of patterns in the data set is 3175 [1], with each pattern consisting of 60 attributes. Each attribute corresponds to a DNA nucleotide and takes one of the following four values: G, T, C, or A. Each pattern is classified as IE (intron-exon boundary), EI (exon-intron boundary), or N (neither) according to the type of boundary at the center of the DNA sequence. Of the 3175 patterns, 1006 were used as training data.

The four attribute values G, T, C, and A were coded as {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, and {0, 0, 0, 1}, respectively. Including the input for bias, 241 input units were present in the original neural network. The target output for each pattern in class IE was {1, 0, 0}, in class EI {0, 1, 0}, and in class N {0, 0, 1}. Five hidden units were sufficient to give the network more than a 95% accuracy rate on the training data. We considered a pattern of type IE to be correctly classified as long as the largest activation value was found in the first output unit; similarly, patterns of type EI and N were considered to be correctly classified if the largest activation value was obtained in the second and third output unit, respectively. The transfer functions used were the hyperbolic tangent function at the hidden layer and the sigmoid function at the output layer. The cross-entropy error measure was used as the objective function during training. To encourage weight decay, a penalty term was added to this objective function; the details of this penalty term and the pruning algorithm are described in Setiono (1997).

Aiming to achieve accuracy rates comparable to those reported in Towell and Shavlik (1993), we terminated the pruning algorithm when the accuracy of the network dropped below 92%. The resulting pruned network is depicted in Figure 1. There was a large reduction in the number of network connections: of the original 1220 weights in the network, only 16 remained. Only 10 input units (including the bias input unit) remained out of the original 241 inputs; these inputs corresponded to 7 of the 60 attributes present in the original data. Of the 5 original hidden units, 2 were also removed. As we shall see in the next subsection, the number of input units and hidden units plays a crucial role in determining the complexity of the rules extracted from the network. The accuracy of the pruned network is summarized in Table 1.
[1] There are 15 more patterns in the data set that we did not use because one or more attributes has a value that is not G, T, C, or A.
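The input and target coding described above is a standard one-hot scheme and can be written down directly. A minimal sketch (ours, not from the paper):

```python
CODE = {"G": (1, 0, 0, 0), "T": (0, 1, 0, 0), "C": (0, 0, 1, 0), "A": (0, 0, 0, 1)}
TARGET = {"IE": (1, 0, 0), "EI": (0, 1, 0), "N": (0, 0, 1)}

def encode(sequence):
    """Map a 60-nucleotide string to the 240 binary inputs
    (the constant bias input, the 241st, is omitted here)."""
    return [bit for nucleotide in sequence for bit in CODE[nucleotide]]
```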
Figure 1: Pruned network for the splice-junction problem. Following convention, the original attributes have been numbered sequentially from −30 to −1 and 1 to 30. If the expression @n = X is true, then the corresponding input value is 1; otherwise it is 0.

Table 1: Accuracy of the Pruned Network on the Training and Testing Data Sets.

                  Training data                   Testing data
Class      Errors (total)  Accuracy (%)    Errors (total)  Accuracy (%)
IE           13 (243)         94.65           32 (765)         95.82
EI           11 (228)         95.18           42 (762)         94.49
N            53 (535)         90.09          150 (1648)        90.90
Total        77 (1006)        92.35          224 (3175)        92.94
2.2 Clustering Hidden-Unit Activations. The range of the hyperbolic tangent function is the interval [−1, 1]. If this function is used as the transfer function at the hidden layer, a hidden-unit activation value may lie anywhere within this interval. However, it is possible to replace it by a discrete value without causing too much deterioration in the accuracy of the network. This is done by the following clustering algorithm.
Hidden-Unit Activation-Values Clustering Algorithm

1. Let ε ∈ (0, 1). Let D be the number of discrete activation values in the hidden unit. Let δ1 be the activation value in the hidden unit for the first pattern in the training set. Let H(1) = δ1, count(1) = 1, and sum(1) = δ1; set D = 1.

2. For each pattern pi, i = 2, 3, . . . , k, in the training set:
   • Let δ be its activation value.
   • If there exists an index j* such that

         |δ − H(j*)| = min{ |δ − H(j)| : j = 1, 2, . . . , D }   and   |δ − H(j*)| ≤ ε,

     then set count(j*) := count(j*) + 1 and sum(j*) := sum(j*) + δ;
     otherwise set D := D + 1, H(D) = δ, count(D) = 1, and sum(D) = δ.

3. Replace each H(j) by the average of all activation values that have been clustered into cluster j: H(j) := sum(j)/count(j), j = 1, 2, . . . , D.

Step 1 initializes the algorithm: the first activation value forms the first cluster. Step 2 checks whether each subsequent activation value can be placed in one of the existing clusters. The distance between the activation value under consideration and its nearest cluster, |δ − H(j*)|, is computed; if this distance is at most ε, the activation value is placed in cluster j*, and otherwise it forms a new cluster.

Once the discrete values of all hidden units have been obtained, the accuracy of the network is checked again with the activation values at the hidden units replaced by their discretized values. An activation value δ is replaced by H(j*), where the index j* is chosen such that j* = argmin_j |δ − H(j)|. If the accuracy of the network falls below the required accuracy, then ε must be decreased and the clustering algorithm run again. For a sufficiently small ε, it is always possible to match the accuracy of the network with continuous activation values, although the resulting number of distinct discrete activations can be impractically large. The best ε value is one that gives a high accuracy rate after clustering while generating as few clusters as possible. A simple way of obtaining a good value for ε is by searching in the interval (0, 1): the number of clusters and the accuracy of the network can be checked for the values ε = iξ, i = 1, 2, . . . , where ξ is a small positive scalar such as 0.10. Note also that the value of ε need not be the same for all hidden units.

For the pruned network depicted in Figure 1, we found that the value ε = 0.6 worked well for all three hidden units.
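A direct transcription of this clustering procedure in Python (a sketch; the names are ours):

```python
def cluster_activations(activations, eps):
    """One-pass clustering of a hidden unit's activation values.
    H holds the value that opened each cluster during the pass and is
    replaced by the cluster averages at the end (Step 3)."""
    H, count, total = [], [], []
    for delta in activations:
        j = min(range(len(H)), key=lambda i: abs(delta - H[i])) if H else None
        if j is not None and abs(delta - H[j]) <= eps:
            count[j] += 1              # absorb delta into its nearest cluster
            total[j] += delta
        else:                          # nearest cluster too far away: open a new one
            H.append(delta)
            count.append(1)
            total.append(delta)
    return [t / c for t, c in zip(total, count)]

def discretize(delta, centers):
    """Replace an activation value by its nearest cluster center H(j*)."""
    return min(centers, key=lambda h: abs(delta - h))
```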
Table 2: Accuracy of the Pruned Network on the Training and Testing Data Sets with Discrete Activation Values at the Hidden Units.

                  Training data                   Testing data
Class      Errors (total)  Accuracy (%)    Errors (total)  Accuracy (%)
IE           16 (243)         93.42           40 (765)         94.77
EI            8 (228)         96.49           30 (762)         96.06
N            52 (535)         90.28          142 (1648)        91.38
Total        76 (1006)        92.45          212 (3175)        93.32
The results of the clustering algorithm were as follows:

1. Hidden unit 1: There were two discrete values, −0.04 and −0.99. Of the 1006 training data, 573 patterns had the first value, and 433 patterns had the second value.
2. Hidden unit 2: There were three discrete values: 0.96, −0.10, and −0.88. The distribution of the training data was 760, 72, and 174 patterns, respectively.
3. Hidden unit 3: There were two discrete values, 0.96 and 0. Of the 1006 training data, 520 patterns had the first value, and 486 patterns had the second value.

The accuracy of the network with these discrete activation values is summarized in Table 2. With ε = 0.6, the accuracy was not the same as that achieved by the original network; in fact, it was slightly higher on both the training data and the testing data. It appears that a much smaller number of possible values for the hidden-unit activations reduces overfitting of the data. With a sufficiently small value for ε, it is always possible to maintain the accuracy of a network after its activation values have been discretized; however, there is no guarantee that the accuracy will increase in general.

Two discrete values at hidden unit 1, three at hidden unit 2, and two at hidden unit 3 produce 12 possible outputs for the network. These outputs are listed in Table 3. Let the notation α(i, j) denote hidden unit i taking its jth activation value. X2R generated the following rules that classify each of the predicted output classes:

• If α(1, 1) and α(2, 1) and α(3, 1), then output = IE.
• Else if α(2, 3), then output = EI.
• Else if α(1, 1) and α(2, 2), then output = EI.
• Else if α(1, 2) and α(2, 2) and α(3, 1), then output = EI.
Table 3: Output of the Network with Discrete Activation Values at the Hidden Units.

  Hidden unit activations         Predicted output
    1       2       3           1       2       3       Classification
  −0.04    0.96    0.96       0.44    0.01    0.20           IE
  −0.04    0.96    0.00       0.44    0.01    0.99           N
  −0.04   −0.10    0.96       0.44    0.62    0.00           EI
  −0.04   −0.10    0.00       0.44    0.62    0.42           EI
  −0.04   −0.88    0.96       0.44    0.99    0.00           EI
  −0.04   −0.88    0.00       0.44    0.99    0.02           EI
  −0.99    0.96    0.96       0.00    0.01    0.91           N
  −0.99    0.96    0.00       0.00    0.01    1.00           N
  −0.99   −0.10    0.96       0.00    0.62    0.06           EI
  −0.99   −0.10    0.00       0.00    0.62    0.97           N
  −0.99   −0.88    0.96       0.00    0.99    0.00           EI
  −0.99   −0.88    0.00       0.00    0.99    0.43           EI
• Default output: N.

Hidden unit 1 has only two inputs, one of which is the input for bias, and hidden unit 3 has only one input. Very simple rules in terms of the original attributes that describe the activation values of these hidden units were generated by X2R:

• If @-1=G, then α(1, 1); else α(1, 2).
• If @-2=A, then α(3, 1); else α(3, 2).

If the expression @n=X is true, the corresponding input value is 1; otherwise it is 0. The seven input units feeding the second hidden unit can generate 64 possible instances: the three inputs coding attribute @5 (= T, C, or A) can produce 4 different combinations, (0, 0, 0), (0, 0, 1), (0, 1, 0), or (1, 0, 0), while the other 4 inputs can generate 16 different combinations. Rules that define how each of the three activation values of this hidden unit is obtained are not trivial to extract from the network. How this difficulty can be overcome is described in the next subsection.
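Combining the output-level rules with the rules for hidden units 1 and 3 gives an almost complete classifier; only the cluster index of hidden unit 2 (written alpha2 below) remains unexplained at this point. A sketch of this partial classifier, our transcription of the rules above:

```python
def classify(window, alpha2):
    """window maps attribute positions to nucleotides, e.g. {-2: "A", -1: "G", ...};
    alpha2 is the cluster index (1, 2, or 3) of hidden unit 2's activation value."""
    a1 = 1 if window[-1] == "G" else 2   # rule for hidden unit 1
    a3 = 1 if window[-2] == "A" else 2   # rule for hidden unit 3
    if a1 == 1 and alpha2 == 1 and a3 == 1:
        return "IE"
    if alpha2 == 3 or (a1 == 1 and alpha2 == 2) \
            or (a1 == 2 and alpha2 == 2 and a3 == 1):
        return "EI"
    return "N"                           # default output
```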
2.3 Hidden-Unit Splitting and Creation of a Subnetwork. If we set the value of UC in algorithm RX to be less than 7, then in order to extract rules that describe how the three activation values of the second hidden unit are obtained, a subnetwork will be created.

Since seven inputs determined the three activation values, the new network had the same set of inputs. The number of output units corresponded to the number of activation values; in this case, three output units were needed. Each of the 760 patterns whose activation value was 0.96 was assigned a target output of {1, 0, 0}; patterns with activation values of −0.10 and −0.88 were assigned target outputs of {0, 1, 0} and {0, 0, 1}, respectively. In order to extract rules with the same accuracy rate as the network accuracy summarized in Table 2, the new network was trained to achieve a 100% accuracy rate. Five hidden units were sufficient to give the network this rate. The pruning process was terminated as soon as the accuracy dropped below 100%.

When only 7 of the original 240 inputs were selected, many patterns had duplicates. To reduce the time needed to train and prune the new network, all duplicates were removed, and the 61 unique patterns left were used as training data. Since the network was trained to achieve a 100% accuracy rate, correct predictions on these 61 patterns guaranteed that the same rate of accuracy was obtained on all 1006 patterns. The pruned network is depicted in Figure 2. Three of the five hidden units remained after pruning, and of the original 55 connections in the fully connected network, 16 were still present. The most interesting aspect of the pruned network was that the activation values of the first hidden unit were determined solely by four inputs, while those of the second hidden unit were determined by the remaining three inputs.

The activation values in the three hidden units were clustered. The value ε = 0.2 was found to be small enough for the network with discrete activation values to maintain a 100% accuracy rate and large enough that there were not too many different discrete values. The results were as follows:

1. Hidden unit 1: There were three discrete values: 0.98, 0.00, and −0.93. Of the original 1006 (61 training) patterns, 593 (39) patterns had the first value, 227 (15) the second value, and 186 (7) the third value.
2. Hidden unit 2: There were three discrete values: 0.88, 0.33, and −0.99. The distribution of the patterns was 122 (8), 174 (8), and 710 (45), respectively.
3. Hidden unit 3: There was only one discrete value: 0.98.

Since there were three different activation values at hidden units 1 and 2 and only one value at hidden unit 3, a total of nine possible outputs for the network were possible. These possible outputs are summarized in Table 4. In a similar fashion as before, let β(i, j) denote hidden unit i taking its jth activation value. The following rules that classify each of the predicted output classes were generated by X2R from the data in Table 4:

• If β(2, 3), then α(2, 1).
• Else if β(1, 1) and β(2, 2), then α(2, 1).
Figure 2: Pruned network for the second hidden unit of the original network depicted in Figure 1.
Table 4: Output of the Network with Discrete Activation Values at the Hidden Units for the Pruned Network in Figure 2.

  Hidden unit activations         Predicted output
    1       2       3           1       2       3       Classification
   0.98    0.88    0.98       0.01    0.66    0.00        α(2, 2)
   0.98    0.33    0.98       0.99    0.21    0.00        α(2, 1)
   0.98   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
   0.00    0.88    0.98       0.00    0.66    0.98        α(2, 3)
   0.00    0.33    0.98       0.00    0.21    0.02        α(2, 2)
   0.00   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
  −0.93    0.88    0.98       0.00    0.66    1.00        α(2, 3)
  −0.93    0.33    0.98       0.00    0.21    1.00        α(2, 3)
  −0.93   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
• Else if β(1, 1) and β(2, 1), then α(2, 2).
• Else if β(1, 2) and β(2, 2), then α(2, 2).
• Default output: α(2, 3).

X2R also extracted rules, in terms of the original attributes, describing how each of the six different activation values of the two hidden units was obtained. These rules are given in Appendix A as level 3 rules. The complete rules generated for the splice-junction domain after the merging of rules in Step 5 of the RX algorithm are also given in Appendix A. Since the subnetwork created for the second hidden unit of the original network had been trained to achieve a 100% accuracy rate, and the value of ε used to cluster the activation values was also chosen to retain this level of accuracy, the accuracy rates of the rules were exactly the same as those of the network with discrete hidden-unit activation values listed in Table 2: 92.45% on the training data set and 93.32% on the test data set. The results of our experiment suggest that the extracted rules are able to mimic the network from which they were extracted perfectly. In general, if there is a sufficient number of discretized hidden-unit activation values and the subnetworks are trained and pruned to achieve a 100% accuracy rate, then the rules extracted will have the same accuracy rate as the network on the training data.

It is possible to mimic the behavior of the pruned network with discrete activation values even on future patterns not in the training data set. Let S be the set of all patterns that can be generated by the inputs connected to a hidden unit. When the discrete activation values in the hidden unit are determined by only a subset of S, it is still possible to extract rules that will cover all possible instances that may be encountered in the future: for each pattern not represented in the training data, its continuous activation values are computed using the weights of the pruned network, and each of these activation values is assigned the nearest discrete value in the hidden unit. Once the complete rules have been obtained, the network topology, the weights on the network connections, and the discrete hidden-unit activation values need not be retained. For our experiment on the splice-junction domain, we found that rules covering all possible instances were generated without having to compute the activation values of patterns not already present in the training data. This was due to the relatively large number of patterns used during training and the small number of inputs found relevant for determining the hidden-unit activation values.

3 Experiment on a Data Set with Continuous Attributes

In the previous section, we gave a detailed description of how rules can be extracted from a data set with only nominal attributes. In this section we illustrate how rules can also be extracted from a data set having continuous attributes.
Table 5: Relevant Attributes Found by ChiMerge and Their Subintervals.

Attribute   Subintervals
A35         [0, 0.1300), [0.1300, 1]
A36         [0, 0.5070), [0.5070, 1]
A44         [0, 0.4280), [0.4280, 0.7760), [0.7760, 1]
A45         [0, 0.2810), [0.2810, 1]
A47         [0, 0.0610), [0.0610, 0.0910), [0.0910, 0.1530), [0.1530, 1]
A48         [0, 0.0762), [0.0762, 1]
A49         [0, 0.0453), [0.0453, 1]
A51         [0, 0.0215), [0.0215, 1]
A52         [0, 0.0054), [0.0054, 1]
A54         [0, 0.0226), [0.0226, 1]
A55         [0, 0.0047), [0.0047, 0.0057), [0.0057, 0.0064), [0.0064, 0.0127), [0.0127, 1]
The sonar-returns classification problem (Gorman and Sejnowski 1988) was chosen for this purpose. The data set consisted of 208 sonar returns, each represented by 60 real numbers between 0.0 and 1.0. The task was to distinguish between returns from a metal cylinder and those from a cylindrically shaped rock. Returns from metal cylinders were obtained at aspect angles spanning 90 degrees, and those from rocks at aspect angles spanning 180 degrees.

Gorman and Sejnowski (1988) used three-layer feedforward neural networks with varying numbers of hidden units to solve this problem and reported two series of experiments: an aspect-angle-independent series and an aspect-angle-dependent series. In the aspect-angle-independent series, the training and testing patterns were selected randomly from the total set of 208 patterns; in the aspect-angle-dependent series, the training and testing sets were carefully selected to contain patterns from all available aspect angles. It is not surprising that the performance of networks in the aspect-angle-dependent series of experiments was better than in the aspect-angle-independent series.

We chose the aspect-angle-dependent data set to test the rule-extraction algorithm. The training set consisted of 49 sonar returns from metal cylinders and 55 sonar returns from rocks; the testing set consisted of 62 sonar returns from metal cylinders and 42 sonar returns from rocks.

In order to facilitate rule extraction, the numeric attributes of the data were first converted into discrete attributes. This was done by a modified version of the ChiMerge algorithm (Kerber 1992). The training patterns were first sorted according to the value of the attribute being discretized, and initial subintervals were formed by placing each unique value of the attribute in its own subinterval. The χ² value of each pair of adjacent subintervals was then computed, and the pair of adjacent subintervals with the lowest χ² value was merged.
In Kerber's original algorithm, merging continues until all pairs of subintervals have χ² values exceeding a user-defined threshold. Instead of setting a fixed threshold as a stopping condition, we continued merging as long as doing so introduced no inconsistency in the discretized data; by inconsistency, we mean two or more patterns from different classes being assigned identical discretized values. Using this strategy, we found that the values of many attributes were merged into just one subinterval. Attributes whose values had been merged into one subinterval could be removed without introducing any inconsistency in the discretized data set. Let us denote the original 60 attributes by A1, A2, . . . , A60. After discretization, eleven of these attributes were discretized into two or more discrete values; these attributes and the subintervals found by the ChiMerge algorithm are listed in Table 5.

The thermometer coding scheme (Smith 1993) was used for the discretized attributes. Using this coding scheme, 28 binary inputs were needed; let us denote these inputs by I1, I2, . . . , I28. All patterns with attribute A35 less than 0.1300 were coded by I1 = 0, I2 = 1, while those with attribute value greater than or equal to 0.1300 were coded by I1 = 1, I2 = 1. The other 10 attributes were coded in similar fashion.

A network with six hidden units and two output units was trained using the 104 binarized-feature training patterns. One additional input for the hidden-unit thresholds was added to the network, giving a total of 29 inputs. After training was completed, network connections were removed by pruning. The pruning process was continued until the accuracy of the network dropped below 90%, and the smallest network with an accuracy of more than 90% was saved for rule extraction. Two hidden units and 15 connections were present in the pruned network: ten inputs (I1, I3, I8, I11, I12, I14, I20, I22, I25, and I26) were connected to the first hidden unit, only one input (I17) was connected to the second hidden unit, and each of the two hidden units was connected to both output units.

The hidden-unit activation-values clustering algorithm found three discrete activation values, −0.95, 0.94, and −0.04, at the first hidden unit; the numbers of training patterns having these activation values were 58, 45, and 1, respectively. At the second hidden unit there was only one value, −1. The overall accuracy of the network was 91.35%: all 49 metal-cylinder returns in the training data were correctly classified, and 9 rock returns were incorrectly classified as metal-cylinder returns.

By computing the predicted outputs of the network with discretized hidden-unit activation values, we found that as long as the activation value of the first hidden unit equals −0.95, a pattern is classified as a sonar return from a metal cylinder. Because ten inputs were connected to this hidden unit, a new network was formed to allow us to extract rules. It was sufficient to distinguish between activation values equal to −0.95 and those not equal to −0.95 (that is, 0.94 and −0.04); hence the number of output units in the new network was two.
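A minimal sketch (ours) of the thermometer coding for a single attribute; with the subintervals of Table 5, the eleven attributes together yield the 28 binary inputs mentioned above:

```python
def thermometer(value, cuts):
    """Thermometer-code a continuous value against its ascending cut points.
    For A35 (cuts=[0.1300]): value < 0.13 -> [0, 1]; value >= 0.13 -> [1, 1].
    An attribute with k subintervals (k - 1 cuts) yields k bits."""
    return [1 if value >= c else 0 for c in reversed(cuts)] + [1]
```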
Figure 3: Pruned network trained to distinguish between patterns having discretized activation values equal to −0.95 and those not equal to −0.95.
Samples with activation values equal to −0.95 were assigned target values of {1, 0}; all other patterns were assigned target values of {0, 1}. The number of hidden units was six. The pruned network is depicted in Figure 3. Note that this network's accuracy rate was 100%: it correctly distinguished all 58 training patterns having activation values equal to −0.95 from those having activation values 0.94 or −0.04. In order to preserve this accuracy rate, six and three discrete activation values were required at the two hidden units left in the pruned network. At the first hidden unit, the activation values found were −1.00, −0.54, −0.12, 0.46, −0.76, and 0.11; at the second hidden unit, the values were −1.00, 1.00, and 0.00. All 18 possible combinations of these values were given as input to X2R. The target for each combination was computed using the weights on the connections from the hidden units to the output units of the network. Let β(i, j) denote hidden unit i taking its jth activation value. The following rules that describe the classification of the sonar returns were generated by X2R:

• If β(1, 1), then output = Metal.
• Else if β(2, 2), then output = Metal.
• Else if β(1, 2) and β(2, 3), then output = Metal.
• Else if β(1, 3) and β(2, 3), then output = Metal.
• Else if β(1, 5), then output = Metal.
• Default output: Rock.
The activation values of the first hidden unit were determined by five inputs: I1, I8, I12, I25, and I26. Those of the second hidden unit were determined by only four inputs: I3, I11, I14, and I22. Four activation values at the first hidden unit (−1.00, −0.54, −0.12, and −0.76) and two values at the second hidden unit (1.00 and 0.00) were involved in the rules that classified a pattern as a return from a metal cylinder. The rules that determined these activation values were as follows:

• Hidden unit 1 (initially, β(1, 1) = β(1, 2) = β(1, 3) = β(1, 5) = false):
  — If (I25 = 0) and (I26 = 1), then β(1, 1).
  — If (I8 = 1) and (I25 = 1), then β(1, 2).
  — If (I8 = 0) and (I12 = 1) and (I26 = 0), then β(1, 2).
  — If (I8 = 0) and (I12 = 1) and (I25 = 1), then β(1, 3).
  — If (I1 = 0) and (I12 = 0) and (I26 = 0), then β(1, 3).
  — If (I8 = 1) and (I26 = 0), then β(1, 5).

• Hidden unit 2 (initially, β(2, 2) = β(2, 3) = false):
  — If (I3 = 0) and (I14 = 1), then β(2, 2).
  — If (I22 = 1), then β(2, 2).
  — If (I3 = 0) and (I11 = 0) and (I14 = 0) and (I22 = 0), then β(2, 3).

With the thermometer coding scheme used to encode the original continuous data, it is easy to obtain rules in terms of the original attributes and their values. The complete set of rules generated for the sonar-returns data set is given in Appendix B. A total of eight rules that classify a return as coming from a metal cylinder were generated. Two metal returns and three rock returns satisfied the conditions of one of these eight rules; with this rule removed, the remaining seven rules correctly classified 96 of the 104 training data (92.31%) and 101 of the 104 testing data (97.12%).

4 Discussion and Comparison with Related Work

We have shown how rules can be extracted from a trained neural network without making any assumptions about the network's activations or having initial knowledge about the problem domain. If some knowledge is available, however, it can always be incorporated into the network. For example, connections in the network from inputs thought not to be relevant can be given large penalty parameters during training, while those thought to be relevant can be given zero or small penalty parameters (Setiono 1997). Our algorithm does not require a thresholded activation function to force the activation values to be zero or one (Towell and Shavlik 1993; Fu 1991), nor does it require the weights of the connections to be restricted to a certain range (Blassig 1994).
Craven and Shavlik (1994) mentioned that one of the difficulties of search-based methods such as those of Saito and Nakano (1988) and Fu (1991) is the complexity of the rules: some rules are found deep in the search space, and these search methods, having exponential complexity, may not be effective. Our algorithm, in contrast, finds intermediate rules embedded in the hidden units by training and pruning new subnetworks. The topology of any subnetwork is always the same, a three-layer feedforward one, regardless of how deep these rules are in the search space. Each hidden unit, if necessary, is split into several output units of a new subnetwork. When more than one new network is created, each can be trained independently. As the activation values of a hidden unit are normally determined by a subset of the original inputs, we can substantially speed up the training of the subnetwork by using only the relevant inputs and removing duplicate patterns. The use of a standard three-layer network is another advantage of our method; any off-the-shelf training algorithm can be used.

The accuracy and the number of rules generated by our algorithm are better than those obtained by C4.5 (Quinlan 1993), a popular machine learning algorithm that extracts rules from decision trees. For the splice-junction problem, C4.5's accuracy rate on the testing data was 92.13%, and the number of rules generated was 43; our algorithm achieved 1% higher accuracy (93.32%) with far fewer rules (10). The MofN algorithm achieved a 92.8% average accuracy rate with 20 rules and more than 100 conditions per rule; these figures were obtained from 10 repetitions of tenfold cross-validation on a set of 1000 randomly selected training patterns (Towell and Shavlik 1993). Using 30% of the 3190 samples for training, 10% for cross-validation, and 60% for testing, gradient descent symbolic rule generation was reported to achieve an accuracy of 93.25%, with three intermediate rules for the hidden units and two rules for the two output units defining the classes EI and IE (Blassig 1994). For the sonar-return classification problem, the accuracy of the rules generated by our algorithm on the testing set was 2.88% higher than that of C4.5 (97.12% versus 94.24%), and the number of rules generated by RX was only seven, three fewer than C4.5's rules.

5 Conclusion

Neural networks are often viewed as black boxes: although their predictive accuracy is high, one usually cannot understand why a particular outcome is predicted. In this article, we have attempted to open up these black boxes. Two factors make this possible. The first is a robust pruning algorithm: when redundant weights are eliminated, redundant input and hidden units are identified and removed from the network, and their removal significantly simplifies both the process of rule extraction and the extracted rules themselves. The second factor is the clustering of the hidden-unit activation values. The fact that the number of distinct activation values at
the hidden units can be made small enough enables us to extract simple rules.

An important feature of our rule-extraction algorithm is its recursive nature. When, after pruning, a hidden unit is still connected to a relatively large number of inputs, a new network is formed in order to generate rules that describe its activation values, and the rule-extraction algorithm is applied to this new network. By merging the rules extracted from the new network with the rules extracted from the original network, hierarchical rules are generated.

The viability of the proposed method has been shown on two sets of real-world data. Compact sets of rules that achieve a high accuracy rate were extracted for these data sets.

Appendix A

The following rules were generated by the algorithm for the splice-junction domain (initial values for β(i, j) and α(i, j) are false):

• Level 3 (rules that define the activation values of the subnetwork depicted in Fig. 2):
  1. If (@4=A ∧ @5=C) ∨ (@4≠A ∧ @5=G), then β(1, 2), else if (@4=A ∧ @5=G), then β(1, 3), else β(1, 1).
  2. If (@1=G ∧ @2=T ∧ @3=A), then β(2, 1), else if (@1=G ∧ @2=T ∧ @3≠A), then β(2, 2), else β(2, 3).

• Level 2 (rules that define the activation values of the second hidden unit of the network depicted in Fig. 1):
  If β(2, 3) ∨ (β(1, 1) ∧ β(2, 2)), then α(2, 1), else if (β(1, 1) ∧ β(2, 1)) ∨ (β(1, 2) ∧ β(2, 2)), then α(2, 2), else α(2, 3).

• Level 1 (rules extracted from the pruned network depicted in Fig. 1):
  If (@-1=G ∧ α(2, 1) ∧ @-2=A), then IE, else if α(2, 3) ∨ (@-1=G ∧ α(2, 2)) ∨ (@-1≠G ∧ α(2, 2) ∧ @-2=A), then EI, else N.

On removal of the intermediate rules in levels 2 and 3, we obtain the following equivalent set of rules:

IE :- @-2 'AGH----'.
IE :- @-2 'AG-V---'.
IE :- @-2 'AGGTBAT'.
IE :- @-2 ’AGGTBAA’.
IE :- @-2 ’AGGTBBH’.
EI :- @-2 ’--GT-AG’.
EI :- @-2 ’--GTABG’.
EI :- @-2 ’--GTAAC’.
EI :- @-2 ’-GGTAAW’.
EI :- @-2 ’-GGTABH’.
EI :- @-2 ’-GGTBBG’. (∗)
EI :- @-2 ’-GGTBAC’. (∗)
EI :- @-2 ’AHGTAAW’. (∗)
EI :- @-2 ’AHGTABH’. (∗)
EI :- @-2 ’AHGTBBG’. (∗)
EI :- @-2 ’AHGTBAC’. (∗)
N.

Table A.1: Accuracy of the Extracted Rules on the Splice-Junction Training and Testing Data Sets.

                 Training data                      Testing data
Class    Errors (total)   Accuracy (%)     Errors (total)   Accuracy (%)
IE       17 (243)         93.00            48 (765)         93.73
EI       10 (228)         95.61            40 (762)         94.75
N        49 (535)         90.84            134 (1648)       91.50
Total    76 (1006)        92.45            222 (3175)       93.01
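Rules in this form are directly executable. The following sketch is our illustration rather than part of the algorithm’s original implementation: it assumes a seven-nucleotide window spanning positions −2 through 5 and applies pattern rules of the kind listed above, using the wildcard conventions explained in the next paragraph.

# A sketch (ours, with assumed rule sets) of how the extracted pattern rules
# could be applied. A window is the 7 nucleotides at positions -2, -1, 1, 2,
# 3, 4, 5 around a candidate splice junction, matching the reading of the
# 'AGH----' rule given in the text.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "W": "AT", "H": "ACT", "V": "ACG", "-": "ACGT",
}

def matches(pattern, window):
    """True if every window position satisfies the pattern's wildcard."""
    return all(base in CODES[p] for p, base in zip(pattern, window))

def classify(window, ie_rules, ei_rules):
    """Apply IE rules first, then EI rules, defaulting to N."""
    if any(matches(p, window) for p in ie_rules):
        return "IE"
    if any(matches(p, window) for p in ei_rules):
        return "EI"
    return "N"

# The first IE rule fires on A at -2, G at -1, and H (A, C, or T) at 1.
print(classify("AGTACGT", ["AGH----"], ["--GT-AG"]))  # prints IE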
An extended version of standard Prolog notation has been used to express the rules concisely. For example, the first rule indicates that an A in position −2, a G in position −1, and an H in position 1 will result in the classification of the boundary type as an intron-exon. Following convention, the letter B means C or G or T; W means A or T; H means A or C or T; and V means A or C or G. The character ‘-’ indicates any one of the four nucleotides A, C, G, or T. Each rule marked by (∗) classifies no more than three patterns out of the 1006 patterns in the training set. The accuracy rates in Table A.1 were obtained using the 10 unmarked rules for classification.

Appendix B

The following rules were generated by the algorithm for the sonar-returns classification problem (original attributes are assumed to have been labeled
Table B.1: Accuracy of the Extracted Rules on the Sonar-Returns Training and Testing Data Sets.

                 Training data                      Testing data
Class    Errors (total)   Accuracy (%)     Errors (total)   Accuracy (%)
Metal    2 (49)           95.92            0 (62)           100.00
Rock     6 (55)           89.09            3 (42)           92.86
Total    8 (104)          92.31            3 (104)          97.12
A1, A2, . . . , A60):

Metal :- A36 < 0.5070 and A48 >= 0.0762.
Metal :- A54 >= 0.0226.
Metal :- A45 >= 0.2810 and A55 < 0.0057.
Metal :- A55 < 0.0064 and A55 >= 0.0057.
Metal :- A36 < 0.5070 and A45 < 0.2810 and A47 < 0.0910 and A47 >= 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 < 0.0057.
Metal :- A35 < 0.1300 and A36 < 0.5070 and A47 < 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 < 0.0057.
Metal :- A36 < 0.5070 and A45 >= 0.2810 and A47 < 0.0910 and A48 < 0.0762 and A54 < 0.0226 and A55 >= 0.0064. (∗)
Metal :- A36 < 0.5070 and A45 < 0.2810 and A47 < 0.0910 and A47 >= 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 >= 0.0064. (∗∗)
Rock.
The rule marked (∗) does not classify any pattern in the training set. The conditions of the rule marked (∗∗) are satisfied by two metal returns and three rock returns in the training set. The accuracy rates in Table B.1 were obtained using the six unmarked rules.

Acknowledgments

I thank my colleague Huan Liu for providing the results of C4.5 and the two anonymous reviewers for their comments and suggestions, which greatly improved the presentation of this article.
References

Blassig, R. 1994. GDS: Gradient descent generation of symbolic classification rules. In Advances in Neural Information Processing Systems, Vol. 6, pp. 1093–1100. Morgan Kaufmann, San Mateo, CA.
Craven, M. W., and Shavlik, J. W. 1994. Using sampling and queries to extract rules from trained neural networks. In Proc. of the Eleventh International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA.
de Villiers, J., and Barnard, E. 1992. Backpropagation neural nets with one and two hidden layers. IEEE Trans. on Neural Networks 4(1), 136–141.
Fu, L. 1991. Rule learning by searching on adapted nets. In Proc. of the Ninth National Conference on Artificial Intelligence, pp. 590–595. AAAI Press/MIT Press, Menlo Park, CA.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75–89.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward neural networks. Neural Networks 4, 251–257.
Kerber, R. 1992. Chi-merge: Discretization of numeric attributes. In Proc. of the Ninth National Conference on Artificial Intelligence, pp. 123–128. AAAI Press/MIT Press, Menlo Park, CA.
Lapedes, A., Barnes, C., Burks, C., Farber, R., and Sirotkin, K. 1989. Application of neural networks and other machine learning algorithms to DNA sequence analysis. In Computers and DNA, pp. 157–182. Addison-Wesley, Redwood City, CA.
Liu, H., and Tan, S. T. 1995. X2R: A fast rule generator. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 1631–1635. IEEE Press, New York.
Murphy, P. M., and Aha, D. W. 1992. UCI repository of machine learning databases. Machine-readable data repository, University of California, Irvine, CA.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Saito, K., and Nakano, R. 1988. Medical diagnosis expert system based on PDP model. In Proc. IEEE International Conference on Neural Networks, pp. 1255–1262. IEEE Press, New York.
Setiono, R. 1997. A penalty-function approach for pruning feedforward neural networks. Neural Computation 9, 185–204.
Smith, M. 1993. Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York.
Thrun, S. 1995. Extracting rules from artificial neural networks with distributed representations. In Advances in Neural Information Processing Systems, Vol. 7. MIT Press, Cambridge, MA.
Towell, G. G., and Shavlik, J. W. 1993. Extracting refined rules from knowledge-based neural networks. Machine Learning 13(1), 71–101.

Received November 23, 1994; accepted March 13, 1996.
REVIEW
Communicated by Geoffrey Hinton
Probabilistic Independence Networks for Hidden Markov Probability Models Padhraic Smyth Department of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425 USA and Jet Propulsion Laboratory 525-3660, California Institute of Technology, Pasadena, CA 91109 USA
David Heckerman Microsoft Research, Redmond, WA 98052-6399 USA
Michael I. Jordan Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Graphical techniques for modeling the dependencies of random variables have been explored in a variety of different areas, including statistics, statistical physics, artificial intelligence, speech recognition, image processing, and genetics. Formalisms for manipulating these models have been developed relatively independently in these research communities. In this paper we explore hidden Markov models (HMMs) and related structures within the general framework of probabilistic independence networks (PINs). The paper presents a self-contained review of the basic principles of PINs. It is shown that the well-known forward-backward (F-B) and Viterbi algorithms for HMMs are special cases of more general inference algorithms for arbitrary PINs. Furthermore, the existence of inference and estimation algorithms for more general graphical models provides a set of analysis tools for HMM practitioners who wish to explore a richer class of HMM structures. Examples of relatively complex models to handle sensor fusion and coarticulation in speech recognition are introduced and treated within the graphical model framework to illustrate the advantages of the general approach. 1 Introduction For multivariate statistical modeling applications, such as hidden Markov modeling (HMM) for speech recognition, the identification and manipulation of relevant conditional independence assumptions can be useful for model building and analysis. There has recently been a considerable amount Neural Computation 9, 227–269 (1997)
© 1997 Massachusetts Institute of Technology
of work exploring the relationships between conditional independence in probability models and structural properties of related graphs. In particular, the separation properties of a graph can be directly related to conditional independence properties in a set of associated probability models. The key point of this article is that the analysis and manipulation of generalized HMMs (more complex HMMs than the standard first-order model) can be facilitated by exploiting the relationship between probability models and graphs. The major advantages to be gained are in two areas: • Model description. A graphical model provides a natural and intuitive medium for displaying dependencies that exist between random variables. In particular, the structure of the graphical model clarifies the conditional independencies in the associated probability models, allowing model assessment and revision. • Computational efficiency. The graphical model is a powerful basis for specifying efficient algorithms for computing quantities of interest in the probability model (e.g., calculation of the probability of observed data given the model). These inference algorithms can be specified automatically once the initial structure of the graph is determined. We will refer to both probability models and graphical models. Each consists of structure and parameters. The structure of the model consists of the specification of a set of conditional independence relations for the probability model or a set of (missing) edges in the graph for the graphical model. The parameters of both the probability and graphical models consist of the specification of the joint probability distribution: in factored form for the probability model and defined locally on the nodes of the graph for the graphical model. The inference problem is that of the calculation of posterior probabilities of variables of interest given observable data and a specification of the probabilistic model. The related task of maximum a posteriori (MAP) identification is the determination of the most likely state of a set of unobserved variables, given observed variables and the probabilistic model. The learning or estimation problem is that of determining the parameters (and possibly structure) of the probabilistic model from data. This article reviews the applicability and utility of graphical modeling to HMMs and various extensions of HMMs. Section 2 introduces the basic notation for probability models and associated graph structures. Section 3 summarizes relevant results from the literature on probabilistic independence networks (PINs), in particular, the relationships that exist between separation in a graph and conditional independence in a probability model. Section 4 interprets the standard first-order HMM in terms of PINs. In Section 5 the standard algorithm for inference in a directed PIN is discussed and applied to the standard HMM in Section 6. A result of interest is that the forward-backward (F-B) and Viterbi algorithms are shown to be special cases of this inference algorithm. Section 7 shows that the inference algo-
rithms for undirected PINs are essentially the same as those already discussed for directed PINs. Section 8 introduces more complex HMM structures for speech modeling and analyzes them using the graphical model framework. Section 9 reviews known estimation results for graphical models and discusses their potential implications for practical problems in the estimation of HMM structures, and Section 10 contains summary remarks.

2 Notation and Background

Let U = {X1 , X2 , . . . , XN } represent a set of discrete-valued random variables. For the purposes of this article we restrict our attention to discrete-valued random variables; however, many of the results stated generalize directly to continuous and mixed sets of random variables (Lauritzen and Wermuth 1989; Whittaker 1990). Let lowercase xi denote one of the values of variable Xi : the notation \sum_{x_1} is taken to mean the sum over all possible values of X1 . Let p(xi ) be shorthand for the particular probability p(Xi = xi ), whereas p(Xi ) represents the probability function for Xi (a table of values, since Xi is assumed discrete), 1 ≤ i ≤ N. The full joint distribution function is p(U) = p(X1 , X2 , . . . , XN ), and p(u) = p(x1 , x2 , . . . , xN ) denotes a particular value assignment for U. Note that this full joint distribution p(U) = p(X1 , X2 , . . . , XN ) provides all the possible information one needs to calculate any marginal or conditional probability of interest among subsets of U. If A, B, and C are disjoint sets of random variables, the conditional independence relation A ⊥ B|C is defined such that A is independent of B given C, that is, p(A, B|C) = p(A|C)p(B|C). Conditional independence is symmetric. Note also that marginal independence (no conditioning) does not in general imply conditional independence, nor does conditional independence in general imply marginal independence (Whittaker 1990). With any set of random variables U we can associate a graph G defined as G = (V, E). V denotes the set of vertices or nodes of the graph such that there is a one-to-one mapping between the nodes in the graph and the random variables, that is, V = {X1 , X2 , . . . , XN }. E denotes the set of edges, {e(i, j)}, where i and j are shorthand for the nodes Xi and Xj , 1 ≤ i, j ≤ N. Edges of the form e(i, i) are not of interest and thus are not allowed in the graphs discussed in this article. An edge may be directed or undirected. Our convention is that a directed edge e(i, j) is directed from node i to node j, in which case we sometimes say that i is a parent of its child j. An ancestor of node i is a node that has as a child either i or another ancestor of i. A subset of nodes A is an ancestral set if it contains its own ancestors. A descendant of i is either a child of i or a child of a descendant of i. Two nodes i and j are adjacent in G if E contains the undirected or directed edge e(i, j). An undirected path is a sequence of distinct nodes {1, . . . , m} such that there exists an undirected or directed edge for each pair of nodes
{l, l+1} on the path. A directed path is a sequence of distinct nodes {1, . . . , m} such that there exists a directed edge for each pair of nodes {l, l + 1} on the path. A graph is singly connected if there exists only one undirected path between any two nodes in the graph. An (un)directed cycle is a path such that the beginning and ending nodes on the (un)directed path are the same. If E contains only undirected edges, then the graph G is an undirected graph (UG). If E contains only directed edges then the graph G is a directed graph (DG). Two important classes of graphs for modeling probability distributions that we consider in this paper are UGs and acyclic directed graphs (ADGs)— directed graphs having no directed cycles. We note in passing that there exists a theory for graphical independence models involving both directed and undirected edges (chain graphs, Whittaker 1990), but these are not discussed here. For a UG G, a subset of nodes C separates two other subsets of nodes A and B if every path joining every pair of nodes i ∈ A and j ∈ B contains at least one node from C. For ADGs, analogous but somewhat more complicated separation properties exist. A graph G is complete if there are edges between all pairs of nodes. A cycle in an undirected graph is chordless if none other than successive pairs of nodes in the cycle are adjacent. An undirected graph G is triangulated if and only if the only chordless cycles in the graph contain no more than three nodes. Thus, if one can find a chordless cycle of length four or more, G is not triangulated. A clique in an undirected graph G is a subgraph of G that is complete. A clique tree for G is a tree of cliques such that there is a one-to-one correspondence between the cliques of G and the nodes of the tree. 3 Probabilistic Independence Networks We briefly review the relation between a probability model p(U) = p(X1 , . . . , XN ) and a probabilistic independence network structure G = (V, E) where the vertices V are in one-to-one correspondence with the random variables in U. (The results in this section are largely summarized versions of material in Pearl 1988 and Whittaker 1990.) A PIN structure G is a graphical statement of a set of conditional independence relations for a set of random variables U. Absence of an edge e(i, j) in G implies some independence relation between Xi and Xj . Thus, a PIN structure G is a particular way of specifying the independence relationships present in the probability model p(U). We say that G implies a set of probability models p(U), denoted as PG , that is, p(U) ∈ PG . In the reverse direction, a particular model p(U) embodies a particular set of conditional independence assumptions that may or may not be representable in a consistent graphical form. One can derive all of the conditional independence
Figure 1: An example of a UPIN structure G that captures a particular set of conditional independence relationships among the set of variables {X1 , . . . , X6 }—for example, X5 ⊥ {X1 , X2 , X4 , X6 } | {X3 }.
properties and inference algorithms of interest for U without reference to graphical models. However, as has been emphasized in the statistical and artificial intelligence literature, and as reiterated in this article in the context of HMMs, there are distinct advantages to be gained from using the graphical formalism. 3.1 Undirected Probabilistic Independence Networks (UPINs). A UPIN is composed of both a UPIN structure and UPIN parameters. A UPIN structure specifies a set of conditional independence relations for a probability model in the form of an undirected graph. UPIN parameters consist of numerical specifications of a particular probability model consistent with the UPIN structure. Terms used in the literature to describe UPINs of one form or another include Markov random fields (Isham 1981; Geman and Geman 1984), Markov networks (Pearl 1988), Boltzmann machines (Hinton and Sejnowski 1986), and log-linear models (Bishop et al. 1973). 3.1.1 Conditional Independence Semantics of UPIN Structures. Let A, B, and S be any disjoint subsets of nodes in an undirected graph G. G is a UPIN structure for p(U) if for any A, B, and S such that S separates A and B in G, the conditional independence relation A ⊥ B|S holds in p(U). The set of all conditional independence relations implied by separation in G constitutes the (global) Markov properties of G. Figure 1 shows a simple example of a UPIN structure for six variables. Thus, separation in the UPIN structure implies conditional independence in the probability model; i.e., it constrains p(U) to belong to a set of probability models PG that obey the Markov properties of the graph. Note that a complete UG is trivially a UPIN structure for any p(U) in the sense that
there are no constraints on p(U). G is a perfect undirected map for p if G is a UPIN structure for p(U) and all the conditional independence relations present in p(U) are represented by separation in G. For many probability models p there are no perfect undirected maps. A weaker condition is that a UPIN structure G is minimal for a probability model p(U) if the removal of any edge from G implies an independence relation not present in the model p(U); that is, the structure without the edge is no longer a UPIN structure for p(U). Minimality is not equivalent to perfection (for UPIN structures) since, for example, there exist probability models with independencies that cannot be represented as UPINs except for the complete UPIN structure. For example, consider that X and Y are marginally independent but conditionally dependent given Z (e.g., X and Y are two independent causal variables with a common effect Z). In this case the complete graph is the minimal UPIN structure for {X, Y, Z}, but it is not perfect because of the presence of an edge between X and Y. 3.1.2 Probability Functions on UPIN structures. Given a UPIN structure G, the joint probability distribution for U can be expressed as a simple factorization: p(u) = p(x1 , . . . , xN ) =
\prod_{C \in V_C} a_C(x_C), \qquad (3.1)
where VC is the set of cliques of G, xC represents an assignment of values to the variables in a particular clique C, and the aC (xC ) are non-negative clique functions. (The domain of each aC (xC ) is the set of possible assignments of values to the variables in the clique C, and the range of aC (xC ) is the semi-infinite interval [0, ∞).) The set of clique functions associated with a UPIN structure provides the numerical parameterization of the UPIN. A UPIN is equivalent to a Markov random field (Isham 1981). In the Markov random field literature the clique functions are generally referred to as potential functions. A related terminology, used in the context of the Boltzmann machine (Hinton and Sejnowski 1986), is that of energy function. The exponential of the negative energy of a configuration is a Boltzmann factor. Scaling each Boltzmann factor by the sum across Boltzmann factors (the partition function) yields a factorization of the joint density (the Boltzmann distribution), that is, a product of clique functions.¹ The advantage of defining clique functions directly rather than in terms of the exponential of an energy function is that the range of the clique functions can be allowed
to contain zero. Thus equation 3.1 can represent configurations of variables having zero probability.

¹ A Boltzmann machine is a special case of a UPIN in which the clique functions can be decomposed into products of factors associated with pairs of variables. If the Boltzmann machine is augmented to include “higher-order” energy terms, one for each clique in the graph, then we have a general Markov random field or UPIN, restricted to positive probability distributions due to the exponential form of the clique functions.

Figure 2: A triangulated version of the UPIN structure G from Figure 1.

A model p is said to be decomposable if it has a minimal UPIN structure G that is triangulated (see Fig. 2). A UPIN structure G is decomposable if G is triangulated. For the special case of decomposable models, G can be converted to a junction tree, which is a tree of cliques of G arranged such that the cliques satisfy the running intersection property, namely, that each node in G that appears in any two different cliques also appears in all cliques on the undirected path between these two cliques. Associated with each edge in the junction tree is a separator S, such that S contains the variables in the intersection of the two cliques that it links. Given a junction tree representation, one can factorize p(U) as the product of clique marginals over separator marginals (Pearl 1988):

p(u) = \frac{\prod_{C \in V_C} p(x_C)}{\prod_{S \in V_S} p(x_S)}, \qquad (3.2)
where p(xC ) and p(xS ) are the marginal (joint) distributions for the variables in clique C and separator S, respectively, and VC and VS are the set of cliques and separators in the junction tree. This product representation is central to the results in the rest of the article. It is the basis of the fact that globally consistent probability calculations on U can be carried out in a purely local manner. The mechanics of these local calculations will be described later. At this point it is sufficient to note that the complexity of the local inference algorithms scales as the sum of the sizes of the clique state-spaces (where a clique state-space is equal to the product over each variable in the clique of the number of states of each variable). Thus, local clique updating can make probability calculations on U much more tractable than using “brute force” inference if the model decomposes into relatively small cliques.
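As a concrete numerical illustration of equation 3.2 (a sketch with assumed numbers, not part of the original article), consider the decomposable chain X1 − X2 − X3 with cliques {X1 , X2 } and {X2 , X3 } and separator {X2 }:

import numpy as np

# For a decomposable chain X1 - X2 - X3, the joint equals the product of
# clique marginals divided by the separator marginal (equation 3.2).
rng = np.random.default_rng(0)

def random_cpt(shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)  # rows sum to one

p1 = random_cpt((2,))        # p(x1)
p21 = random_cpt((2, 2))     # p(x2 | x1)
p32 = random_cpt((2, 2))     # p(x3 | x2)

# Full joint p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2).
joint = p1[:, None, None] * p21[:, :, None] * p32[None, :, :]

p12 = joint.sum(axis=2)      # clique marginal p(x1, x2)
p23 = joint.sum(axis=0)      # clique marginal p(x2, x3)
p2 = joint.sum(axis=(0, 2))  # separator marginal p(x2)

reconstructed = p12[:, :, None] * p23[None, :, :] / p2[None, :, None]
assert np.allclose(joint, reconstructed)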
Many probability models of interest may not be decomposable. However, we can define a decomposable cover G0 for p such that G0 is a triangulated, but not necessarily minimal, UPIN structure for p. Since any UPIN G can be triangulated simply by the addition of the appropriate edges, one can always identify at least one decomposable cover G0 . However, a decomposable cover may not be minimal in that it can contain edges that obscure certain independencies in the model p; for example, the complete graph is a decomposable cover for all possible probability models p. For efficient inference, the goal is to find a decomposable cover G0 such that G0 contains as few extra edges as possible over the original UPIN structure G. Later we discuss a specific algorithm for finding decomposable covers for arbitrary PIN structures. All singly connected UPIN structures imply probability models PG that are decomposable. Note that given a particular probability model p and a UPIN G for p, the process of adding extra edges to G to create a decomposable cover does not change the underlying probability model p; that is, the added edges are a convenience for manipulating the graphical representation, but the underlying numerical probability specifications remain unchanged. An important point is that decomposable covers have the running intersection property and thus can be factored as in equation 3.2. Thus local clique updating is also possible with nondecomposable models by this conversion. Once again, the complexity of such local inference scales with the sum of the size of the clique state-spaces in the decomposable cover. In summary, any UPIN structure can be converted to a junction tree permitting inference calculations to be carried out purely locally on cliques. 3.2 Directed Probabilistic Independence Networks (DPINs). A DPIN is composed of both a DPIN structure and DPIN parameters. A DPIN structure specifies a set of conditional independence relations for a probability model in the form of a directed graph. DPIN parameters consist of numerical specifications of a particular probability model consistent with the DPIN structure. DPINs are referred to in the literature using different names, including Bayes network, belief network, recursive graphical model, causal (belief) network, and probabilistic (causal) network. 3.2.1 Conditional Independence Semantics of DPIN Structures. A DPIN structure is an ADG GD = (V, E) where there is a one-to-one correspondence between V and the elements of the set of random variables U = {X1 , . . . , XN }. It is convenient to define the moral graph GM of GD as the undirected graph obtained from GD by placing undirected edges between all nonadjacent parents of each node and then dropping the directions from the remaining directed edges (see Fig. 3b for an example). The term moral was coined to denote the “marrying” of “unmarried” (nonadjacent) parents. The motivation behind this procedure will become clear when we discuss the differ-
Figure 3: (a) A DPIN structure GD that captures a set of independence relationships among the set {X1 , . . . , X5 }—for example, X4 ⊥ X1 |X2 . (b) The moral graph GM for GD , where the parents of X4 have been linked.
ences between DPINs and UPINs in Section 3.3. We shall also see that this conversion of a DPIN into a UPIN is a convenient way to solve DPIN inference problems by “transforming” the problem into an undirected graphical setting and taking advantage of the general theory available for undirected graphical models. We can now define a DPIN as follows. Let A, B, and S be any disjoint subsets of nodes in GD . GD is a DPIN structure for p(U) if for any A, B, and S such that S separates A and B in GD , the conditional independence relation A ⊥ B|S holds in p(U). This is the same definition as for a UPIN structure except that separation has a more complex interpretation in the directed context: S separates A from B in a directed graph if S separates A from B in the moral (undirected) graph of the smallest ancestral set containing A, B, and S (Lauritzen et al. 1990). It can be shown that this definition of a DPIN structure is equivalent to the more intuitive statement that given the values of its parents, a variable Xi is independent of all other nodes in the directed graph except for its descendants.
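The moralization procedure is purely graph-theoretic and easy to state in code. The following sketch is our illustration (the five-node graph is invented, in the spirit of Figure 3): it marries all nonadjacent parents of each node and drops edge directions:

from itertools import combinations

# A minimal sketch (not from the paper) of moralization: marry all
# co-parents of each node, then drop the directions on the edges.
def moral_graph(parents):
    """parents: dict mapping node -> list of parent nodes (an ADG)."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                      # keep every edge, now undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):  # marry nonadjacent co-parents
            edges.add(frozenset((p, q)))
    return edges

# An assumed ADG in which X4 has two nonadjacent parents, X2 and X3.
gd = {"X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X4"]}
gm = moral_graph(gd)
print(frozenset(("X2", "X3")) in gm)  # prints True: the parents were married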
Thus, as with a UPIN structure, the DPIN structure implies certain conditional independence relations, which in turn imply a set of probability models p ∈ PGD . Figure 3a contains a simple example of a DPIN structure. There are many possible DPIN structures consistent with a particular probability model p(U), potentially containing extra edges that hide true conditional independence relations. Thus, one can define minimal DPIN structures for p(U) in a manner exactly equivalent to that of UPIN structures: Deletion of an edge in a minimal DPIN structure GD implies an independence relation that does not hold in p(U) ∈ PGD . Similarly, GD is a perfect DPIN structure for p(U) if GD is a DPIN structure for p(U) and all the conditional independence relations present in p(U) are represented by separation in GD . As with UPIN structures, minimal does not imply perfect for DPIN structures. For example, consider the independence relations X1 ⊥ X4 |{X2 , X3 } and X2 ⊥ X3 |{X1 , X4 }: the minimal DPIN structure contains an edge from X3 to X2 (see Fig. 4b). A complete ADG is trivially a DPIN structure for any probability model p(U).

3.2.2 Probability Functions on DPINs. A basic property of a DPIN structure is that it implies a direct factorization of the joint probability distribution p(U):

p(u) = \prod_{i=1}^{N} p(x_i \mid \mathrm{pa}(x_i)), \qquad (3.3)
where pa(xi ) denotes a value assignment for the parents of Xi . A probability model p can be written in this factored form in a trivial manner by the conditioning rule. Note that a directed graph containing directed cycles does not necessarily yield such a factorization, hence the use of ADGs. 3.3 Differences between Directed and Undirected Graphical Representations. It is an important point that directed and undirected graphs possess different conditional independence semantics. There are common conditional independence relations that have perfect DPIN structures but no perfect UPIN structures, and vice versa (see Figure 4 for examples). Does a DPIN structure have the same Markov properties as the UPIN structure obtained by dropping all the directions on the edges in the DPIN structure? The answer is yes if and only if the DPIN structure contains no subgraphs where a node has two or more nonadjacent parents (Whittaker 1990; Pearl et al. 1990). In general, it can be shown that if a UPIN structure G for p is decomposable (triangulated), then it has the same Markov properties as some DPIN structure for p. On a more practical level, DPIN structures are frequently used to encode causal information, that is, to represent the belief formally that Xi precedes Xj in some causal sense (e.g., temporally). DPINs have found application in causal modeling in applied statistics and artificial intelligence. Their popularity in these fields stems from the fact that the joint probability model can
Figure 4: (a) The DPIN structure to encode the fact that X3 depends on X1 and X2 but X1 ⊥ X2 . For example, consider that X1 and X2 are two independent coin flips and that X3 is a bell that rings when the flips are the same. There is no perfect UPIN structure that can encode these dependence relationships. (b) A UPIN structure that encodes X1 ⊥ X4 |{X2 , X3 } and X2 ⊥ X3 |{X1 , X4 }. There is no perfect DPIN structure that can encode these dependencies.
be specified directly by equation 3.3, that is, by the specification of conditional probability tables or functions (Spiegelhalter et al. 1991). In contrast, UPINs must be specified in terms of clique functions (as in equation 3.1), which may not be as easy to work with (cf. Geman and Geman 1984, Modestino and Zhang 1992, and Vandermeulen et al. 1994 for examples of ad hoc design of clique functions in image analysis). UPINs are more frequently used in problems such as image analysis and statistical physics where associations are thought to be correlational rather than causal.

3.4 From DPINs to (Decomposable) UPINs. The moral UPIN structure GM (obtained from the DPIN structure GD ) does not imply any new independence relations that are not present in GD . As with triangulation, however, the additional edges may obscure conditional independence relations implicit in the numeric specification of the original probability model p associated with the DPIN structure GD . Furthermore, GM may not be triangulated (decomposable). By the addition of appropriate edges, the moral graph can be converted to a (nonunique) triangulated graph G0 , namely, a decomposable cover for GM . In this manner, for any probability model p for which GD is a DPIN structure, one can construct a decomposable cover G0 for p. This mapping from DPIN structures to UPIN structures was first discussed in the context of efficient inference algorithms by Lauritzen and
Spiegelhalter (1988). The advantage of this mapping derives from the fact that analysis and manipulation of the resulting UPIN are considerably more direct than dealing with the original DPIN. Furthermore, it has been shown that many of the inference algorithms for DPINs are in fact special cases of inference algorithms for UPINs and can be considerably less efficient (Shachter et al. 1994). 4 Modeling HMMs as PINs 4.1 PINs for HMMs. In hidden Markov modeling problems (Baum and Petrie 1966; Poritz 1988; Rabiner 1989; Huang et al. 1990; Elliott et al. 1995) we are interested in the set of random variables U = {H1 , O1 , H2 , O2 , . . . , HN−1 , ON−1 , HN , ON }, where Hi is a discrete-valued hidden variable at index i, and Oi is the corresponding discrete-valued observed variable at index i, 1 ≤ i ≤ N. (The results here can be directly extended to continuous-valued observables.) The index i denotes a sequence from 1 to N, for example, discrete time steps. Note that Oi is considered univariate for convenience: the extension to the multivariate case with d observables is straightforward but is omitted here for simplicity since it does not illuminate the conditional independence relationships in the HMM. The well-known simple first-order HMM obeys the following two conditional independence relations: Hi ⊥ {H1 , O1 , . . . , Hi−2 , Oi−2 , Oi−1 } | Hi−1 ,
3 ≤ i ≤ N (4.1)

and

Oi ⊥ {H1 , O1 , . . . , Hi−1 , Oi−1 } | Hi , 2 ≤ i ≤ N. (4.2)
We will refer to this “first-order” hidden Markov probability model as HMM(1,1): the notation HMM(K, J) is defined such that the hidden state of the model is represented via the conjoined configuration of J underlying random variables and such that the model has state memory of depth K. The notation will be clearer in later sections when we discuss specific examples with K, J > 1. Construction of a PIN for HMM(1,1) is simple. In the undirected case, assumption 1 requires that each state Hi is connected to Hi−1 only from the set {H1 , O1 , . . . , Hi−2 , Oi−2 , Hi−1 , Oi−1 }. Assumption 2 requires that Oi is connected only to Hi . The resulting UPIN structure for HMM(1,1) is shown in Figure 5a. This graph is singly connected and thus implies a decomposable probability model p for HMM(1,1), where the cliques are of the form {Hi , Oi } and {Hi−1 , Hi } (see Fig. 5b). In Section 5 we will see how the joint probability function can be expressed as a product function on the junction tree, thus leading to a junction tree definition of the familiar F-B and Viterbi inference algorithms.
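Written out via equation 3.3, the HMM(1,1) joint probability of a complete configuration is p(h1 )p(o1 |h1 ) ∏i p(hi |hi−1 )p(oi |hi ). The sketch below (parameter values assumed for illustration) evaluates this product directly:

import numpy as np

# A sketch of the HMM(1,1) probability model written in the DPIN factored
# form of equation 3.3; all numbers are assumed.
pi = np.array([0.6, 0.4])                 # p(h1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # p(h_i | h_{i-1})
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # p(o_i | h_i)

def joint(h, o):
    """Probability of one complete hidden/observed configuration."""
    p = pi[h[0]] * B[h[0], o[0]]
    for t in range(1, len(h)):
        p *= A[h[t - 1], h[t]] * B[h[t], o[t]]
    return p

print(joint([0, 1, 1], [0, 0, 1]))  # one term of the full joint distribution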
Figure 5: (a) PIN structure for HMM(1,1). (b) A corresponding junction tree.
For the directed case, the connectivity for the DPIN structure is the same. It is natural to choose the directions on the edges between Hi−1 and Hi as going from i − 1 to i (although the reverse direction could also be chosen without changing the Markov properties of the graph). The directions on the edges between Hi and Oi must be chosen as going from Hi to Oi rather than in the reverse direction (see Figure 6a). In reverse (see Fig. 6b) the arrows would imply that Oi is marginally independent of Hi−1 , which is not true in the HMM(1,1) probability model. The proper direction for the edges implies the correct relation, namely, that Oi is conditionally independent of Hi−1 given Hi . The DPIN structure for HMM(1,1) does not possess a subgraph with nonadjacent parents. As stated earlier, this implies that the implied independence properties of the DPIN structure are the same as those of the corresponding UPIN structure obtained by dropping the directions from the edges in the DPIN structure, and thus they both result in the same junction tree structure (see Fig. 5b). Thus, for the HMM(1,1) probability model, the minimal directed and undirected graphs possess the same Markov properties; they imply the same conditional independence relations. Furthermore, both PIN structures are perfect maps for the directed and undirected cases, respectively.
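The claim about edge directions can also be checked numerically. In the sketch below (numbers assumed), the conditional distribution p(O2 |H1 ) varies with H1 , so O2 and H1 are marginally dependent, although O2 ⊥ H1 | H2 still holds:

import numpy as np

# A numeric check that O2 is *not* marginally independent of H1 in
# HMM(1,1), so the reversed arrows of Figure 6b encode a false independence.
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # p(h2 | h1)
B = np.array([[0.8, 0.2], [0.4, 0.6]])   # p(o2 | h2)

p_o2_given_h1 = A @ B                    # sum over h2 of p(h2|h1) p(o2|h2)
print(p_o2_given_h1)
# The two rows differ, so p(O2 | H1) genuinely depends on H1.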
Figure 6: DPIN structures for HMM(1,1). (a) The DPIN structure for the HMM(1,1) probability model. (b) A DPIN structure that is not a DPIN structure for the HMM(1,1) probability model.
4.2 Inference and MAP Problems in HMMs. In the context of HMMs, the most common inference problem is the calculation of the likelihood of the observed evidence given the model, that is, p(o1 , . . . , oN |model), where the o1 , . . . , oN denote observed values for O1 , . . . , ON . (In this section we will assume that we are dealing with one particular model where the structure and parameters have already been determined, and thus we will not explicitly indicate conditioning on the model.) The “brute force” method for obtaining this probability would be to sum out the unobserved state variables from the full joint probability distribution:

p(o_1, \ldots, o_N) = \sum_{h_1, \ldots, h_N} p(h_1, o_1, \ldots, h_N, o_N), \qquad (4.3)

where hi denotes the possible values of hidden variable Hi . In general, both of these computations scale as m^N, where m is the number of states for each hidden variable. In practice, the F-B algorithm (Poritz 1988; Rabiner 1989) can perform these inference calculations with much lower complexity, namely, Nm^2 . The likelihood of the observed evidence
can be obtained with the forward step of the F-B algorithm: calculation of the state posterior probabilities requires both forward and backward steps. The F-B algorithm relies on a factorization of the joint probability function to obtain locally recursive methods. One of the key points in this article is that the graphical modeling approach provides an automatic method for determining such local efficient factorizations, for an arbitrary probabilistic model, if efficient factorizations exist given the conditional independence (CI) relations specified in the model. The MAP identification problem in the context of HMMs involves identifying the most likely hidden state sequence given the observed evidence. Just as with the inference problem, the Viterbi algorithm provides an efficient, locally recursive method for solving this problem with complexity Nm^2 , and again, as with the inference problem, the graphical modeling approach provides an automatic technique for determining efficient solutions to the MAP problem for arbitrary models, if an efficient solution is possible given the structure of the model.

5 Inference and MAP Algorithms for DPINs

Inference and MAP algorithms for DPINs and UPINs are quite similar: the UPIN case involves some subtleties not encountered in DPINs, and so discussion of UPIN inference and MAP algorithms is deferred until Section 7. The inference algorithm for DPINs (developed by Jensen et al. 1990, and hereafter referred to as the JLO algorithm) is a descendant of an inference algorithm first described by Lauritzen and Spiegelhalter (1988). The JLO algorithm applies to discrete-valued variables: extensions to the JLO algorithm for gaussian and gaussian-mixture distributions are discussed in Lauritzen and Wermuth (1989). A closely related algorithm to the JLO algorithm, developed by Dawid (1992a), solves the MAP identification problem with the same time complexity as the JLO inference algorithm. We show that the JLO and Dawid algorithms are strict generalizations of the well-known F-B and Viterbi algorithms for HMM(1,1) in that they can be applied to arbitrarily complex graph structures (and thus a large family of probabilistic models beyond HMM(1,1)) and handle missing values, partial inference, and so forth in a straightforward manner. There are many variations on the basic JLO and Dawid algorithms. For example, Pearl (1988) describes related versions of these algorithms in his early work. However, it can be shown (Shachter et al. 1994) that all known exact algorithms for inference on DPINs are equivalent at some level to the JLO and Dawid algorithms. Thus, it is sufficient to consider the JLO and Dawid algorithms in our discussion as they subsume other graphical inference algorithms.²

² An alternative set of computational formalisms is provided by the statistical physics literature, where undirected graphical models in the form of chains, trees, lattices, and “decorated” variations on chains and trees have been studied for many years (see, e.g., Itzykson and Drouffé 1991). The general methods developed there, notably the transfer matrix formalism (e.g., Morgenstern and Binder 1983), support exact calculations on general undirected graphs. The transfer matrix recursions and the calculations in the JLO algorithm are closely related, and a reasonable hypothesis is that they are equivalent formalisms. (The question does not appear to have been studied in the general case, although see Stolorz 1994 and Saul and Jordan 1995 for special cases.) The appeal of the JLO framework, in the context of this earlier literature on exact calculations, is the link that it provides to conditional probability models (i.e., directed graphs) and the focus on a particular data structure—the junction tree—as the generic data structure underlying exact calculations. This does not, of course, diminish the potential importance of statistical physics methodology in graphical modeling applications. One area where there is clearly much to be gained from links to statistical physics is the area of approximate calculations, where a wide variety of methods are available (see, e.g., Swendsen and Wang 1987).

The JLO and Dawid algorithms operate as a two-step process:

1. The construction step. This involves a series of substeps where the original directed graph is moralized and triangulated, a junction tree is formed, and the junction tree is initialized.

2. The propagation step. The junction tree is used in a local message-passing manner to propagate the effects of observed evidence, that is, to solve the inference and MAP problems.

The first step is carried out only once for a given graph. The second (propagation) step is carried out each time a new inference for the given graph is requested.

5.1 The Construction Step of the JLO Algorithm: From DPIN Structures to Junction Trees. We illustrate the construction step of the JLO algorithm using the simple DPIN structure, GD , over discrete variables U = {X1 , . . . , X6 } shown in Figure 7a. The JLO algorithm first constructs the moral graph GM (see Fig. 7b). It then triangulates the moral graph GM to obtain a decomposable cover G0 (see Fig. 7c). The algorithm operates in a simple, greedy manner based on the fact that a graph is triangulated if and only if all of its nodes can be eliminated, where a node can be eliminated whenever all of its neighbors are pairwise-linked. Whenever a node is eliminated, it and its neighbors define a clique in the junction tree that is eventually constructed. Thus, we can triangulate a graph and generate the cliques for the junction tree by eliminating nodes in some order, adding links if necessary. If no node can be eliminated without adding links, then we choose the node that can be eliminated by adding the links that yield the clique with the smallest state-space. After triangulation, the JLO algorithm constructs a junction tree from G0 (i.e., a clique tree satisfying the running intersection property). The junction tree construction is based on the following fact: Define the weight of a link
Figure 7: (a) A simple DPIN structure GD . (b) The corresponding (undirected) moral graph GM . (c) The corresponding triangulated graph G0 . (d) The corresponding junction tree.
between two cliques as the number of variables in their intersection. Then a tree of cliques will satisfy the running intersection property if and only if it is a spanning tree of maximal weight. Thus, the JLO algorithm constructs a junction tree by choosing successively a link of maximal weight unless it creates a cycle. The junction tree constructed from the cliques defined by the DPIN structure triangulation in Figure 7c is shown in Figure 7d. The worst-case complexity is O(N^3) for the triangulation heuristic and O(N^2 log N) for the maximal spanning tree portion of the algorithm. This construction step is carried out only once as an initial step to convert the original graph to a junction tree representation.

5.2 Initializing the Potential Functions in the Junction Tree. The next step is to take the numeric probability specifications as defined on the directed graph GD (see equation 3.3) and convert this information into the general form for a junction tree representation of p (see equation 3.2). This is achieved by noting that each variable Xi is contained in at least one clique in the junction tree. Assign each Xi to just one such clique, and for each clique
define the potential function a_C(x_C) to be either the product of p(Xi |pa(Xi )) over all Xi assigned to clique C, or 1 if no variables are assigned to that clique. Define the separator potentials (in equation 3.2) to be 1 initially. In the section that follows, we describe the general JLO algorithm for propagating messages through the junction tree to achieve globally consistent probability calculations. At this point it is sufficient to know that a schedule of local message passing can be defined that converges to a globally consistent marginal representation for p; that is, the potential on any clique or separator is the marginal for that clique or separator (the joint probability function). Thus, via local message passing, one can go from the initial potential representation defined above to a marginal representation:

p(u) = \frac{\prod_{C \in V_C} p(x_C)}{\prod_{S \in V_S} p(x_S)}. \qquad (5.1)
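For a small chain X1 → X2 → X3 with cliques {X1 , X2 } and {X2 , X3 }, the initialization just described can be sketched as follows (a toy example with assumed tables, not the JLO implementation itself):

import numpy as np

# Assign each conditional table to one clique, set separator potentials
# to 1; the resulting clique/separator product represents p(U).
p1 = np.array([0.5, 0.5])                     # p(x1)
p21 = np.array([[0.9, 0.1], [0.2, 0.8]])      # p(x2 | x1)
p32 = np.array([[0.6, 0.4], [0.3, 0.7]])      # p(x3 | x2)

a12 = p1[:, None] * p21        # clique {X1,X2} receives p(x1) and p(x2|x1)
a23 = p32.copy()               # clique {X2,X3} receives p(x3|x2)
b2 = np.ones(2)                # separator {X2} potential initialized to 1

# Representation check (cf. equation 5.2): product of cliques over separators.
joint = a12[:, :, None] * a23[None, :, :] / b2[None, :, None]
print(joint.sum())             # prints 1.0: a valid joint distribution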
At this point the junction tree is initialized. This operation in itself is not that useful; of more interest is the ability to propagate information through the graph given some observed data and the initialized junction tree (e.g., to calculate the posterior distributions of some variables of interest). From this point onward we will implicitly assume that the junction tree has been initialized as described so that the potential functions are the local marginals.

5.3 Local Message Propagation in Junction Trees Using the JLO Algorithm. In general p(U) can be expressed as

p(u) = \frac{\prod_{C \in V_C} a_C(x_C)}{\prod_{S \in V_S} b_S(x_S)}, \qquad (5.2)
where the aC and bS are nonnegative potential functions (the potential functions could be the initial marginals described above, for example). Note that this representation is a generalization of the representations for p(u) given by equations 3.1 and 3.2. K = ({aC : C ∈ VC }, {bS : S ∈ SC }) is a representation for p(U). A factorizable function p(U) can admit many different representations, that is, many different sets of clique and separator functions that satisfy equation 5.2 given a particular p(U). The JLO algorithm carries out globally consistent probability calculations via local message passing on the junction tree; probability information is passed between neighboring cliques, and clique and separator potentials are updated based on this local information. A key point is that the cliques and separators are updated in a fashion that ensures that at all times K is a representation for p(U); in other words, equation 5.2 holds at all times. Eventually the propagation converges to the marginal representation given the initial model and the observed evidence.
The message passing proceeds as follows: We can define a flow from clique Ci to Cj in the following manner, where Ci and Cj are two cliques adjacent in the junction tree. Let Sk be the separator for these two cliques. Define

b^*_{S_k}(x_{S_k}) = \sum_{C_i \setminus S_k} a_{C_i}(x_{C_i}) \qquad (5.3)

where the summation is over the state-space of variables that are in Ci but not in Sk , and

a^*_{C_j}(x_{C_j}) = a_{C_j}(x_{C_j}) \lambda_{S_k}(x_{S_k}) \qquad (5.4)

where

\lambda_{S_k}(x_{S_k}) = \frac{b^*_{S_k}(x_{S_k})}{b_{S_k}(x_{S_k})}. \qquad (5.5)
λSk (xSk ) is the update factor. Passage of a flow corresponds to updating the neighboring clique with the probability information contained in the originating clique. This flow induces a new representation K∗ = ({a∗C : C ∈ VC }, {b∗S : S ∈ SC }) for p(U). A schedule of such flows can be defined such that all cliques are eventually updated with all relevant information and the junction tree reaches an equilibrium state. The most direct scheduling scheme is a two-phase operation where one node is denoted the root of the junction tree. The collection phase involves passing flows along all edges toward the root clique (if a node is scheduled to have more than one incoming flow, the flows are absorbed sequentially). Once collection is complete, the distribution phase involves passing flows out from this root in the reverse direction along the same edges. There are at most two flows along any edge in the tree in a nonredundant schedule. Note that the directionality of the flows in the junction tree need have nothing to do with any directed edges in the original DPIN structure.

5.4 The JLO Algorithm for Inference Given Observed Evidence. The particular case of calculating the effect of observed evidence (inference) is handled in the following manner: Consider that we observe evidence of the form e = {X_i = x_i^*, X_j = x_j^*, . . .}, and U^e = {X_i , X_j , . . .} denotes the set of variables observed. Let U^h = U \ U^e denote the set of hidden or unobserved variables and u^h a value assignment for U^h . Consider the calculation of p(U^h |e). Define an evidence function g^e(x_i) such that

g^e(x_i) = \begin{cases} 1 & \text{if } x_i = x_i^* \\ 0 & \text{otherwise.} \end{cases} \qquad (5.6)
Let

f^*(u) = p(u) \prod_{X_i \in U^e} g^e(x_i). \qquad (5.7)
Thus, we have that f^*(u) ∝ p(u^h |e). To obtain f^*(u) by operations on the junction tree, one proceeds as follows: First assign each observed variable Xi ∈ U^e to one particular clique that contains it (this is termed “entering the evidence into the clique”). Let C^E denote the set of all cliques into which evidence is entered in this manner. For each C ∈ C^E let

g_C(x_C) = \prod_{\{i: X_i \text{ is entered into } C\}} g^e(x_i). \qquad (5.8)

Thus,

f^*(u) = p(u) \times \prod_{C \in C^E} g_C(x_C). \qquad (5.9)
One can now propagate the effects of these modifications throughout the tree using the collect-and-distribute schedule described in Section 5.3. Let x_C^h denote a value assignment of the hidden (unobserved) variables in clique C. When the schedule of flows is complete, one gets a new representation K_f^* such that the local potential on each clique is f^*(x_C) = p(x_C^h, e), that is, the joint probability of the local unobserved clique variables and the observed evidence (Jensen et al. 1990) (similarly for the separator potential functions). If one marginalizes at the clique over the unobserved local clique variables,

\sum_{x_C^h} p(x_C^h, e) = p(e), \qquad (5.10)
one gets the probability of the observed evidence directly. Similarly, if one normalizes the potential function at a clique to sum to one, one obtains the conditional probability of the local unobserved clique variables given the evidence, p(x_C^h |e).

5.5 Complexity of the Propagation Step of the JLO Algorithm. In general, the time complexity T of propagation within a junction tree is O(\sum_{i=1}^{N_C} s(C_i)), where N_C is the number of cliques in the junction tree and s(C_i) is the number of states in the clique state-space of C_i . Thus, for inference to be efficient, we need to construct junction trees with small clique sizes. Problems of finding optimally small junction trees (e.g., finding the junction tree with the smallest maximal clique) are NP-hard. Nonetheless, the heuristic algorithm for triangulation described earlier has been found to work well in practice (Jensen et al. 1990).
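To make the propagation concrete, the following sketch (our illustration, with assumed parameters) runs the collect-and-distribute schedule of Section 5.3 on the HMM(1,1) junction tree of Figure 5b, implementing the flows of equations 5.3 through 5.5 on the chain cliques {Ht , Ht+1 }. After propagation, every clique potential sums to the same value, the likelihood p(e), and the total cost is O(Nm^2):

import numpy as np

# Chain cliques C_t = {H_t, H_{t+1}}; the evidence from the {H_t, O_t}
# cliques has already been multiplied into the chain potentials.
m, N = 3, 6
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(m))                 # p(h_1)
A = rng.dirichlet(np.ones(m), size=m)          # p(h_t | h_{t-1})
B = rng.dirichlet(np.ones(2), size=m)          # p(o_t | h_t), binary obs
obs = rng.integers(0, 2, size=N)

# Initial clique potentials a[t](h_t, h_{t+1}) and unit separator potentials.
a = [A * B[None, :, obs[t + 1]] for t in range(N - 1)]
a[0] *= (pi * B[:, obs[0]])[:, None]
sep = [np.ones(m) for _ in range(N - 2)]

def flow_right(t):
    """Flow C_t -> C_{t+1} through the separator over h_{t+1} (eqs. 5.3-5.5)."""
    b_new = a[t].sum(axis=0)                   # eq. 5.3: marginalize onto separator
    a[t + 1] *= (b_new / sep[t])[:, None]      # eqs. 5.4-5.5: apply update factor
    sep[t] = b_new

def flow_left(t):
    """Flow C_{t+1} -> C_t through the same separator."""
    b_new = a[t + 1].sum(axis=1)
    a[t] *= (b_new / sep[t])[None, :]
    sep[t] = b_new

for t in range(N - 2):            # collection toward the last clique
    flow_right(t)
for t in reversed(range(N - 2)):  # distribution back along the chain
    flow_left(t)

# Each clique now holds p(h_t, h_{t+1}, e); each sums to the likelihood p(e).
print([round(float(x.sum()), 12) for x in a])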
6 Inference and MAP Calculations in HMM(1,1)

6.1 The F-B Algorithm for HMM(1,1) Is a Special Case of the JLO Algorithm. Figure 5b shows the junction tree for HMM(1,1). One can apply the JLO algorithm to the HMM(1,1) junction tree structure to obtain a particular inference algorithm for HMM(1,1). The HMM(1,1) inference problem consists of being given a set of values for the observable variables,

e = \{O_1 = o_1, O_2 = o_2, \ldots, O_N = o_N\} \qquad (6.1)
and inferring the likelihood of e given the model. As described in the previous section, this problem can be solved exactly by local propagation in any junction tree using the JLO inference algorithm. In Appendix A it is shown that both the forward and backward steps of the F-B procedure for HMM(1,1) are exactly recreated by the more general JLO algorithm when the HMM(1,1) is viewed as a PIN. This equivalence is not surprising since both algorithms are solving exactly the same problem by local recursive updating. The equivalence is useful because it provides a link between well-known HMM inference algorithms and more general PIN inference algorithms. Furthermore, it clearly demonstrates how the PIN framework can provide a direct avenue for analyzing and using more complex hidden Markov probability models (we will discuss such HMMs in Section 8). When evidence is entered into the observable states and assuming m discrete states per hidden variable, the computational complexity of solving the inference problem via the JLO algorithm is O(Nm^2) (the same complexity as the standard F-B procedure). Note that the obvious structural equivalence between PIN structures and HMM(1,1) has been noted before by Buntine (1994), Frasconi and Bengio (1994), and Lucke (1995) among others; however, this is the first publication of equivalence of specific inference algorithms as far as we are aware.

6.2 Equivalence of Dawid’s Propagation Algorithm for Identifying MAP Assignments and the Viterbi Algorithm. Consider that one wishes to calculate \hat{f}(u^h, e) = \max_{x_1, \ldots, x_K} p(x_1, \ldots, x_K, e) and also to identify a set of values of the unobserved variables that achieve this maximum, where K is the number of unobserved (hidden) variables. This calculation can be achieved using a local propagation algorithm on the junction tree with two modifications to the standard JLO inference algorithm. This algorithm is due to Dawid (1992a); this is the most general algorithm from a set of related methods. First, during a flow, the marginalization of the separator is replaced by

\hat{b}_S(x_S) = \max_{C \setminus S} a_C(x_C), \qquad (6.2)
where C is the originating clique for the flow. The definition for λS (xS ) is also changed in the obvious manner. Second, marginalization within a clique is replaced by maximization:

\hat{f}_C = \max_{u \setminus x_C} p(u). \qquad (6.3)
Given these two changes, it can be shown that if the same propagation operations are carried out as described earlier, the resulting representation \hat{K}_f at equilibrium is such that the potential function on each clique C is

\hat{f}(x_C) = \max_{u^h \setminus x_C} p(x_C^h, e, \{u^h \setminus x_C\}), \qquad (6.4)
where x_C^h denotes a value assignment of the hidden (unobserved) variables in clique C. Thus, once the \hat{K}_f representation is obtained, one can locally identify the values of X_C^h , which maximize the full joint probability, as

\hat{x}_C^h = \arg_{x_C^h} \hat{f}(x_C). \qquad (6.5)
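Specialized to the HMM(1,1) chain, the two modifications above reduce to the familiar Viterbi recursion. The sketch below (assumed parameters, our illustration) records the maximizing arguments during the max-flows and reads off the maximizing configuration by backtracking:

import numpy as np

# Max-propagation on the HMM(1,1) chain: replacing the sum in eq. 5.3
# with a max (eq. 6.2) yields the Viterbi recursion.
m, N = 3, 6
rng = np.random.default_rng(2)
pi = rng.dirichlet(np.ones(m))
A = rng.dirichlet(np.ones(m), size=m)
B = rng.dirichlet(np.ones(2), size=m)
obs = rng.integers(0, 2, size=N)

delta = pi * B[:, obs[0]]                     # max-potential over h_1
back = np.zeros((N, m), dtype=int)
for t in range(1, N):
    scores = delta[:, None] * A * B[None, :, obs[t]]
    back[t] = scores.argmax(axis=0)           # best predecessor of each state
    delta = scores.max(axis=0)                # max replaces marginalization

h = [int(delta.argmax())]
for t in range(N - 1, 0, -1):                 # backtrack to decode the states
    h.append(int(back[t][h[-1]]))
print(list(reversed(h)), float(delta.max()))  # MAP states and max p(h, e)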
In the probabilistic expert systems literature, this procedure is known as generating the most probable explanation (MPE) given the observed evidence (Pearl 1988). The HMM(1,1) MAP problem consists of being given a set of values for the observable variables, e = {O1 = o1 , O2 = o2 , . . . , ON = oN }, and inferring

\max_{h_1, \ldots, h_N} p(h_1, \ldots, h_N, e) \qquad (6.6)
or the set of arguments that achieve this maximum. Since Dawid’s algorithm is applicable to any junction tree, it can be directly applied to the HMM(1,1) junction tree in Figure 5b. In Appendix B it is shown that Dawid’s algorithm, when applied to HMM(1,1), is exactly equivalent to the standard Viterbi algorithm. Once again the equivalence is not surprising: Dawid’s method and the Viterbi algorithm are both direct applications of dynamic programming to the MAP problem. However, once again, the important point is that Dawid’s algorithm is specified for the general case of arbitrary PIN structures and can thus be directly applied to more complex HMMs than HMM(1,1) (such as those discussed later in Section 8). 7 Inference and MAP Algorithms for UPINs In Section 5 we described the JLO algorithm for local inference given a DPIN. For UPINs the procedure is very similar except for two changes to
7 Inference and MAP Algorithms for UPINs

In Section 5 we described the JLO algorithm for local inference given a DPIN. For UPINs the procedure is very similar except for two changes to the overall algorithm: the moralization step is not necessary, and initialization of the junction tree is less trivial.

In Section 5.2 we described how to go from a specification of conditional probabilities in a directed graph to an initial potential function representation on the cliques in the junction tree. To utilize undirected links in the model specification process requires new machinery to perform the initialization step. In particular, we wish to compile the model into the standard form of a product of potentials on the cliques of a triangulated graph (cf. equation 3.1):

P(u) = \prod_{C \in V_C} a_C(x_C) .
Once this initialization step has been achieved, the JLO propagation procedure proceeds as before.

Consider the chordless cycle shown in Figure 4b. Suppose that we parameterize the probability distribution on this graph by specifying pairwise marginals on the four pairs of neighboring nodes. We wish to convert such a local specification into a globally consistent joint probability distribution, that is, a marginal representation. An algorithm known as iterative proportional fitting (IPF) is available to perform this conversion. Classically, IPF proceeds as follows (Bishop et al. 1973). Suppose for simplicity that all of the random variables are discrete (a gaussian version of IPF is also available; Whittaker 1990), so that the joint distribution can be represented as a table. The table is initialized with equal values in all of the cells. For each marginal in turn, the table is then rescaled by multiplying every cell by the ratio of the desired marginal to the corresponding marginal in the current table. The algorithm visits each marginal in turn, iterating over the set of marginals. If the set of marginals is consistent with a single joint distribution, the algorithm is guaranteed to converge to that joint distribution. Once the joint is available, the potentials in equation 3.1 can be obtained (in principle) by marginalization.

Although IPF solves the initialization problem in principle, it is inefficient. Jiřousek and Přeučil (1995) developed an efficient version of IPF that avoids the need both for storing the joint distribution as a table and for explicit marginalization of the joint to obtain the clique potentials. Jiřousek's version of IPF represents the evolving joint distribution directly in terms of junction tree potentials. The algorithm proceeds as follows: Let \mathcal{I} be a set of subsets of V. For each I \in \mathcal{I}, let q(x_I) denote the desired marginal on the subset I. Let the joint distribution be represented as a product over junction tree potentials (see equation 3.1), where each a_C is initialized to an arbitrary constant. Visit each I \in \mathcal{I} in turn, updating the corresponding clique potential a_C (i.e., that potential a_C for which I \subseteq C) as follows:

a_C^*(x_C) = a_C(x_C) \frac{q(x_I)}{p(x_I)} .
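As a concrete illustration of the classical table-based procedure described above (not of the Jiřousek and Přeučil junction tree version), here is a minimal sketch. The variable pairs, table shape, and sweep count are assumptions for the example, and the desired marginals are assumed strictly positive and mutually consistent.

```python
import numpy as np

def ipf(q, pairs, shape=(2, 2, 2, 2), sweeps=50):
    """q[k] is the desired marginal table for the variable pair pairs[k]
    (indices with i < j); marginals are assumed positive and consistent."""
    p = np.full(shape, 1.0 / np.prod(shape))      # equal values in all cells
    for _ in range(sweeps):
        for (i, j), q_ij in zip(pairs, q):
            axes = tuple(k for k in range(p.ndim) if k not in (i, j))
            m = p.sum(axis=axes)                  # current marginal on (i, j)
            idx = tuple(slice(None) if k in (i, j) else None
                        for k in range(p.ndim))
            p = p * (q_ij / m)[idx]               # rescale every cell
    return p

# Example: the four-cycle of Figure 4b with uniform pairwise marginals.
joint = ipf([np.full((2, 2), 0.25)] * 4, [(0, 1), (1, 2), (2, 3), (0, 3)])
```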
The marginal p(x_I) is obtained via the JLO algorithm, using the current set of clique potentials. Intelligent choices can be made for the order in which to visit the marginals so as to minimize the amount of propagation needed to compute p(x_I). This algorithm is simply an efficient way of organizing the IPF calculations and inherits the latter's guarantees of convergence. Note that the Jiřousek and Přeučil algorithm requires a triangulation step in order to form the junction tree used in the calculation of p(x_I). In the worst case, triangulation can yield a highly connected graph, in which case the Jiřousek and Přeučil algorithm reduces to classical IPF. For sparse graphs, however, when the maximum clique is much smaller than the entire graph, the algorithm should be substantially more efficient than classical IPF. Moreover, the triangulation algorithm itself need only be run once, as a preprocessing step (as is the case for the JLO algorithm).

8 More Complex HMMs for Speech Modeling

Although HMMs have provided an exceedingly useful framework for the modeling of speech signals, it is also true that the simple HMM(1,1) model underlying the standard framework has strong limitations as a model of speech. Real speech is generated by a set of coupled dynamical systems (lips, tongue, glottis, lungs, air columns, etc.), each obeying particular dynamical laws. This coupled physical process is not well modeled by the unstructured state transition matrix of HMM(1,1). Moreover, the first-order Markov properties of HMM(1,1) are not well suited to modeling the ubiquitous coarticulation effects that occur in speech, particularly coarticulatory effects that extend across several phonemes (Kent and Minifie 1977). A variety of techniques have been developed to surmount these basic weaknesses of the HMM(1,1) model, including mixture modeling of emission probabilities, triphone modeling, and discriminative training. All of these methods, however, leave intact the basic probabilistic structure of HMM(1,1) as expressed by its PIN structure.

In this section we describe several extensions of HMM(1,1) that assume additional probabilistic structure beyond that assumed by HMM(1,1). PINs provide a key tool in the study of these more complex models. The role of PINs is twofold: they provide a concise description of the probabilistic dependencies assumed by a particular model, and they provide a general algorithm for computing likelihoods. This second property is particularly important because the existence of the JLO algorithm frees us from having to derive particular recursive algorithms on a case-by-case basis.

The first model that we consider can be viewed as a coupling of two HMM(1,1) chains (Saul and Jordan 1995). Such a model can be useful in general sensor fusion problems, for example, in the fusion of an audio signal with a video signal in lipreading. Because different sensory signals generally have different bandwidths, it may be useful to couple separate Markov models that are developed specifically for each of the individual signals. The
alternative is to force the problem into an HMM(1,1) framework by either oversampling the slower signal, which requires additional parameters and leads to a high-variance estimator, or downsampling the faster signal, which generally oversmoothes the data and yields a biased estimator.

Consider the HMM(1,2) structure shown in Figure 8a. This model involves two HMM(1,1) backbones that are coupled together by undirected links between the state variables. Let H_i^{(1)} and O_i^{(1)} denote the ith state and ith output of the "fast" chain, respectively, and let H_i^{(2)} and O_i^{(2)} denote the ith state and ith output of the "slow" chain. Suppose that the fast chain is sampled \tau times as often as the slow chain. Then H_{i'}^{(1)} is connected to H_i^{(2)} for i' equal to \tau(i - 1) + 1. Given this value for i', the Markov model for the coupled chain implies the following conditional independencies for the state variables:

\{H_{i'}^{(1)}, H_i^{(2)}\} \perp \{H_1^{(1)}, O_1^{(1)}, H_1^{(2)}, O_1^{(2)}, \ldots, H_{i'-2}^{(1)}, O_{i'-2}^{(1)}, H_{i-2}^{(2)}, O_{i-2}^{(2)}, O_{i'-1}^{(1)}, O_{i-1}^{(2)}\} \mid \{H_{i'-1}^{(1)}, H_{i-1}^{(2)}\} ,   (8.1)
as well as the following conditional independencies for the output variables:

\{O_{i'}^{(1)}, O_i^{(2)}\} \perp \{H_1^{(1)}, O_1^{(1)}, H_1^{(2)}, O_1^{(2)}, \ldots, H_{i'-1}^{(1)}, O_{i'-1}^{(1)}, H_{i-1}^{(2)}, O_{i-1}^{(2)}\} \mid \{H_{i'}^{(1)}, H_i^{(2)}\} .   (8.2)
Additional conditional independencies can be read off the UPIN structure (see Figure 8a).

As is readily seen in Figure 8a, the HMM(1,2) graph is not triangulated; thus, the HMM(1,2) probability model is not decomposable. However, the graph can readily be triangulated to form a decomposable cover for the HMM(1,2) probability model (see Section 3.1.2). The JLO algorithm provides an efficient algorithm for calculating likelihoods in this graph. This can be seen in Figure 8b, where we show a triangulation of the HMM(1,2) graph. The triangulation adds O(N_h) links to the graph (where N_h is the number of hidden nodes in the graph) and creates a junction tree in which each clique is a cluster of three state variables from the underlying UPIN structure. Assuming m values for each state variable in each chain, we obtain an algorithm whose time complexity is O(N_h m^3). This can be compared to the naive approach of transforming the HMM(1,2) model to a Cartesian product HMM(1,1) model, which not only has the disadvantage of requiring subsampling or oversampling but also has a time complexity of O(N_h m^4).

Directed graph semantics can also play an important role in constructing interesting variations on the HMM theme. Consider Figure 9a, which shows an HMM(1,2) model in which a single output stream is coupled to a pair of underlying state sequences. In a speech modeling application, such a structure might be used to capture the fact that a given acoustic pattern can have multiple underlying articulatory causes. For example, equivalent shifts in
Figure 8: (a) The UPIN structure for the HMM(1,2) model with τ = 2. (b) A triangulation of this UPIN structure.
formant frequencies can be caused by lip rounding or tongue raising; such phenomena are generically referred to as "trading relations" in the speech psychophysics literature (Lindblom 1990; Perkell et al. 1993). Once a particular acoustic pattern is observed, the causes become dependent; thus, for example, evidence that the lips are rounded would act to discount inferences that the tongue has been raised. These inferences propagate forward and backward in time and couple the chains. Formally, these induced dependencies are accounted for by the links added between the state sequences during the moralization of the graph (see Figure 9b). This figure shows that the underlying calculations for this model are closely related to those of the earlier HMM(1,2), but the model specification is very different in the two cases.

Saul and Jordan (1996) have proposed a second extension of the HMM(1,1) model that is motivated by the desire to provide a more effective model of coarticulation (see also Stolorz 1994). In this model, shown in Figure 10, coarticulatory influences are modeled by additional links between output variables and states along an HMM(1,1) backbone. One approach to performing calculations in this model is to treat it as a Kth-order Markov chain and transform it into an HMM(1,1) model by defining higher-order state variables. A graphical modeling approach is more flexible. It is possible, for example, to introduce links between states and outputs K time steps apart without introducing links for the intervening time intervals. More generally, the graphical modeling approach to the HMM(K,1) model allows the specification of different interaction matrices at different time scales; this is awkward in the Kth-order Markov chain formalism. The HMM(3,1) graph is triangulated as is, and thus the time complexity of the JLO algorithm is O(N_h m^3). In general an HMM(K,1) graph creates cliques of size O(m^K), and the JLO algorithm runs in time O(N_h m^K).

As these examples suggest, the graphical modeling framework provides a useful framework for exploring extensions of HMMs. The examples also make clear, however, that the graphical algorithms are no panacea. The m^K complexity of HMM(K,1) will be prohibitive for large K. Also, the generalization of HMM(1,2) to HMM(1,K) (couplings of K chains) is intractable. Recent research has therefore focused on approximate algorithms for inference in such structures; see Saul and Jordan (1996) for HMM(K,1) and Ghahramani and Jordan (1996) and Williams and Hinton (1990) for HMM(1,K). These authors have developed an approximation methodology based on mean-field theory from statistical physics. While discussion of mean-field algorithms is beyond the scope of this article, it is worth noting that the graphical modeling framework plays a useful role in the development of these approximations. Essentially, the mean-field approach involves creating a simplified graph for which tractable algorithms are available and minimizing a probabilistic distance between the tractable graph and the intractable graph. The JLO algorithm is called as a subroutine on the tractable graph during the minimization process.
Figure 9: (a) The DPIN structure for HMM(1,2) with a single observable sequence coupled to a pair of underlying state sequences. (b) The moralization of this DPIN structure.
Figure 10: The UPIN structure for HMM(3,1).

9 Learning and PINs

Until now, we have assumed that the parameters and structure of a PIN are known with certainty. In this section, we drop this assumption and discuss methods for learning about the parameters and structure of a PIN.

The basic idea behind the techniques that we discuss is that there is a true joint probability distribution described by some PIN structure and parameters, but we are uncertain about this structure and its parameters. We
are unable to observe the true joint distribution directly, but we are able to observe a set of patterns u_1, \ldots, u_M that is a random sample from this true distribution. These patterns are independent and identically distributed (i.i.d.) according to the true distribution (note that in a typical HMM learning problem, each of the u_i consists of a sequence of observed data). We use these data to learn about the structure and parameters that encode the true distribution.

9.1 Parameter Estimation for PINs. First, let us consider the situation where we know the PIN structure S of the true distribution with certainty but are uncertain about the parameters of S. In keeping with the rest of the article, let us assume that all variables in U are discrete. Furthermore, for purposes of illustration, let us assume that S is an ADG.

Let x_i^k and pa(X_i)^j denote the kth value of variable X_i and the jth configuration of the variables pa(X_i) in S, respectively (j = 1, \ldots, q_i; k = 1, \ldots, r_i). As we have just discussed, we assume that each conditional probability p(x_i^k | pa(X_i)^j) is an uncertain parameter, and for convenience we represent this parameter as \theta_{ijk}. We use \theta_{ij} to denote the vector of parameters (\theta_{ij1}, \ldots, \theta_{ijr_i}) and \theta_s to denote the vector of all parameters for S. Note that \sum_{k=1}^{r_i} \theta_{ijk} = 1 for every i and j.

One method for learning about the parameters \theta_s is the Bayesian approach. We treat the parameters \theta_s as random variables, assign these parameters a prior distribution p(\theta_s | S), and update this prior distribution with data D = (u_1, \ldots, u_M) according to Bayes' rule:

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \, p(D | \theta_s, S) ,   (9.1)
where c is a normalization constant that does not depend on \theta_s. Because the patterns in D are a random sample, equation 9.1 simplifies to

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \prod_{l=1}^{M} p(u_l | \theta_s, S) .   (9.2)
Figure 11: A Bayesian network structure for a two-binary-variable domain {X1 , X2 } showing (a) conditional independencies associated with the random sample assumption and (b) the added assumption of parameter independence. In both parts of the figure, it is assumed that the network structure X1 → X2 is generating the patterns.
Given some prediction of interest that depends on \theta_s and S, say f(\theta_s, S), we can use the posterior distribution of \theta_s to compute an expected prediction:

E(f(\theta_s, S) | D, S) = \int f(\theta_s, S) \, p(\theta_s | D, S) \, d\theta_s .   (9.3)

Associated with our assumption that the data D are a random sample from structure S with uncertain parameters \theta_s is a set of conditional independence assertions. Not surprisingly, some of these assumptions can be represented as a (directed) PIN that includes both the possible observations and the parameters as variables. Figure 11a shows these assumptions for the case where U = \{X_1, X_2\} and S is the structure with a directed edge from X_1 to X_2.

Under certain additional assumptions, described, for example, in Spiegelhalter and Lauritzen (1990), the evaluation of equation 9.2 is straightfor-
ward. In particular, if each pattern u_l is complete (i.e., every variable is observed), we have

p(u_l | \theta_s, S) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\delta_{ijkl}} ,   (9.4)
where \delta_{ijkl} is equal to one if X_i = x_i^k and pa(X_i) = pa(X_i)^j in pattern u_l, and zero otherwise. Combining equations 9.2 and 9.4, we obtain

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} ,   (9.5)
where N_{ijk} is the number of patterns in which X_i = x_i^k and pa(X_i) = pa(X_i)^j. The N_{ijk} are the sufficient statistics for the random sample D. If we assume that the parameter vectors \theta_{ij}, i = 1, \ldots, N, j = 1, \ldots, q_i, are mutually independent, an assumption we call parameter independence, then we get the additional simplification

p(\theta_s | D, S) = c \prod_{i=1}^{N} \prod_{j=1}^{q_i} p(\theta_{ij} | S) \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} .   (9.6)
The assumption of parameter independence for our two-variable example is illustrated in Figure 11b. Thus, given complete data and parameter independence, each parameter vector \theta_{ij} can be updated independently. The update is particularly simple if each parameter vector has a conjugate distribution. For a discrete variable with discrete parents, the natural conjugate distribution is the Dirichlet,

p(\theta_{ij} | S) \propto \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1} ,
in which case equation 9.6 becomes

p(\theta_s | D, S) = c \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk} + \alpha_{ijk} - 1} .   (9.7)
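For a single parameter vector \theta_{ij}, equation 9.7 says that the posterior is again Dirichlet, now with exponents \alpha_{ijk} + N_{ijk}. A minimal sketch of this update, with illustrative counts and prior exponents:

```python
import numpy as np

alpha_ij = np.array([1.0, 1.0, 1.0])    # prior Dirichlet exponents alpha_ijk
N_ij     = np.array([12., 3., 5.])      # sufficient statistics N_ijk from D

posterior_exponents = alpha_ij + N_ij   # Dirichlet(alpha_ijk + N_ijk)
posterior_mean = posterior_exponents / posterior_exponents.sum()
# posterior_mean gives the predictive probabilities p(x_i^k | pa(X_i)^j, D, S)
```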
Other conjugate distributions include the normal-Wishart distribution for the parameters of gaussian codebooks and the Dirichlet distribution for the mixing coefficients of gaussian-mixture codebooks (DeGroot 1970; Buntine 1994; Heckerman and Geiger 1995). Heckerman and Geiger (1995) describe a simple method for assessing these priors. These priors have also been used for learning parameters in standard HMMs (e.g., Gauvain and Lee 1994).
Parameter independence is usually not assumed for HMM structures in general. For example, in the HMM(1,1) model, a standard assumption is that p(H_i | H_{i-1}) = p(H_j | H_{j-1}) and p(O_i | H_i) = p(O_j | H_j) for all appropriate i and j. Fortunately, parameter equalities such as these are easily handled in the framework above (see Thiesson 1995 for a detailed discussion).

In addition, the assumption that patterns are complete is clearly inappropriate for HMM structures in general, where some of the variables are hidden from observation. When data are missing, the exact evaluation of the posterior p(\theta_s | D, S) is typically intractable, so we turn to approximations. Accurate but slow approximations are based on Monte Carlo sampling (e.g., Neal 1993). An approximation that is less accurate but more efficient is based on the observation that, under certain conditions, the quantity p(\theta_s | S) \cdot p(D | \theta_s, S) converges to a multivariate gaussian distribution as the sample size increases (see, e.g., Kass et al. 1988; MacKay 1992a, 1992b). Still less accurate but more efficient approximations are based on the observation that the gaussian distribution converges to a delta function centered at the maximum a posteriori (MAP) and eventually the maximum likelihood (ML) value of \theta_s. For the standard HMM(1,1) model discussed in this article, where either discrete, gaussian, or gaussian-mixture codebooks are used, an ML or MAP estimate is a well-known efficient approximation (Poritz 1988; Rabiner 1989).

MAP and ML estimates can be found using traditional techniques such as gradient descent and expectation-maximization (EM) (Dempster et al. 1977). The EM algorithm can be applied efficiently whenever the likelihood function has sufficient statistics that are of fixed dimension for any data set. The EM algorithm finds a local maximum by initializing the parameters \theta_s (e.g., at random or by some clustering algorithm) and repeating E and M steps to convergence.

In the E step, we compute the expected sufficient statistic for each of the parameters, given D and the current values of \theta_s. In particular, if all variables are discrete, parameter independence is assumed to hold, and all priors are Dirichlet, we obtain

E(N_{ijk} | D, \theta_s, S) = \sum_{l=1}^{M} p(x_i^k, pa(X_i)^j | u_l, \theta_s, S) .
An important feature of the EM algorithm applied to PINs under these assumptions is that each term in the sum can be computed using the JLO algorithm. The JLO algorithm may also be used when some parameters are equal and when the likelihoods of some variables are gaussian or gaussian-mixture distributions (Lauritzen and Wermuth 1989). In the M step, we use the expected sufficient statistics as if they were actual sufficient statistics and set the new values of \theta_s to be the MAP or ML values given these statistics. Again, if all variables are discrete, parameter independence is assumed to
hold, and all priors are Dirichlet, the ML estimate is given by

\theta_{ijk} = \frac{E(N_{ijk} | D, \theta_s, S)}{\sum_{k'=1}^{r_i} E(N_{ijk'} | D, \theta_s, S)} ,

and the MAP estimate is given by

\theta_{ijk} = \frac{E(N_{ijk} | D, \theta_s, S) + \alpha_{ijk} - 1}{\sum_{k'=1}^{r_i} \left( E(N_{ijk'} | D, \theta_s, S) + \alpha_{ijk'} - 1 \right)} .
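The following is a minimal sketch of one EM iteration under these assumptions. The routine expected_counts, which stands in for the JLO-based computation of the terms p(x_i^k, pa(X_i)^j | u_l, \theta_s, S), is assumed rather than implemented, and the Dirichlet exponents are assumed to be at least 1 so that the MAP update is well defined.

```python
import numpy as np

def em_step(theta, data, alpha, expected_counts):
    """theta[i][j]: probability vector over the r_i values of X_i given the
    jth parent configuration; alpha[i][j]: the matching Dirichlet exponents;
    expected_counts(i, j, u, theta): vector of p(x_i^k, pa(X_i)^j | u, ...)."""
    new_theta = {}
    for i in theta:
        new_theta[i] = {}
        for j in theta[i]:
            # E step: E(N_ijk | D, theta_s, S), summed over the patterns u_l.
            EN = sum(expected_counts(i, j, u, theta) for u in data)
            # M step (MAP): normalize E(N_ijk) + alpha_ijk - 1 over k.
            num = EN + alpha[i][j] - 1.0
            new_theta[i][j] = num / num.sum()
    return new_theta
```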
9.2 Model Selection and Averaging for PINs. Now let us assume that we are uncertain not only about the parameters of a PIN but also about the true structure of a PIN. For example, we may know that the true structure is an HMM(K, J) structure, but we may be uncertain about the values of K and J. One solution to this problem is Bayesian model averaging. In this approach, we view each possible PIN structure (without its parameters) as a model. We assign prior probabilities p(S) to different models and compute their posterior probabilities given data:

p(S | D) \propto p(S) \, p(D | S) = p(S) \int p(D | \theta_s, S) \, p(\theta_s | S) \, d\theta_s .   (9.8)
As indicated in equation 9.8, we compute p(D | S) by averaging the likelihood of the data over the parameters of S. In addition to computing the posterior probabilities of models, we estimate the parameters of each model, either by computing the distribution p(\theta_s | D, S) or by using a gaussian, MAP, or ML approximation for this distribution. We then make a prediction of interest based on each model separately, as in equation 9.3, and compute the weighted average of these predictions using the posterior probabilities of models as weights.

One complication with this approach is that when data are missing (for example, when some variables are hidden), the exact computation of the integral in equation 9.8 is usually intractable. As discussed in the previous section, Monte Carlo and gaussian approximations may be used. One simple form of a gaussian approximation is the Bayesian information criterion (BIC) described by Schwarz (1978),

\log p(D | S) \approx \log p(D | \hat{\theta}_s, S) - \frac{d}{2} \log M ,

where \hat{\theta}_s is the ML estimate, M is the number of patterns in D, and d is the dimension of S (typically, the number of parameters of S). The first term of this "score" for S rewards how well the data fit S, whereas the second term punishes model complexity.
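A minimal sketch of this score; the maximized log-likelihood and the dimension d are assumed to be supplied by the caller, and the example numbers are invented.

```python
import numpy as np

def bic_score(loglik_at_ml, d, M):
    """log p(D | S) approximated by log p(D | theta_hat, S) - (d/2) log M."""
    return loglik_at_ml - 0.5 * d * np.log(M)

# Comparing two candidate structures on M = 1000 patterns: the structure
# with the higher BIC score is preferred.
score_simple  = bic_score(-4210.3, d=17, M=1000)
score_complex = bic_score(-4195.8, d=40, M=1000)
```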
Note that this score does not depend on the parameter prior and thus can be applied easily.³ For examples of applications of BIC in the context of PINs and other statistical models, see Raftery (1995). The BIC score is the additive inverse of Rissanen's (1987) minimum description length (MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are hypothesis testing (Raftery 1995) and cross validation (Dawid 1992b). Buntine (in press) provides a comprehensive review of scores for model selection and model averaging in the context of PINs.

³ One caveat: The BIC score is derived under the assumption that the parameter prior is positive throughout its domain.

Another complication with Bayesian model averaging is that there may be so many possible models that averaging becomes intractable. In this case, we select one or a handful of structures with high relative posterior probabilities and make our predictions with this limited set of models. This approach is called model selection. The trick here is finding a model or models with high posterior probabilities. Detailed discussions of search methods for model selection among PINs are given by, among others, Madigan and Raftery (1994), Heckerman et al. (1995), and Spirtes and Meek (1995). When the true model is some HMM(K, J) structure, we may have additional prior knowledge that strongly constrains the possible values of K and J. Here, exhaustive model search is likely to be practical.

10 Summary

Probabilistic independence networks provide a useful framework for both the analysis and application of multivariate probability models when there is considerable structure in the model in the form of conditional independence. The graphical modeling approach both clarifies the independence semantics of the model and yields efficient computational algorithms for probabilistic inference. This article has shown that it is useful to cast HMM structures in a graphical model framework. In particular, the well-known F-B and Viterbi algorithms were shown to be special cases of more general algorithms from the graphical modeling literature. Furthermore, more complex HMM structures, beyond the traditional first-order model, can be analyzed profitably and directly using generally applicable graphical modeling techniques.

Appendix A: The Forward-Backward Algorithm for HMM(1,1) Is a Special Case of the JLO Algorithm

Consider the junction tree for HMM(1,1) as shown in Figure 5b. Let the final clique in the chain containing (H_{N-1}, H_N) be the root clique. Thus, a
nonredundant schedule consists of first recursively passing flows from each (O_i, H_i) and (H_{i-2}, H_{i-1}) to each (H_{i-1}, H_i) in the appropriate sequence (the "collect" phase), and then distributing flows out in the reverse direction from the root clique. If we are interested only in calculating the likelihood of e given the model, then the distribute phase is not necessary, since we can simply marginalize over the local variables in the root clique to obtain p(e). (Subscripts on potential functions and update factors indicate which variables have been used in deriving that potential or update factor; e.g., f_{O_1} indicates that this potential has been updated based on information about O_1 but not using information about any other variables.)

Assume that the junction tree has been initialized so that the potential function for each clique and separator is the local marginal. Given the observed evidence e, each individual piece of evidence O_i = o_i^* is entered into its clique (O_i, H_i) such that each clique marginal becomes f_{O_i}^*(h_i, o_i) = p(h_i, o_i^*) after entering the evidence (as in equation 5.8).

Consider the portion of the junction tree in Figure 12, and in particular the flow between (O_i, H_i) and (H_{i-1}, H_i). By definition the potential on the separator H_i is updated to

f_{O_i}^*(h_i) = \sum_{o_i} f^*(h_i, o_i) = p(h_i, o_i^*) .   (A.1)
The update factor from this separator flowing into clique (H_{i-1}, H_i) is then

\lambda_{O_i}(h_i) = \frac{p(h_i, o_i^*)}{p(h_i)} = p(o_i^* | h_i) .   (A.2)
This update factor is "absorbed" into (H_{i-1}, H_i) as follows:

f_{O_i}^*(h_{i-1}, h_i) = p(h_{i-1}, h_i) \, \lambda_{O_i}(h_i) = p(h_{i-1}, h_i) \, p(o_i^* | h_i) .   (A.3)
Now consider the flow from clique (H_{i-2}, H_{i-1}) to clique (H_{i-1}, H_i). Let \Phi_{i,j} = \{O_i, \ldots, O_j\} denote a set of consecutive observable variables and \phi_{i,j}^* = \{o_i^*, \ldots, o_j^*\} denote a set of observed values for these variables, 1 \le i < j \le N. Assume that the potential on the separator H_{i-1} has been updated to

f_{\Phi_{1,i-1}}^*(h_{i-1}) = p(h_{i-1}, \phi_{1,i-1}^*)   (A.4)
by earlier flows in the schedule. Thus, the update factor on separator H_{i-1} becomes

\lambda_{\Phi_{1,i-1}}(h_{i-1}) = \frac{p(h_{i-1}, \phi_{1,i-1}^*)}{p(h_{i-1})} ,   (A.5)
Figure 12: Local message passing in the HMM(1,1) junction tree during the collect phase of a left-to-right schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate flows.
and this gets absorbed into clique (H_{i-1}, H_i) to produce

f_{\Phi_{1,i}}^*(h_{i-1}, h_i) = f_{O_i}^*(h_{i-1}, h_i) \, \lambda_{\Phi_{1,i-1}}(h_{i-1})
  = p(h_{i-1}, h_i) \, p(o_i^* | h_i) \, \frac{p(h_{i-1}, \phi_{1,i-1}^*)}{p(h_{i-1})}
  = p(o_i^* | h_i) \, p(h_i | h_{i-1}) \, p(h_{i-1}, \phi_{1,i-1}^*) .   (A.6)
Finally, we can calculate the new potential on the separator for the flow from clique (H_{i-1}, H_i) to (H_i, H_{i+1}):

f_{\Phi_{1,i}}^*(h_i) = \sum_{h_{i-1}} f_{\Phi_{1,i}}^*(h_{i-1}, h_i)   (A.7)
  = p(o_i^* | h_i) \sum_{h_{i-1}} p(h_i | h_{i-1}) \, p(h_{i-1}, \phi_{1,i-1}^*)   (A.8)
  = p(o_i^* | h_i) \sum_{h_{i-1}} p(h_i | h_{i-1}) \, f_{\Phi_{1,i-1}}^*(h_{i-1}) .   (A.9)
Proceeding recursively in this manner, one finally obtains at the root clique

f_{\Phi_{1,N}}^*(h_{N-1}, h_N) = p(h_{N-1}, h_N, \phi_{1,N}^*) ,   (A.10)

from which one can get the likelihood of the evidence,

p(e) = p(\phi_{1,N}^*) = \sum_{h_{N-1}, h_N} f_{\Phi_{1,N}}^*(h_{N-1}, h_N) .   (A.11)
We note that equation A.9 corresponds directly to the recursive equation (equation 20 in Rabiner 1989) for the \alpha variables used in the forward phase of the F-B algorithm, the standard HMM(1,1) inference algorithm. In particular, using a "left-to-right" schedule, the updated potential functions on the separators between the hidden cliques, the f_{\Phi_{1,i}}^*(h_i) functions, are exactly the \alpha variables. Thus, when applied to HMM(1,1), the JLO algorithm produces exactly the same local recursive calculations as the forward phase of the F-B algorithm.

One can also show an equivalence between the backward phase of the F-B algorithm and the JLO inference algorithm. Let the "leftmost" clique in the chain, (H_1, H_2), be the root clique, and define a schedule such that the flows go from right to left. Figure 13 shows a local portion of the clique tree and the associated flows. Consider that the potential on clique (H_i, H_{i+1}) has already been updated by earlier flows from the right. Thus, by definition,

f_{\Phi_{i+1,N}}^*(h_i, h_{i+1}) = p(h_i, h_{i+1}, \phi_{i+1,N}^*) .   (A.12)
The potential on the separator between (H_i, H_{i+1}) and (H_{i-1}, H_i) is calculated as

f_{\Phi_{i+1,N}}^*(h_i) = \sum_{h_{i+1}} p(h_i, h_{i+1}, \phi_{i+1,N}^*)   (A.13)
  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, p(\phi_{i+2,N}^* | h_{i+1})   (A.14)

(by virtue of the various conditional independence relations in HMM(1,1))

  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{p(\phi_{i+2,N}^*, h_{i+1})}{p(h_{i+1})}   (A.15)
  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{f_{\Phi_{i+2,N}}^*(h_{i+1})}{p(h_{i+1})} .   (A.16)
Defining the update factor on this separator yields

\lambda_{\Phi_{i+1,N}}^*(h_i) = \frac{f_{\Phi_{i+1,N}}^*(h_i)}{p(h_i)}   (A.17)
  = \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{f_{\Phi_{i+2,N}}^*(h_{i+1})}{p(h_{i+1})}   (A.18)
  = \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \lambda_{\Phi_{i+2,N}}^*(h_{i+1}) .   (A.19)
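A companion sketch of the recursion of equation A.19, reusing the illustrative arrays A and B from the forward sketch above; the boundary factor at the right end of the chain is all ones.

```python
import numpy as np

def backward_factors(A, B):
    """Return the update factors lambda_{Phi_{i+1,N}}(h_i) as a list indexed
    from 0, so lams[N-1] is the all-ones boundary term."""
    N = B.shape[0]
    lams = [None] * N
    lams[N - 1] = np.ones(A.shape[0])
    for i in range(N - 2, -1, -1):
        lams[i] = A @ (B[i + 1] * lams[i + 1])   # equation A.19
    return lams

# As a check, p(e) is recoverable as ((pi * B[0]) * backward_factors(A, B)[0]).sum()
```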
Figure 13: Local message passing in the HMM(1,1) junction tree during the collect phase of a right-to-left schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate flows.
This set of recursive equations in \lambda corresponds exactly to the recursive equation (equation 25 in Rabiner 1989) for the \beta variables in the backward phase of the F-B algorithm. In fact, the update factors \lambda on the separators are exactly the \beta variables. Thus, we have shown that the JLO inference algorithm recreates the F-B algorithm for the special case of the HMM(1,1) probability model.

Appendix B: The Viterbi Algorithm for HMM(1,1) Is a Special Case of Dawid's Algorithm

As with the inference problem, let the final clique in the chain containing (H_{N-1}, H_N) be the root clique and use the same schedule: first a left-to-right collection phase into the root clique, followed by a right-to-left distribution phase out from the root clique. Again it is assumed that the junction tree has been initialized so that the potential functions are the local marginals, and the observable evidence e has been entered into the cliques in the same manner as described for the inference algorithm. We refer again to Figure 12.

The sequence of flow and absorption operations is identical to that of the inference algorithm, with the exception that marginalization operations are replaced by maximization. Thus, the potential on the separator between (O_i, H_i) and (H_{i-1}, H_i) is initially updated to

\hat{f}_{O_i}(h_i) = \max_{o_i} p(h_i, o_i) = p(h_i, o_i^*) .   (B.1)
The update factor for this separator is

\lambda_{O_i}(h_i) = \frac{p(h_i, o_i^*)}{p(h_i)} = p(o_i^* | h_i) ,   (B.2)
and after absorption into the clique (H_{i-1}, H_i) one gets

\hat{f}_{O_i}(h_{i-1}, h_i) = p(h_{i-1}, h_i) \, p(o_i^* | h_i) .   (B.3)
Now consider the flow from clique (H_{i-2}, H_{i-1}) to (H_{i-1}, H_i). Let H_{i,j} = \{H_i, \ldots, H_j\} denote a set of consecutive hidden variables and h_{i,j}^* = \{h_i^*, \ldots, h_j^*\} denote a set of values for these variables, 1 \le i < j \le N. Assume that the potential on separator H_{i-1} has been updated to

\hat{f}_{\Phi_{1,i-1}}(h_{i-1}) = \max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)   (B.4)
by earlier flows in the schedule. Thus, the update factor for separator H_{i-1} becomes

\lambda_{\Phi_{1,i-1}}(h_{i-1}) = \frac{\max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)}{p(h_{i-1})} ,   (B.5)
and this gets absorbed into clique (H_{i-1}, H_i) to produce

\hat{f}_{\Phi_{1,i}}(h_{i-1}, h_i) = \hat{f}_{O_i}(h_{i-1}, h_i) \, \lambda_{\Phi_{1,i-1}}(h_{i-1})   (B.6)
  = p(h_{i-1}, h_i) \, p(o_i^* | h_i) \, \frac{\max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)}{p(h_{i-1})} .   (B.7)

We can now obtain the new potential on the separator for the flow from clique (H_{i-1}, H_i) to (H_i, H_{i+1}):

\hat{f}_{\Phi_{1,i}}(h_i) = \max_{h_{i-1}} \hat{f}_{\Phi_{1,i}}(h_{i-1}, h_i)   (B.8)
  = p(o_i^* | h_i) \max_{h_{i-1}} \{ p(h_i | h_{i-1}) \max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*) \}   (B.9)
  = p(o_i^* | h_i) \max_{h_{1,i-1}} \{ p(h_i | h_{i-1}) \, p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*) \}   (B.10)
  = \max_{h_{1,i-1}} p(h_i, h_{1,i-1}, \phi_{1,i}^*) ,   (B.11)
which is the result one expects for the updated potential at this clique. Thus, we can express the separator potential \hat{f}_{\Phi_{1,i}}(h_i) recursively (via equation B.10) as

\hat{f}_{\Phi_{1,i}}(h_i) = p(o_i^* | h_i) \max_{h_{i-1}} \{ p(h_i | h_{i-1}) \, \hat{f}_{\Phi_{1,i-1}}(h_{i-1}) \} .   (B.12)
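Combining the recursion of equation B.12 with the backtracking pointers discussed below yields the familiar Viterbi procedure; a minimal sketch, reusing the illustrative arrays pi, A, and B from the Appendix A sketches:

```python
import numpy as np

def viterbi(pi, A, B):
    N, m = B.shape
    delta = pi * B[0]                    # potential on the first separator
    back = np.zeros((N, m), dtype=int)   # pointers back to the previous clique
    for i in range(1, N):
        scores = delta[:, None] * A      # scores[h_{i-1}, h_i]
        back[i] = scores.argmax(axis=0)
        delta = B[i] * scores.max(axis=0)   # equation B.12
    # Backtrack from the maximizing final state (equations B.14 and B.15).
    h = [int(delta.argmax())]
    for i in range(N - 1, 0, -1):
        h.append(int(back[i][h[-1]]))
    return delta.max(), h[::-1]          # f-hat(e) and the MAP state sequence
```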
This is the same recursive equation as that used for the \delta variables in the Viterbi algorithm (equation 33a in Rabiner 1989): the separator potentials in Dawid's algorithm using a left-to-right schedule are exactly the \delta's used in the Viterbi method for solving the MAP problem in HMM(1,1). Proceeding recursively in this manner, one finally obtains at the root clique

\hat{f}_{\Phi_{1,N}}(h_{N-1}, h_N) = \max_{h_{1,N-2}} p(h_{N-1}, h_N, h_{1,N-2}, \phi_{1,N}^*) ,   (B.13)
from which one can get the likelihood of the evidence given the most likely state of the hidden variables:

\hat{f}(e) = \max_{h_{N-1}, h_N} \hat{f}_{\Phi_{1,N}}(h_{N-1}, h_N)   (B.14)
  = \max_{h_{1,N}} p(h_{1,N}, \phi_{1,N}^*) .   (B.15)
Identification of the values of the hidden variables that maximize the evidence likelihood can be carried out in the standard manner as in the Viterbi method, namely, by keeping a pointer at each clique along the flow in the forward direction back to the previous clique and then backtracking along this list of pointers from the root clique after the collection phase is complete. An alternative approach is to use the distribute phase of Dawid's algorithm. This has the same effect: once the distribution flows are completed, each local clique can calculate both the maximum value of the evidence likelihood given the hidden variables and the values of the hidden variables achieving this maximum that are local to that particular clique.

Acknowledgments

MIJ gratefully acknowledges discussions with Steffen Lauritzen on the application of the IPF algorithm to UPINs. The research described in this article was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

References

Baum, L. E., and Petrie, T. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37, 1554–1563.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. 1973. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
Buntine, W. 1994. Operations for learning with graphical models. Journal of Artificial Intelligence Research 2, 159–225.
Buntine, W. In press. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering.
Dawid, A. P. 1992a. Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing 2, 25–36.
Dawid, A. P. 1992b. Prequential analysis, stochastic complexity, and Bayesian inference (with discussion). In Bayesian Statistics 4, J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, eds., pp. 109–125. Oxford University Press, London.
DeGroot, M. 1970. Optimal Statistical Decisions. McGraw-Hill, New York.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Elliott, R. J., Aggoun, L., and Moore, J. B. 1995. Hidden Markov Models: Estimation and Control. Springer-Verlag, New York.
Frasconi, P., and Bengio, Y. 1994. An EM approach to grammatical inference: Input/output HMMs. In Proceedings of the 12th IAPR Intl. Conf. on Pattern Recognition, pp. 289–294. IEEE Computer Society Press, Los Alamitos, CA.
Gauvain, J., and Lee, C. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Sig. Audio Proc. 2, 291–298.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. 6, 721–741.
Ghahramani, Z., and Jordan, M. I. 1996. Factorial hidden Markov models. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., pp. 472–478. MIT Press, Cambridge, MA.
Heckerman, D., and Geiger, D. 1995. Likelihoods and Priors for Bayesian Networks. MSR-TR-95-54. Microsoft Corporation, Redmond, WA.
Heckerman, D., Geiger, D., and Chickering, D. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197–243.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., vol. 1, chap. 7. MIT Press, Cambridge, MA.
Huang, X. D., Ariki, Y., and Jack, M. A. 1990. Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh.
Isham, V. 1981. An introduction to spatial point processes and Markov random fields. International Statistical Review 49, 21–43.
Itzykson, C., and Drouffé, J.-M. 1991. Statistical Field Theory. Cambridge University Press, Cambridge.
Jensen, F. V., Lauritzen, S. L., and Olesen, K. G. 1990. Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly 4, 269–282.
Jiřousek, R., and Přeučil, S. 1995. On the effective implementation of the iterative proportional fitting procedure. Computational Statistics and Data Analysis 19, 177–189.
Kass, R., Tierney, L., and Kadane, J. 1988. Asymptotics in Bayesian computation. In Bayesian Statistics 3, J. Bernardo, M. DeGroot, D. Lindley, and A. Smith, eds., pp. 261–278. Oxford University Press, Oxford.
Kent, R. D., and Minifie, F. D. 1977. Coarticulation in recent speech production models. Journal of Phonetics 5, 115–117.
Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Roy. Statist. Soc. Ser. B 50, 157–224.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H. G. 1990. Independence properties of directed Markov fields. Networks 20, 491–505.
Lauritzen, S., and Wermuth, N. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics 17, 31–57.
Lindblom, B. 1990. Explaining phonetic variation: A sketch of the H&H theory. In Speech Production and Speech Modeling, W. J. Hardcastle and A. Marchal, eds., pp. 403–440. Kluwer, Dordrecht.
Lucke, H. 1995. Bayesian belief networks as a tool for stochastic parsing. Speech Communication 16, 89–118.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4, 415–447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472.
Madigan, D., and Raftery, A. E. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Am. Stat. Assoc. 89, 1535–1546.
Modestino, J., and Zhang, J. 1992. A Markov random field model-based approach to image segmentation. IEEE Trans. Patt. Anal. Mach. Int. 14(6), 606–615.
Morgenstern, I., and Binder, K. 1983. Magnetic correlations in two-dimensional spin-glasses. Physical Review B 28, 5216.
Neal, R. 1993. Probabilistic inference using Markov chain Monte Carlo methods. CRG-TR-93-1. Department of Computer Science, University of Toronto.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pearl, J., Geiger, D., and Verma, T. 1990. The logic of influence diagrams. In Influence Diagrams, Belief Nets, and Decision Analysis, R. M. Oliver and J. Q. Smith, eds., pp. 67–83. John Wiley, Chichester, UK.
Perkell, J. S., Matthies, M. L., Svirsky, M. A., and Jordan, M. I. 1993. Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: A pilot motor equivalence study. Journal of the Acoustical Society of America 93, 2948–2961.
Poritz, A. M. 1988. Hidden Markov models: A guided tour. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1:7–13. IEEE Press, New York.
Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–285.
Raftery, A. 1995. Bayesian model selection in social research (with discussion). In Sociological Methodology, P. Marsden, ed., pp. 111–196. Blackwell, Cambridge, MA.
Rissanen, J. 1987. Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B 49, 223–239, 253–265.
Saul, L. K., and Jordan, M. I. 1995. Boltzmann chains and hidden Markov models. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 435–442. MIT Press, Cambridge, MA.
Saul, L. K., and Jordan, M. I. 1996. Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., pp. 486–492. MIT Press, Cambridge, MA.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Shachter, R. D., Anderson, S. K., and Szolovits, P. 1994. Global conditioning for probabilistic inference in belief networks. In Proceedings of the Uncertainty in AI Conference 1994, pp. 514–522. Morgan Kaufmann, San Mateo, CA.
Spiegelhalter, D. J., Dawid, A. P., Hutchinson, T. A., and Cowell, R. G. 1991. Probabilistic expert systems and graphical modelling: A case study in drug safety. Phil. Trans. R. Soc. Lond. A 337, 387–405.
Spiegelhalter, D. J., and Lauritzen, S. L. 1990. Sequential updating of conditional probabilities on directed graphical structures. Networks 20, 579–605.
Spirtes, P., and Meek, C. 1995. Learning Bayesian networks with discrete variables from data. In Proceedings of First International Conference on Knowledge Discovery and Data Mining, pp. 294–299. AAAI Press, Menlo Park, CA.
Stolorz, P. 1994. Recursive approaches to the statistical physics of lattice proteins. In Proc. 27th Hawaii Intl. Conf. on System Sciences, L. Hunter, ed., 5:316–325.
Swendsen, R. H., and Wang, J.-S. 1987. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters 58.
Thiesson, B. 1995. Score and information for recursive exponential models with incomplete data. Tech. rep. Institute of Electronic Systems, Aalborg University, Aalborg, Denmark.
Vandermeulen, D., Verbeeck, R., Berben, L., Delaere, D., Suetens, P., and Marchal, G. 1994. Continuous voxel classification by stochastic relaxation: Theory and application to MR imaging and MR angiography. Image and Vision Computing 12(9), 559–572.
Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. John Wiley, Chichester, UK.
Williams, C., and Hinton, G. E. 1990. Mean field networks that learn to discriminate temporally distorted strings. In Proc. Connectionist Models Summer School, pp. 18–22. Morgan Kaufmann, San Mateo, CA.
Received February 2, 1996; accepted April 22, 1996.
NOTE
Communicated by Andrew Barto and Michael Jordan
Using Expectation-Maximization for Reinforcement Learning

Peter Dayan
Department of Brain and Cognitive Sciences, Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Geoffrey E. Hinton Department of Computer Science, University of Toronto, Toronto M5S 1A4, Canada
We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).

1 Introduction

Consider a stochastic learning automaton (e.g., Narendra & Thatachar 1989) whose actions y are drawn from a set Y. This could, for instance, be the set of 2^n choices over n separate binary decisions. If the automaton maintains a probability distribution p(y | \theta) over these possible actions, where \theta is a set of parameters, then its task is to learn values of the parameters \theta that maximize the expected payoff:

\rho(\theta) = \langle E[r | y] \rangle_{p(y | \theta)} = \sum_{y \in Y} p(y | \theta) \, E[r | y] .   (1.1)
Here, E[r | y] is the expected reward for performing action y. Apart from random search, simulated annealing, and genetic algorithms, almost all the methods with which we are familiar for attempting to choose appropriate \theta in such domains are local in the sense that they make small steps, usually in a direction that bears some relation to the gradient. Examples include the Kiefer-Wolfowitz procedure (Wasan 1969), the A_{R-P} algorithm (Barto & Anandan 1985), and the REINFORCE framework (Williams 1992).

There are two reasons to make small steps. One is that there is a noisy estimation problem. Typically the automaton will emit single actions y_1, y_2, \ldots according to p(y | \theta), will receive single samples of the reward from the distributions p(r | y_m), and will have to average the gradient over these noisy values. The other reason is that \theta might have a complicated effect on which actions are chosen, or the relationship between actions and rewards
might be obscure. For instance, if Y is the set of 2^n choices over n separate binary actions a_1, \ldots, a_n, and \theta = \{p_1, \ldots, p_n\} is the collection of probabilities of choosing a_i = 1, so

p(y = \{a_1, \ldots, a_n\} | \theta) = \prod_{i=1}^{n} p_i^{a_i} (1 - p_i)^{1 - a_i} ,   (1.2)
then the average reward \rho(\theta) depends in a complicated manner on the collection of p_i. Gradient ascent in \theta would seem the only option, because taking large steps might lead to decreases in the average reward.

The sampling problem is indeed present, and we will evade it by assuming a large batch size. However, we show that in circumstances such as the n binary action task, it is possible to make large, well-founded changes to the parameters without explicitly estimating the curvature of the space of expected payoffs, by a mapping onto a maximum likelihood probability density estimation problem. In effect, we maximize reward by solving a sequence of probability matching problems, where \theta is chosen at each step to match as best it can a fictitious distribution determined by the average rewards experienced on the previous step. Although there can be large changes in \theta from one step to the next, we are guaranteed that the average reward is monotonically increasing. The guarantee comes for exactly the same reason as in the expectation-maximization (EM) algorithm (Baum et al. 1970; Dempster et al. 1977) and, as with EM, there can be local optima. The relative payoff procedure (RPP) (Hinton 1989) is a particular reinforcement learning algorithm for the n binary action task with positive r, which makes large moves in the p_i. Our proof demonstrates that the RPP is well founded.

2 Theory

The RPP operates to improve the parameters p_1, \ldots, p_n of equation 1.2 in a synchronous manner based on substantial sampling. It suggests updating the probability of choosing a_i = 1 to

p_i' = \frac{\langle a_i E[r | y] \rangle_{p(y | \theta)}}{\langle E[r | y] \rangle_{p(y | \theta)}} ,   (2.1)
which is the ratio of the mean reward that accrues when a_i = 1 to the net mean reward. If all the rewards r are positive, then 0 \le p_i' \le 1. This note proves that when using the RPP, the expected reinforcement increases; that is,

\langle E[r | y] \rangle_{p(y | \theta')} \ge \langle E[r | y] \rangle_{p(y | \theta)} , \quad where \; \theta' = \{p_1', \ldots, p_n'\} .
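A minimal sketch of one RPP update with exact expectations, which is feasible when n is small enough that Y can be enumerated; the reward function below is an illustrative stand-in for E[r | y] and is assumed positive.

```python
import itertools
import numpy as np

def rpp_update(p, expected_reward):
    """p[i] is the probability of choosing a_i = 1; expected_reward(y) > 0."""
    n = len(p)
    num, rho = np.zeros(n), 0.0
    for bits in itertools.product([0, 1], repeat=n):
        y = np.array(bits)
        prob = np.prod(np.where(y == 1, p, 1 - p))   # equation 1.2
        r = prob * expected_reward(y)
        rho += r                       # <E[r | y]>_{p(y | theta)}
        num += y * r                   # <a_i E[r | y]>_{p(y | theta)}
    return num / rho                   # equation 2.1

# Each call provably does not decrease the expected payoff rho(theta).
p = np.array([0.3, 0.7])
p = rpp_update(p, lambda y: 1.0 + y.sum())   # toy positive rewards
```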
The proof rests on the following observation: Given a current value of \theta, if one could arrange that

\alpha E[r | y] = \frac{p(y | \theta')}{p(y | \theta)}   (2.2)
for some \alpha, then \theta' would lead to higher average returns than \theta. We prove this formally below, but the intuition is that if E[r | y_1] > E[r | y_2], then

\frac{p(y_1 | \theta')}{p(y_2 | \theta')} > \frac{p(y_1 | \theta)}{p(y_2 | \theta)} ,
so \theta' will put more weight on y_1 than \theta does. We therefore pick \theta' so that p(y | \theta') matches the distribution \alpha E[r | y] p(y | \theta) as closely as possible (using a Kullback-Leibler penalty). Note that this target distribution moves with \theta. Matching just \beta E[r | y], something that animals can be observed to do under some circumstances (Gallistel 1990), does not result in maximizing average rewards (Sabes & Jordan 1995). If the rewards r are stochastic, then our method (and the RPP) does not eliminate the need for repeated sampling to work out the mean return. We assume knowledge of E[r | y]. Defining the distribution in equation 2.2 correctly requires E[r | y] > 0. Since maximizing \rho(\theta) and \rho(\theta) + \omega has the same consequences for \theta, we can add arbitrary constants to the rewards so that they are all positive. However, this can affect the rate of convergence.

We now show how an improvement in the expected reinforcement can be guaranteed:

\log \frac{\rho(\theta')}{\rho(\theta)} = \log \sum_{y \in Y} p(y | \theta') \frac{E[r | y]}{\rho(\theta)}
  = \log \sum_{y \in Y} \left[ \frac{p(y | \theta) E[r | y]}{\rho(\theta)} \right] \frac{p(y | \theta')}{p(y | \theta)}   (2.3)
  \ge \sum_{y \in Y} \left[ \frac{p(y | \theta) E[r | y]}{\rho(\theta)} \right] \log \frac{p(y | \theta')}{p(y | \theta)} , \quad by Jensen's inequality
  = \frac{1}{\rho(\theta)} \left[ Q(\theta, \theta') - Q(\theta, \theta) \right] ,   (2.4)

where

Q(\theta, \theta') = \sum_{y \in Y} p(y | \theta) \, E[r | y] \log p(y | \theta') ,
so if Q(\theta, \theta') \ge Q(\theta, \theta), then \rho(\theta') \ge \rho(\theta). The normalization step in equation 2.3 creates the matching distribution from equation 2.2. Given \theta, if \theta' is chosen to maximize Q(\theta, \theta'), then we are guaranteed that Q(\theta, \theta') \ge Q(\theta, \theta) and therefore that the average reward is nondecreasing.
In the RPP, the new probability p_i' for choosing a_i = 1 is given by

p_i' = \frac{\langle a_i E[r | y] \rangle_{p(y | \theta)}}{\langle E[r | y] \rangle_{p(y | \theta)}} ,   (2.5)
so it is the fraction of the average reward that arrives when a_i = 1. Using equation 1.2,

\frac{\partial Q(\theta, \theta')}{\partial p_i'} = \frac{1}{p_i'(1 - p_i')} \left[ \sum_{y \in Y : a_i = 1} p(y | \theta) E[r | y] - p_i' \sum_{y \in Y} p(y | \theta) E[r | y] \right] .

So, if

p_i' = \frac{\sum_{y \in Y : a_i = 1} p(y | \theta) E[r | y]}{\sum_{y \in Y} p(y | \theta) E[r | y]} ,

then

\frac{\partial Q(\theta, \theta')}{\partial p_i'} = 0 ,
and it is readily seen that Q(\theta, \theta') is maximized. But this condition is just that of equation 2.5. Therefore the RPP is monotonic in the average return.

Figure 1 shows the consequence of employing the RPP. Figure 1a shows the case in which n = 2; the two lines and associated points show how p_1 and p_2 change on successive steps using the RPP. The terminal value p_1 = p_2 = 0, reached by the left-hand line, is a local optimum. Note that the RPP always changes the parameters in the direction of the gradient of the expected amount of reinforcement (this is generally true) but by a variable amount. Figure 1b compares the RPP with (a deterministic version of) Williams's (1992) stochastic gradient ascent REINFORCE algorithm for a case with n = 12 and rewards E[r | y] drawn from an exponential distribution. The RPP and REINFORCE were started at the same point; the graph shows the difference between the maximum possible reward and the expected reward after given numbers of iterations. To make a fair comparison between the two algorithms, we chose n small enough that the exact averages in equation 2.1 and (for REINFORCE) the exact gradients,

\Delta p_i = \alpha \frac{\partial}{\partial p_i} \langle E[r | y] \rangle_{p(y | \theta)} ,

could be calculated. Figure 1b shows the (consequently smooth) course of learning for various values of the learning rate. We observe that for both algorithms the expected return never decreases (as guaranteed for the RPP but not for REINFORCE), that the course of learning is not completely smooth, with a large plateau in the middle, and that both algorithms get stuck in
local optima. This is a best case for the RPP: only for a small learning rate, and consequently slow learning, does REINFORCE not get stuck in a worse local optimum. In other cases, there are values of \alpha for which REINFORCE beats the RPP. However, there are no free parameters in the RPP, and it performs well across a variety of such problems.

3 Discussion

The analogy to EM can be made quite precise. EM is a maximum likelihood method for probability density estimation for a collection X of observed data where underlying each point x \in X there can be a hidden variable y \in Y. The density has the form

p(x | \theta) = \sum_{y \in Y} p(y | \theta) \, p(x | y, \theta) ,

and we seek to choose \theta to maximize

\sum_{x \in X} \log[p(x | \theta)] .

The E phase of EM calculates the posterior responsibilities p(y | x, \theta) for each y \in Y for each x:

p(y | x, \theta) = \frac{p(y | \theta) \, p(x | y, \theta)}{\sum_{z \in Y} p(z | \theta) \, p(x | z, \theta)} .

In our case, there is no x, but the equivalent of this posterior distribution, which comes from equation 2.2, is

P_y(\theta) \equiv \frac{p(y | \theta) \, E[r | y]}{\sum_{z \in Y} p(z | \theta) \, E[r | z]} .

The M phase of EM chooses parameters \theta' in the light of this posterior distribution to maximize

\sum_{x \in X} \sum_{y \in Y} p(y | x, \theta) \log[p(x, y | \theta')] .

In our case this is exactly equivalent to minimizing the Kullback-Leibler divergence

KL[P_y(\theta), p(y | \theta')] = - \sum_{y \in Y} P_y(\theta) \log \left[ \frac{p(y | \theta')}{P_y(\theta)} \right] .

Up to some terms that do not affect \theta', this is -Q(\theta, \theta'). The Kullback-Leibler divergence between two distributions is a measure of the distance between them, and therefore minimizing it is a form of probability matching.
Figure 1: Performance of the RPP. (a) Adaptation of p1 and p2 using the RPP from two different start points on the given ρ(p1 , p2 ). The points are successive values; the lines are joined for graphical convenience. (b) Comparison of the RPP with Williams’s (1992) REINFORCE for a particular problem with n = 12. See text for comment and details.
Our result is weak: the requirement for sampling from the distribution is rather restrictive, and we have not proved anything about the actual rate of convergence. The algorithm performs best (Sutton, personal communication) if the differences between the rewards are of the same order of magnitude as the rewards themselves (as a result of the normalization in equation 2.3). It uses multiplicative comparison rather than the subtractive comparison of Sutton (1984), Williams (1992), and Dayan (1990).

The link to the EM algorithm suggests that there may be reinforcement learning algorithms other than the RPP that make large changes to the values of the parameters for which similar guarantees about nondecreasing average rewards can be given. The most interesting extension would be to dynamic programming (Bellman 1957), where techniques for choosing (single-component) actions to optimize return in sequential decision tasks include two algorithms that make large changes on each step: value and policy iteration (Howard 1960). As various people have noted, the latter explicitly involves both estimation (of the value of a policy) and maximization (choosing a new policy in the light of the value of each state under the old one), although its theory is not at all described in terms of density modeling.

Acknowledgments

Support came from the Natural Sciences and Engineering Research Council and the Canadian Institute for Advanced Research (CIAR). GEH is the Nesbitt-Burns Fellow of the CIAR. We are grateful to Philip Sabes and Mike Jordan for the spur and to Mike Jordan and the referees for comments on an earlier version of this paper.

References

Barto, A. G., & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360–374.
Baum, L. E., Petrie, E., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41, 164–171.
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Dayan, P. (1990). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 45–51). San Mateo, CA: Morgan Kaufmann.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.
Narendra, K. S., & Thathachar, M. A. L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice-Hall.
Sabes, P. N., & Jordan, M. I. (1995). Reinforcement learning by probability matching. In Advances in Neural Information Processing Systems, 8. Cambridge, MA: MIT Press.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts, Amherst.
Wasan, M. T. (1969). Stochastic approximation. Cambridge: Cambridge University Press.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Received October 2, 1995; accepted May 30, 1996.
Fast Sigmoidal Networks via Spiking Neurons
Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, Graz, Austria
We show that networks of relatively realistic mathematical models for biological neurons can in principle simulate arbitrary feedforward sigmoidal neural nets in a way that has previously not been considered. This new approach is based on temporal coding by single spikes (respectively, by the timing of synchronous firing in pools of neurons) rather than on the traditional interpretation of analog variables in terms of firing rates. The resulting new simulation is substantially faster and hence more consistent with experimental results about the maximal speed of information processing in cortical neural systems. As a consequence we can show that networks of noisy spiking neurons are "universal approximators" in the sense that they can approximate with regard to temporal coding any given continuous function of several variables. This result holds for a fairly large class of schemes for coding analog variables by firing times of spiking neurons. This new proposal for the possible organization of computations in networks of spiking neurons has some interesting consequences for the type of learning rules that would be needed to explain the self-organization of such networks. Finally, the fast and noise-robust implementation of sigmoidal neural nets by temporal coding points to possible new ways of implementing feedforward and recurrent sigmoidal neural nets with pulse stream VLSI.
1 Introduction

Sigmoidal neural nets are the most powerful and flexible computational model known today. In addition they have the advantage of allowing "self-organization" via a variety of quite successful learning algorithms. Unfortunately the computational units of sigmoidal neural nets differ strongly from biological neurons, and it is particularly dubious whether sigmoidal neural nets provide a useful paradigm for the organization of fast computations in cortical neural systems. Traditionally one views the firing rate of a neuron as the representation of an analog variable in analog computations with spiking neurons, in particular in the simulation of sigmoidal neural nets by spiking neurons.

Neural Computation 9, 279–304 (1997)
© 1997 Massachusetts Institute of Technology
However, with regard to fast cortical computations, this view is inconsistent with experimental data. Perrett et al. (1982) and Thorpe and Imbert (1989) have demonstrated that visual pattern analysis and pattern classification can be carried out by humans in just 100 ms, in spite of the fact that it involves a minimum of 10 synaptic stages from the retina to the temporal lobe. The same speed of visual processing has been measured by Rolls and others in macaque monkeys. Furthermore they have shown that a single cortical area involved in visual processing can complete its computation in just 20 to 30 ms (Rolls 1994; Rolls and Tovee 1994). On the other hand, the firing rates of neurons involved in these computations are usually below 100 Hz, and hence at least 20 to 30 ms would be needed just to sample the current firing rate of a neuron. Thus a coding of analog variables by firing rates is quite dubious in the context of fast cortical computations. Experimental evidence accumulated during the past few years indicates that many biological neural systems use the timing of single action potentials (or "spikes") to encode information (Abeles et al. 1993; Bialek and Rieke 1992; Bair et al. 1994; …
function (RBF) units, where the weights of the RBF units are stored in the delays between synapses and soma of a spiking neuron. Hence Hopfield's construction provides an efficient way of implementing a lookup table with spiking neurons (with some very nice invariance regarding the strength of the stimulus). However, in contrast to the construction considered here, his system is based on "grandmother neurons," and it is not geared toward providing an informative output in a situation where the input (s_1, ..., s_n) does not match (up to a factor) one of the fixed set of stored patterns (because, for example, it is a superposition of several stored patterns). Furthermore Hopfield's construction provides no method for simulating multilayer neural nets. In addition, in contrast to our construction, it assigns no computational or learning-related role to the efficacy (i.e., strength) of synapses between biological neurons.

We describe in the remainder of this section the precise models for sigmoidal neural nets and noisy spiking neurons that we consider, and at the end of this section we describe the key mechanism of our simulation. The main construction of this article is given in Section 2, and our main result is stated in the theorem at the end of that section. In Section 3 we show that this result implies that networks of noisy spiking neurons are universal approximators. We also prove that this result holds for a fairly large class of schemes for temporal coding of analog variables. In Section 4 we briefly indicate some new perspectives about the organization of learning in biological neural systems that follow from this approach.

We point out that this is not an article about biology but about computational complexity theory. Its main results (given in Sections 2 and 3) are rigorous theoretical results about the computational power of common mathematical models for networks of spiking neurons. However, some informal comments have been added (after the theorem in Section 2, as well as in Sections 4 and 5) in order to facilitate a discussion of the biological relevance of this mathematical model and its theoretical consequences.

The computational unit of a sigmoidal neural net is a sigmoidal gate (σ-gate) G that assigns to analog input numbers x_1, ..., x_{n−1} ∈ [0, γ] an output of the form σ(Σ_{i=1}^{n−1} r_i · x_i + r_n). The function σ: R → [0, γ] is called the activation function of G; r_1, ..., r_{n−1} are the weights of G, and r_n is the bias of G. These are considered adjustable parameters of G in the context of a learning process. The parameter γ > 0 determines the scale of the analog computations carried out by the neural net. For convenience we assume that each σ-gate G has an additional input x_n with some constant value c ∈ (0, γ] available. Hence after rescaling r_n, the function f_G that is computed by G can be viewed as a restriction of the function
to arguments with x_n = c. The original choice for the activation function σ in Rumelhart et al. (1986) was the logistic sigmoid function σ(y) = 1/(1 + e^{−y}). Many years of practical experience with sigmoidal neural nets have shown that the exact form of the activation function σ is not relevant for the computational power and learning capabilities of such neural nets, as long as σ is nondecreasing and almost everywhere differentiable, the limits lim_{y→−∞} σ(y) and lim_{y→∞} σ(y) have finite values, and σ increases approximately linearly in some intermediate range. Gradient descent learning procedures such as backpropagation formally require that σ is differentiable everywhere, but practically one can just as well use the piecewise linear "linear-saturated" activation function π_γ: R → [0, γ] defined by

π_γ(y) = 0 if y < 0,   π_γ(y) = y if 0 ≤ y ≤ γ,   π_γ(y) = γ if y > γ.
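A one-line sketch of π_γ, and of a gate built from it, in Python (the helper names are our own; all code examples in this rewrite are illustrative sketches rather than the author's implementation):

def pi_gamma(y, gamma):
    # Linear-saturated activation: 0 below 0, the identity on [0, gamma],
    # and gamma above gamma.
    return min(max(y, 0.0), gamma)

def gate(r, s, gamma):
    # A pi_gamma-gate with weights r applied to inputs s (bias folded in).
    return pi_gamma(sum(ri * si for ri, si in zip(r, s)), gamma)

print(gate([0.5, -0.25], [0.8, 0.4], gamma=1.0))   # -> 0.3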
As a model for a spiking neuron we take the common model of a leaky integrate-and-fire neuron with noise, in the formulation of the somewhat more general spike response model of Gerstner and van Hemmen (1994). The only specific assumption needed for the construction in this article is that postsynaptic potentials can be described (or at least approximated) by a linear function during some initial segment. Actually the constructions of this article appear to be of interest even if this assumption is not satisfied, but in that case they are harder to analyze theoretically.

We consider networks that consist of a finite set V of spiking neurons, a set E ⊆ V × V of synapses, a weight w_{u,v} ≥ 0 and a response function ε_{u,v}: R⁺ → R for each synapse (u, v) ∈ E (where R⁺ := {x ∈ R: x ≥ 0}), and a threshold function Θ_v: R⁺ → R⁺ for each neuron v ∈ V. Each response function ε_{u,v} models either an EPSP or an IPSP; the typical shape of EPSPs and IPSPs is indicated in Figure 1. If F_u ⊆ R⁺ is the set of firing times of a neuron u, then the potential at the trigger zone of neuron v at time t is given by

P_v(t) := Σ_{u: (u,v)∈E} Σ_{t′∈F_u: t′<t} w_{u,v} · ε_{u,v}(t − t′).

Furthermore one considers a threshold function Θ_v(t − t′) that quantifies the "reluctance" of v to fire again at time t if its last previous firing was at time t′. Thus Θ_v(x) is extremely large for small x and then approaches Θ_v(0) for larger x. In a noise-free model, a neuron v fires at time t as soon as P_v(t) reaches Θ_v(t − t′). The precise form of this threshold function Θ_v is not important for the constructions in this article, since we consider here only computations that rely on the timing of the first spike in a spike train. Thus it suffices to assume that Θ_v(t − t′) = Θ_v(0) for sufficiently large values of t − t′ and
Figure 1: The typical shape of inhibitory and excitatory postsynaptic potentials at a biological neuron. (We assume that the resting membrane potential has the value 0.)
that inf{Θ_v(x): x ∈ (0, γ]} is larger than the potentials P_v(t) that occur in the construction of Section 2 for t ∈ [T_out − γ, T_out]. The latter condition (which amounts to the assumption of a sufficiently long refractory period) will prevent iterated firing of neuron v during the critical time interval [T_out − γ, T_out].

The construction in Section 2 is robust with respect to several types of noise that make the model of a spiking neuron biologically more realistic. As in the model for a leaky integrate-and-fire neuron with noise, we allow that the potentials P_v(t) and the threshold functions Θ_v(t − t′) are subject to some additive noise. Hence P_v(t) is replaced by

P_v^{noisy}(t) := P_v(t) + α_v(t),

and Θ_v(t − t′) is replaced by

Θ_v^{noisy}(t − t′) := Θ_v(t − t′) + β_v(t),

where α_v(t) and β_v(t) describe the impact of some unknown (or even adversarial) source of noise (which might, for example, result from synapse failures). One assumes in most previous theoretical studies that α_v(t), β_v(t) are distributed according to some specific probability distribution (e.g., white noise), whereas our subsequent constructions allow that α_v(t), β_v(t) are some arbitrary functions with bounded absolute value (e.g., "systematic noise").

In a simpler model for a noisy spiking neuron, one assumes that a neuron v fires exactly at those time points t when P_v^{noisy}(t) reaches from below the value Θ_v^{noisy}(t − t′). We consider in this article a biologically more realistic
model, whereas in Gerstner and van Hemmen (1994), the size of the difference P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) governs just the probability that neuron v fires. The choice of the exact firing times is left up to some unknown stochastic processes, and it may, for example, occur that v does not fire in a time interval I during which P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) > 0, or that v fires "spontaneously" at a time t when P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) < 0. For the subsequent constructions we need only the following assumption about the firing mechanism: For any time interval I of length greater than 0, the probability that v fires during I is arbitrarily close to 1 if P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) is sufficiently large for t ∈ I (up to the time when v fires), and the probability that v fires during I is arbitrarily close to 0 if P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) is sufficiently negative for all t ∈ I.

It turns out that it suffices to assume only the following rather weak properties of the response functions ε_{u,v}: Each response function ε_{u,v}: R⁺ → R is either excitatory or inhibitory. All excitatory response functions ε_{u,v}(t) have the value 0 for t ∈ [0, d_{u,v}] and the value t − d_{u,v} for t ∈ [d_{u,v}, d_{u,v} + Δ], where d_{u,v} ≥ 0 is some fixed delay and Δ > 0 is some other constant. Furthermore we assume that ε_{u,v}(t) ≥ ε_{u,v}(d_{u,v} + Δ) for all t ∈ [d_{u,v} + Δ, d_{u,v} + Δ + γ], where γ with 0 < γ ≤ Δ/2 is another constant. With regard to inhibitory response functions ε_{u,v}(t), we assume that ε_{u,v}(t) = 0 for t ∈ [0, d_{u,v}] and ε_{u,v}(t) = −(t − d_{u,v}) for t ∈ [d_{u,v}, d_{u,v} + Δ]. Furthermore we assume that ε_{u,v}(t) = 0 for all sufficiently large t.

Finally we need a mechanism for increasing the firing threshold Θ := Θ_v(0) of a "rested" neuron v (at least for a short period). One biologically plausible assumption that would account for such an increase is that neuron v receives a large number of IPSPs from randomly firing neurons that arrive on synapses that are far away from the trigger zone of v, so that each of them has barely any effect on the dynamics of the potential at the trigger zone, but together they contribute a rather steady negative summand BN⁻ to the potential at the trigger zone. Other possible explanations for the increase of the firing threshold Θ could be based on the contribution of inhibitory interneurons whose IPSPs arrive close to the soma and are time locked to the onset of the stimulus, or on long-lasting inhibitions such as those mediated by GABA_B receptors. Formally we assume that each neuron v receives some negative (i.e., inhibitory) potential BN⁻ < 0 that can be assumed to be constant during the time intervals that are considered in the following arguments.

In comparison with other models for spiking neurons, this model allows more general noise than the models considered in Gerstner and van Hemmen (1994) and Maass (1995). On the other hand this model is somewhat less general than the one considered in Maass (1996a).
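The following minimal sketch puts the pieces of this formal model together (all numeric parameters are illustrative choices of ours, not values from the text): a piecewise-linear response function as assumed above, the potential P_v(t) as a weighted sum of response functions, and a noise-free firing time found by scanning for the first threshold crossing.

def eps(t, d=1.0, delta=2.0, excitatory=True):
    # Response function: 0 before the delay d, then linear with slope +1
    # (EPSP) or -1 (IPSP) on [d, d + delta]; held constant afterward,
    # a simplification of the shapes in Figure 1.
    sign = 1.0 if excitatory else -1.0
    return 0.0 if t < d else sign * min(t - d, delta)

def firing_time(spike_times, weights, theta, bn=0.0, dt=0.001, t_max=20.0):
    # First time t with P_v(t) = sum_i w_i * eps(t - t_i) + BN- >= theta.
    t = 0.0
    while t < t_max:
        p = bn + sum(w * eps(t - ti) for w, ti in zip(weights, spike_times))
        if p >= theta:
            return t
        t += dt
    return None

print(firing_time(spike_times=[0.0, 0.3], weights=[1.0, 1.0], theta=2.0))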
Having defined the formal model, we can now explain the key mechanism of the constructions in more detail. It is well known that incoming EPSPs and IPSPs are able to shift the firing time of a biological neuron. We explore this effect in the mathematical model of a spiking neuron, showing that in principle it can be used to carry out complex analog computations in temporal coding. Assume that a spiking neuron v receives PSPs from presynaptic neurons a_1, ..., a_n, that w_i is the weight (efficacy) for the synapse from a_i to v, and that d_i is the time delay from a_i to v. Then there exists a range of values for the parameters where the firing time t_v of neuron v can be written in terms of the firing times t_{a_i} of the presynaptic neurons a_i as

t_v = (Θ − BN⁻) / Σ_{i=1}^n w_i + Σ_{i=1}^n w_i · (t_{a_i} + d_i) / Σ_{i=1}^n w_i.   (1.1)
Hence in principle a spiking neuron is able to compute in temporal coding of inputs and outputs a linear function (where the efficacies of synapses encode the coefficients of the linear function, as in rate coding of analog variables). The calculations at the beginning of Section 2 show that this holds precisely if there is no noise and the PSPs are at time t_v all in their initial linearly rising or linearly decreasing phase. However, for a biological interpretation, it is interesting to know that even if the firing times t_{a_i} (or more precisely their effective values t_{a_i} + d_i) lie further apart, this mechanism computes a meaningful approximation to a linear function. It employs (through the natural shape of PSPs) an interesting adaptation of outliers among the t_{a_i} + d_i: Input neurons a_i that fire too late (relative to the average) lose their influence on the determination of t_v, and input neurons a_i that fire extremely early have the same impact as neurons a_i that fire somewhat later (but still before the average). Remark 2 in Section 2 provides a more detailed discussion of this effect.

The goal of the next section is to prove a rigorous theoretical result about the computational power of formal models for networks of spiking neurons. We are not claiming that this construction (which is designed exclusively for that purpose) provides a blueprint for the organization of fast analog computations in biological neural systems. However, it provides the first theoretical model that is able to explain the possibility of fast analog computations with noisy spiking neurons. Some remarks about the possible biological relevance of details of this construction can be found after the theorem in Section 2.
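As a quick numeric check of equation 1.1 (a toy example with parameters of our own choosing, not taken from the text): with all PSPs in their linearly rising phase, the potential is P_v(t) = Σ_i w_i · (t − (t_{a_i} + d_i)) + BN⁻, and solving P_v(t_v) = Θ gives the weighted-average form above.

w = [1.0, 2.0, 1.0]          # synaptic efficacies w_i
t_a = [0.0, 0.2, 0.4]        # presynaptic firing times t_ai
d = [1.0, 1.0, 1.2]          # delays d_i
theta, bn = 3.0, -0.5        # threshold Theta and background potential BN-

wsum = sum(w)
t_v = (theta - bn) / wsum + sum(
    wi * (ti + di) for wi, ti, di in zip(w, t_a, d)) / wsum

# Direct check: the potential at t_v equals the threshold. Here each offset
# t_v - (t_ai + d_i) stays in [0, 2], so a rising segment of length Delta = 2
# keeps the linearity assumption valid.
p = bn + sum(wi * (t_v - (ti + di)) for wi, ti, di in zip(w, t_a, d))
assert abs(p - theta) < 1e-9
print(t_v)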
2 The Main Construction
Consider an arbitrary π_γ-gate G, for some γ > 0, which computes a function f_G: [0, γ]^n → [0, γ]. Let r_1, ..., r_n ∈ R be the weights of G. Thus we have, for arbitrary inputs s_1, ..., s_n ∈ [0, γ],

f_G(s_1, ..., s_n) = 0 if Σ_{i=1}^n r_i · s_i < 0,
f_G(s_1, ..., s_n) = Σ_{i=1}^n r_i · s_i if 0 ≤ Σ_{i=1}^n r_i · s_i ≤ γ,
f_G(s_1, ..., s_n) = γ if Σ_{i=1}^n r_i · s_i > γ.
Figure 2: The simulation of a sigmoidal gate by a spiking neuron v in temporal coding.
For the sake of simplicity we first consider the case of spiking neurons without noise (i.e., α_v ≡ β_v ≡ 0, and each neuron v fires whenever P_v(t) crosses Θ_v(t − t′) from below). Then we describe the changes that are needed in this construction for the general case of noisy spiking neurons.

We construct for a given π_γ-gate G and for an arbitrary given parameter ε > 0 with ε < γ a network N_{G,ε} of spiking neurons that approximates f_G with precision ≤ ε; that is, the output N_{G,ε}(s_1, ..., s_n) of N_{G,ε} satisfies |N_{G,ε}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ ε for all s_1, ..., s_n ∈ [0, γ]. In order to be able to scale the size of weights according to the given gate G, we assume that N_{G,ε} receives an additional input s_0 that is given like the other input variables s_1, ..., s_n in temporal coding. Thus we assume that there are n + 1 input neurons a_0, ..., a_n with the property that a_i fires at time T_in − s_i (where T_in is some constant). We will discuss at the end of this section (in Remarks 5 and 6) biologically more plausible variations of the construction where a_0 and T_in are not needed.

We construct a spiking neuron v in N_{G,ε} that receives n + 1 PSPs h_0(t), ..., h_n(t) from the n + 1 input neurons a_0, ..., a_n, which result from the firing of a_i at time T_in − s_i (see Figure 2). In addition v receives some auxiliary PSPs from other spiking neurons in N_{G,ε} whose timing depends only on T_in.
The firing time t_v of this neuron v will provide the output N_{G,ε}(s_1, ..., s_n) of the network N_{G,ε} in temporal coding; that is, v will fire at time T_out − N_{G,ε}(s_1, ..., s_n) for some T_out that does not depend on s_1, ..., s_n.

Let w_{a_i,v} be the weight of the synapse from input neuron a_i to neuron v, i = 0, ..., n. We assume that the "delay" d_{a_i,v} between a_i and v is the same for all input neurons a_0, ..., a_n, and we write d for this common delay. Thus we can describe for i = 0, ..., n the impact of the firing of a_i at time T_in − s_i on the potential at the trigger zone of neuron v at time t by the EPSP or IPSP h_i(t) = w_{a_i,v} · ε_{a_i,v}(t − (T_in − s_i)), which has on the basis of our assumptions the value

h_i(t) = 0 if t − (T_in − s_i) < d,   h_i(t) = w_i · (t − (T_in − s_i) − d) if d ≤ t − (T_in − s_i) ≤ d + Δ,

where w_i = w_{a_i,v} in the case of an EPSP and w_i = −w_{a_i,v} in the case of an IPSP. We assume that neuron v has not fired for a sufficiently long time, so that its threshold function Θ_v(t − t′) can be assumed to have a constant value Θ when the PSPs h_0(t), ..., h_n(t) arrive at the trigger zone of v. Furthermore we assume for the moment that besides these n + 1 PSPs only BN⁻ influences the potential at the trigger zone of v. This contribution BN⁻ is assumed to have a constant value in the time interval considered here. Then if no noise is present, the time t_v of the next firing of v can be described by the equality

Θ = Σ_{i=0}^n h_i(t_v) + BN⁻ = Σ_{i=0}^n w_i · (t_v − (T_in − s_i) − d) + BN⁻,   (2.1)

provided that

d ≤ t_v − (T_in − s_i) ≤ d + Δ for i = 0, ..., n.   (2.2)

We assume from now on that a fixed value s_0 = 0 is chosen for the extra input s_0. Then equation 2.1 is equivalent to

t_v = T_in + d + (Θ − BN⁻ − Σ_{i=1}^n w_i · s_i) / Σ_{i=0}^n w_i.   (2.3)

This t_v satisfies equation 2.2 if −s_j ≤ t_v − T_in − d ≤ Δ − s_j for j = 0, ..., n; hence for any s_j ∈ [0, γ] if

0 ≤ (Θ − BN⁻ − Σ_{i=1}^n w_i · s_i) / Σ_{i=0}^n w_i ≤ Δ − γ.   (2.4)
We set w_i := λ · r_i for i = 1, ..., n, where r_1, ..., r_n are the weights of the simulated π_γ-gate G, and λ > 0 is some not-yet-determined factor (that we
will later choose sufficiently large in order to make sure that neuron v fires close to the time t_v given by equation 2.3 even in the presence of noise). We choose w_0 so that Σ_{i=0}^n w_i = λ. This implies that equation 2.3 with

T_out := (Θ − BN⁻)/λ + T_in + d

is now equivalent to

t_v = T_out − Σ_{i=1}^n r_i · s_i,   (2.5)

and equation 2.4 is equivalent to

0 ≤ (Θ − BN⁻)/λ − Σ_{i=1}^n r_i · s_i ≤ Δ − γ.   (2.6)

Hence, provided that γ, Δ, and BN⁻ are chosen in relationship to Θ and λ so that

γ ≤ (Θ − BN⁻)/λ ≤ Δ − γ,   (2.7)

we have satisfied equation 2.4, and therefore achieved that the firing time t_v of neuron v provides in temporal coding the output f_G(s_1, ..., s_n) = Σ_{i=1}^n r_i · s_i of the simulated π_γ-gate G for all inputs s_1, ..., s_n ∈ [0, γ] with Σ_{i=1}^n r_i · s_i ∈ [0, γ].
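A small numeric sketch of this linear regime (all parameter values below are our own illustrative choices): with w_i = λ·r_i, s_0 = 0, and w_0 chosen so that the weights sum to λ, the firing time satisfies equation 2.5.

r = [0.3, 0.5, 0.2]              # weights of the simulated gate G
s = [0.4, 0.1, 0.6]              # inputs in [0, gamma]
gamma, delta = 1.0, 4.0          # coding range and rising-segment length
lam, theta, t_in, d = 5.0, 12.0, 0.0, 1.0
bn = theta - lam * 2.0           # pick BN- so (theta - bn)/lam = 2 lies in [gamma, delta - gamma]

w = [lam * ri for ri in r]
w0 = lam - sum(w)                # weight of the auxiliary input a_0 (s_0 = 0)
t_out = (theta - bn) / lam + t_in + d

# Equation 2.5: the firing time encodes the weighted sum in temporal coding.
t_v = t_out - sum(ri * si for ri, si in zip(r, s))

# Check against equation 2.1 (with s_0 = 0 prepended):
weights, inputs = [w0] + w, [0.0] + s
p = bn + sum(wi * (t_v - (t_in - si) - d) for wi, si in zip(weights, inputs))
assert abs(p - theta) < 1e-9
print("output in temporal coding:", t_out - t_v)   # equals sum_i r_i * s_i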
In order to simulate G also for s_1, ..., s_n ∈ [0, γ] with other values of Σ_{i=1}^n r_i · s_i, we make sure that v fires at a time with distance at most ε to T_out − γ if Σ_{i=1}^n r_i · s_i is larger than γ, and that v fires with distance at most ε to time T_out if Σ_{i=1}^n r_i · s_i < 0. For that purpose, we add inhibitory and excitatory neurons that fire at times that depend on time T_in but not on the inputs s_1, ..., s_n. The activity of these auxiliary inhibitory and excitatory neurons may also shift the firing time t_v of v if Σ_{i=1}^n r_i · s_i ∈ [0, γ], but at most by ε.

According to the previously described construction, we have by equation 2.5 that t_v = T_out if s_i = 0 for i = 1, ..., n (and therefore Σ_{i=1}^n r_i · s_i = 0). Furthermore the parameters have been chosen so that equation 2.2 is satisfied for this case, which implies that each of the PSPs h_i(t) is at time T_out for s_i = 0 still within the initial segment of length Δ of its first nonzero segment. This implies that for any value of s_i ∈ [0, γ] the PSP h_i(t) is at time T_out − γ not further advanced than the end of the initial segment of length Δ of its
first nonzero segment; hence

|h_i(t)| ≤ |w_i| · Δ

for all t ≤ T_out − γ and all s_i ∈ [0, γ]. This implies that for any s_1, ..., s_n ∈ [0, γ] (even if Σ_{i=1}^n r_i · s_i ∉ [0, γ]) we have |Σ_{i=0}^n h_i(t)| ≤ W · Δ for all t ≤ T_out − γ, where W := Σ_{i=0}^n |w_i|. Hence in order to prevent a firing of v before time T_out − γ, it suffices to employ auxiliary neurons in N_{G,ε} that send IPSPs to v that lower the potential at the trigger zone of v during the interval [T_out − γ − Δ, T_out − γ] by more than W · Δ + BN⁻ − Θ. In general, these IPSPs also influence the potential at the trigger zone of v shortly after time T_out − γ. Since these auxiliary IPSPs are independent of the input s_1, ..., s_n, they may delay the firing of v even when Σ_{i=1}^n r_i · s_i ∈ [0, γ]. In order to approximate the given π_γ-gate G with an error of at most ε, we assume that these auxiliary IPSPs have vanished at the trigger zone of v by time T_out − γ + ε. It is obvious that all these conditions can be achieved with a sufficient number (and/or weights) of synchronized auxiliary IPSPs from inhibitory neurons in N_{G,ε} whose firing time depends only on T_in. This can be satisfied using only the very weak condition that each IPSP is continuous and of value 0 before it eventually vanishes (see the precise condition in Section 1).

We now want to make sure that v fires at the latest by time T_out − γ + ε if Σ_{i=1}^n r_i · s_i > γ − ε. Since all auxiliary IPSPs have vanished by time T_out − γ + ε, we have P_v(T_out − γ + ε) = Σ_{i=0}^n h_i(T_out − γ + ε) + BN⁻. Hence it suffices to show that the latter is ≥ Θ. Consider the set I of those i ∈ {1, ..., n} with r_i > 0, that is, those i where w_{a_i,v} · ε_{a_i,v}(t) represents an EPSP. By choosing suitable values s′_i in the interval [0, s_i] for i ∈ I and by setting s′_i := s_i for i ∈ {1, ..., n} − I, one can achieve that Σ_{i=1}^n r_i · s′_i = γ − ε. According to equation 2.5, the potential P_v reaches the value Θ at time T_out − γ + ε for the input (s′_1, ..., s′_n), and according to equations 2.2, 2.4, 2.6, and 2.7, each PSP w_{a_i,v} · ε_{a_i,v} is at time T_out − γ + ε still within Δ of the beginning of its nonzero phase. If we now change s′_i to s_i for i ∈ I, the EPSP w_{a_i,v} · ε_{a_i,v} will be advanced in time by s_i − s′_i ∈ [0, γ]. Hence each of the EPSPs w_{a_i,v} · ε_{a_i,v} for i ∈ I is for input s_i at time T_out − γ + ε within Δ + γ of the beginning of its rising phase. Since we have assumed in Section 1 that w_{a_i,v} · ε_{a_i,v}(t) ≥ w_{a_i,v} · ε_{a_i,v}(d + Δ) for all t ∈ [d + Δ, d + Δ + γ], the potential P_v(T_out − γ + ε) has for input (s_1, ..., s_n) a value that is at least as large as for input (s′_1, ..., s′_n), and therefore a value ≥ Θ. This implies that in the case Σ_{i=1}^n r_i · s_i > γ − ε the neuron v will fire within the time interval [T_out − γ, T_out − γ + ε].

In an analogous manner we can achieve with the help of EPSPs from auxiliary excitatory neurons in N_{G,ε} (whose firing time depends only on T_in, not on the input values s_1, ..., s_n) that neuron v fires at the latest at time T_out, even if Σ_{i=1}^n r_i · s_i < 0. The preceding analysis implies that for any s_i ∈ [0, γ] and any t ≤ T_out, the absolute value of the PSP w_{a_i,v} · ε_{a_i,v}(t) can be bounded by |w_{a_i,v}| · ρ, where ρ := sup{|ε_{a_i,v}(t)|: i ∈ {0, ..., n} and t ∈ [d, d + Δ + γ]} is the maximum absolute value that any ε_{a_i,v}(t) for i = 0, ..., n can reach during the initial segment of length Δ + γ of its nonzero segment. Thus if the absolute value of these functions ε_{a_i,v}(t) grows during [d + Δ, d + Δ + γ] not faster than during their linear segment [d, d + Δ] (which
Figure 3: The resulting activation function σ of the implementation of a sigmoidal gate in temporal coding.
apparently holds for biological PSPs), we can set ρ := Δ + γ. Consequently we can derive for any t ≤ T_out and any values s_1, ..., s_n ∈ [0, γ] the bound |Σ_{i=0}^n h_i(t)| ≤ Σ_{i=0}^n |w_{a_i,v}| · ρ = W · ρ. Thus it suffices to make sure that EPSPs from auxiliary neurons in N_{G,ε} reach the trigger zone of neuron v shortly after time T_out − ε and that their sum reaches a value Θ − BN⁻ + W · ρ by time T_out. Then for any values of s_1, ..., s_n ∈ [0, γ] the potential P_v(t) will reach the value Θ at the latest by time T_out, causing a firing of v by time T_out. These auxiliary EPSPs will in general have the side effect that they slightly advance the firing time t_v for inputs s_1, ..., s_n ∈ [0, γ] with Σ_{i=1}^n r_i · s_i ∈ [0, ε], but the firing time will stay within the interval [T_out − ε, T_out].

According to our assumption in Section 1, inf{Θ_v(x): x ∈ (0, γ]} is so large that v cannot fire more than once during [T_out − γ, T_out]. Hence our construction makes sure that v fires exactly once during [T_out − γ, T_out] for any inputs s_1, ..., s_n ∈ [0, γ]. In the following discussion we will denote this firing time t_v of v by T_out − N_{G,ε}(s_1, ..., s_n).

We have now shown that under the assumption that no noise influences the firing time t_v of neuron v, we have |N_{G,ε}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ ε for all s_1, ..., s_n ∈ [0, γ]. If one plots the variable y = N_{G,ε}(s_1, ..., s_n) that neuron v outputs in temporal coding as a function of Σ_{i=1}^n r_i · s_i, one sees that the effective "activation function" σ of this implementation of the π_γ-gate G has the form indicated in Figure 3. Like the piecewise linear activation function π_γ, …
v will be arbitrarily close to the previously considered deterministic firing time t_v of neuron v, with probability arbitrarily close to 1.

Assume that some arbitrary ε, δ > 0 are given. In the preceding construction, we had chosen values w_i := λ · r_i for the strengths of the synapses to neuron v, where λ > 0 was some not-yet-determined factor. According to equation 2.3, a change in the value of this parameter λ can result only in a shift of t_v by an amount that is independent from the input variables s_1, ..., s_n. Furthermore, if one chooses the contribution BN⁻ of inhibitory background noise as a function of λ so that Θ − BN⁻ = λ · c for some constant c, the resulting firing time t_v is completely independent of the choice of λ. On the other hand, the parameter λ occurs as a factor in the nonconstant term Σ_{i=0}^n h_i(t) = Σ_{i=0}^n λ · r_i · ε_{a_i,v}(t − (T_in − s_i)) of the potential P_v(t) = Σ_{i=0}^n h_i(t) + BN⁻ (we ignore the effect of auxiliary PSPs for the moment). Hence by choosing λ sufficiently large, one can make sure that in the deterministic case P_v(t) has an arbitrarily large derivative at the time t_v when it crosses Θ_v(t − t′). Furthermore, if we choose γ, Δ, and BN⁻ so that equation 2.7 can be replaced by

2γ ≤ (Θ − BN⁻)/λ ≤ Δ − γ − ε,   (2.8)

then it is guaranteed that the linearly increasing potential P_v(t) (with slope proportional to λ) will rise with the same slope throughout the interval [t_v − γ, t_v + ε].

If we now keep this setting of the parameters but replace the deterministic neuron v by a stochastic neuron v that is subject to the two types of noise that were specified in Section 1, the following can be observed: If |α(t)| ≤ ᾱ and |β(t)| ≤ β̄ for all t, then the time interval around t_v during which P_v(t) is within the interval [Θ − ᾱ − β̄, Θ + ᾱ + β̄] becomes arbitrarily small for sufficiently large λ. Furthermore, if λ is sufficiently large (and BN⁻ is adjusted along with λ so that (Θ − BN⁻)/λ = c for some constant c > 0 that is independent of λ), then during [T_out − γ, t_v − ε] the term P_v(t) + ᾱ + β̄ − Θ is arbitrarily negative. Thus the probability that v fires during [T_out − γ, t_v − ε] can be brought arbitrarily close to 0. In addition, by making λ sufficiently large, one can achieve that P_v(t) − ᾱ − β̄ − Θ is arbitrarily large throughout the time interval [t_v + ε/2, t_v + ε]. Hence the probability that v fires by time t_v + ε can be brought arbitrarily close to 1.

If one increases the number or the weights of the previously described auxiliary PSPs in an analogous manner, one can achieve for the noisy neuron model that with probability ≥ 1 − δ the neuron v of network N_{G,ε,δ} fires exactly once during [T_out − γ, T_out], and at a time t̃_v = T_out − N_{G,ε,δ}(s_1, ..., s_n) with |N_{G,ε,δ}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ 2ε, for all s_1, ..., s_n ∈ [0, γ].
The previously described construction can easily be adapted to allow shifts in the arrival times of auxiliary PSPs at v of up to ε/2. Hence we can allow that these PSPs also come from noisy spiking neurons.

In order to simulate an arbitrary given feedforward network N of π_γ-gates with precision ε and probability ≥ 1 − δ of correctness, one applies the preceding construction separately to each gate G in N. For any given ε, δ > 0, one determines for each π_γ-gate G in N (starting at the output gates of N) suitable values ε_G, δ_G > 0 so that, for the resulting network to approximate N within ε with probability ≥ 1 − δ, it suffices that each gate G in N is approximated within ε_G with probability ≥ 1 − δ_G by a network N_{G,ε_G,δ_G} of noisy spiking neurons. In this way, one can achieve that the network N_{N,ε,δ} of noisy spiking neurons that is composed of these networks N_{G,ε_G,δ_G} approximates in temporal coding with probability ≥ 1 − δ the output of N within ε, for any given network input x_1, ..., x_m ∈ [0, γ]. Actually it would be more efficient if the modules N_{G,ε_G,δ_G} are not disjoint but share the neurons that generate the auxiliary EPSPs and IPSPs. Note that the problem of achieving reliable digital and analog computation with noisy neurons has already been considered in von Neumann (1956) for other types of formal neuron models. Thus we have shown:
Theorem. For any given ε, δ > 0 one can simulate any given feedforward sigmoidal neural net N consisting of π_γ-gates (for some sufficiently small γ that depends on the chosen model for a spiking neuron) by a network N_{N,ε,δ} of noisy spiking neurons in temporal coding. More precisely, for any network input x_1, ..., x_m ∈ [0, γ] the output of N_{N,ε,δ} differs with probability ≥ 1 − δ by at most ε from that of N. Furthermore the computation time of N_{N,ε,δ} depends on neither the number of gates in N nor the parameters ε, δ, but only on the number of layers of the sigmoidal neural network N.
Remark 1. One can exploit the concrete shape of PSPs in biological neurons in order to arrive at an alternative approximation of sigmoidal neural nets by spiking neurons in temporal coding, without simulating explicitly the activation function π_γ of the sigmoidal neural net. In this alternative approach, one can delete from the previously described construction the auxiliary neurons that prevent a firing of neuron v before time T_out − γ and force it to fire at the latest by time T_out. Furthermore in this interpretation we no longer have to assume that the initial linear segments of all incoming PSPs overlap, and hence less precision in the firing times is needed. In case v fires substantially after time T_out, the resulting PSP at a postsynaptic neuron v′ (which simulates a sigmoidal gate on the next layer of the simulated sigmoidal neural net) still has its initial value 0 at the time t_{v′} when v′ fires. Dually, if v fires before time T_out − γ, the resulting PSP may be at time t_{v′} already near its saturation value (where it increases or decreases more slowly than during its initial "linear phase"). Thus in either case, the concrete functional form of EPSPs and IPSPs in a biological neuron v′ mod-
ulates the input from a presynaptic neuron v in a way that corresponds to the application of a saturating sigmoidal activation function to the output of v in temporal coding. Furthermore, larger differences than γ between the firing times of presynaptic neurons can be tolerated in this context.

This implicit implementation of an activation function is, however, mathematically imprecise. A closer look shows that the amount by which the output of v is adjusted in temporal coding depends on the size of the output of v relative to the size of the outputs of the other presynaptic neurons of v′ (in temporal coding). The output of v is changed by a larger amount if it differs more strongly from the median output of the other presynaptic neurons of v′ (but it is always moved in the direction of the median). This context-dependent modulation of the output of v is hard to exploit for precise theoretical results, but it appears to be useful for practically relevant computations such as pattern recognition. Thus we arrive in this way at a variation of the implementation of a sigmoidal neural net, where two biologically dubious components of our main construction (the auxiliary neurons that force v to fire exactly once during [T_out − γ, T_out], as well as the requirement that the initial linear segments of all relevant PSPs have to overlap) are deleted, but the resulting network of spiking neurons can still carry out complex and practical multilayer computations in temporal coding.
Remark 2. Our construction shows as a special case that a linear function of the form w · s for real-valued inputs s = (s_1, ..., s_n) and a stored vector w = (w_1, ..., w_n) can be computed very efficiently (and very quickly) by a spiking neuron v in temporal coding. In this case, no auxiliary neurons are needed. Quick computation of linear functions is relevant in many biological contexts, such as coordinate transformations between different frames of reference or the analysis of a complex stimulus s in terms of many stored patterns w, w′, .... For example, in an olfactory neural system (see, e.g., Hopfield 1991, 1995) the stimulus s may be thought of as a superposition of various stored basic odors w, w′, .... In this case the output y = w · s of neuron v in temporal coding may be interpreted as the amount by which the basic odor w (which is stored in the efficacies of the synapses of v) is present in the stimulus s. Furthermore another neuron v̂ on the next layer might receive as its input y = (y, y′, ...) from several such neurons v, v′, ...; that is, v̂ receives the "mixing proportions" y = w · s, y′ = w′ · s for various stored basic odors w, w′, ... in temporal coding. This neuron v̂ on the second layer can then continue the pattern analysis by computing for its input y the inner product ŵ · y with some stored "higher-order pattern" ŵ (e.g., the composition of basic odors characteristic for an individual animal). Such multilayer pattern analysis is facilitated by the fact that the neurons considered here encode their output in the same way in which their input is encoded (in contrast to the approach in Hopfield 1995). One also gets in this way a very fast implementation of Linsker's network (Linsker 1988) with spiking neurons in temporal coding.
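The multilayer pattern analysis described here can be summarized in a few lines (a toy sketch with stimuli and stored patterns of our own invention; the temporal mechanics of equation 2.5 are abstracted away, and only the values carried by the firing times are shown):

import numpy as np

s = np.array([0.2, 0.7, 0.1])            # stimulus
W = np.array([[0.6, 0.3, 0.1],           # stored "basic odor" patterns w, w'
              [0.1, 0.2, 0.7]])
y = W @ s                                # layer 1: mixing proportions w.s, w'.s
w_hat = np.array([0.5, 0.5])             # stored higher-order pattern
print(y, w_hat @ y)                      # layer 2: inner product with y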
Remark 3. For biological neurons it is impossible to choose the parameter λ arbitrarily large. On the other hand, the experiments of Bryant and Segundo (1976), as well as the recent experiments of Mainen and Sejnowski (1995), suggest that biological neurons already exhibit very high precision in their firing times if the slope of the membrane potential P_v(t) at the time t when it crosses the firing threshold is moderately large.

Remark 4. Analog computations with the type of temporal coding considered here become impossible if the jitter in the firing times is large relative to the size of the range [0, γ] of analog variables in temporal coding. The following points should be taken into account in this context:
• Even if the jitter is so high that in temporal coding just two different outputs can be distinguished in a reliable manner, the computational power of the constructed network N of spiking neurons is still enormous from the point of view of computational complexity theory. In this case, the network can simulate arbitrary threshold circuits (i.e., multilayer perceptrons whose gates give binary outputs) very fast. Threshold circuits are extremely powerful models for parallel digital computation, which can (in contrast to PRAMs and other models for currently available computer hardware) compute various nontrivial boolean functions that depend on a large number n of input bits with polynomially in n many gates and not more than four layers (Johnson 1990; Roychowdhury et al. 1994; Siu et al. 1995).
• Formally the value of γ in the preceding construction was required to be very small: it was bounded by a fraction of the length Δ of the rising segment of an EPSP (see inequalities 2.7 and 2.8). However, keep in mind that we were forced to resort to such small values for γ only because we wanted to prove a rigorous theoretical result. The considerations in Remark 1 suggest that in a practical context, one can still carry out meaningful computations if the linearly rising segments of incoming EPSPs are spread out over a somewhat larger time interval than Δ, where they need no longer overlap. If one wants to compute specific values for the parameters γ and Δ, one runs into the problem that the value of Δ varies enormously among different biological neural systems, from about 1 to 3 ms for EPSPs resulting from AMPA receptors in cortex to about a second in dark-adapted toad photoreceptors.
• There exists an alternative interpretation of our construction in a biological neural system that focuses on a smaller spatial scale. In this interpretation, a "hot spot" in the dendritic tree of a biological neuron
assumes the role of a spiking neuron in our construction. Hot spots are patches of membrane with voltage-dependent channels that are known to occur at branching points of the dendritic tree (Shepherd 1994; Jaffe et al. 1992; Mel 1993; Softky 1994). They fire a dendritic spike if the membrane potential reaches a certain threshold value. From this point of view, a single biological neuron may be viewed as a network of spiking neurons, which, according to our construction, can simulate in temporal coding a multilayer sigmoidal neural net. The timing precision of such circuits is possibly very high since it does not involve synapses (see also Softky 1994). Furthermore in this interpretation the computation is less affected by possible failures of synapses.

• In another biological interpretation one can replace each neuron v in
N_{N,ε,δ} by a pool P_v of neurons (as in a synfire chain; see Abeles et al. 1993). The firing time t_v of neuron v is then replaced by the mean firing time of neurons in P_v, and an EPSP from v is replaced by a sum of EPSPs from neurons in P_v. This interpretation has the advantage that it is less affected by jitter in the firing times of individual neurons and by stochastic failures of individual synapses. Furthermore, the rising segment of a sum of EPSPs from P_v is likely to be substantially longer than that of an individual EPSP. Hence in this interpretation, the parameter γ can be chosen substantially larger than for single neurons.
Remark 5. The auxiliary neuron a_0 that provides in the construction a reference spike at time T_in is not really necessary. Without such an auxiliary neuron a_0, the weights w_i of the inputs s_i are automatically normalized so that they sum up to 1 (see equation 2.3 with Σ_{i=0}^n w_i replaced by Σ_{i=1}^n w_i). Such normalization is disadvantageous for simulating an arbitrary given sigmoidal gate G, but it may be desirable in a biological context.

Remark 6. Our construction was based on a specific temporal coding in terms of reference times T_in and T_out. Some biological systems may provide such reference times that are time locked with the onset of a sensory stimulus. However, our construction can also be adapted to other types of temporal coding that require no reference times. For example, the basic equation at the end of Section 1 (equation 1.1) also yields a mechanism for carrying out analog computations with regard to the scheme of "competitive temporal coding" discussed in Thorpe and Imbert (1989). In this coding scheme the firing time t_{a_i} of neuron a_i encodes the analog variable (t_{a_i} + d_i) − min_{j=1,...,n}(t_{a_j} + d_j), where d_i = d_{a_i,v} is the delay between neuron a_i and v. For T := min_{j=1,...,n}(t_{a_j} + d_j) we have, according to equation 1.1,

t_v = (Θ − BN⁻) / Σ_{i=1}^n w_i + T + Σ_{i=1}^n w_i · ((t_{a_i} + d_i) − T) / Σ_{i=1}^n w_i.   (2.9)
In this coding scheme the firing time t_v encodes the analog value t_v − min_{û∈L} t_û, where L is the layer of neurons to which v belongs. Thus, equation 2.9 provides the mechanism for computing a segment of a linear function for inputs and outputs in competitive temporal coding. Hence our construction also provides a method for simulating multilayer neural nets with regard to this alternative coding scheme. No reference times T_in or T_out are needed for that.
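A small sketch of the reference-free property behind equation 2.9 (toy numbers of our own choosing): because the closed-form firing time of equation 1.1 is an affine function of the input firing times whose coefficients sum to 1, shifting all presynaptic firing times by a common constant shifts t_v by the same constant, so values coded relative to the layer minimum are unchanged.

def t_v(t_a, d, w, theta, bn):
    # Closed-form firing time of equation 1.1 (all PSPs in their rising phase).
    wsum = sum(w)
    return (theta - bn) / wsum + sum(
        wi * (ti + di) for wi, ti, di in zip(w, t_a, d)) / wsum

t_a = [0.0, 0.2, 0.5]
d, w = [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]
base = t_v(t_a, d, w, theta=3.0, bn=0.0)
shifted = t_v([t + 7.0 for t in t_a], d, w, theta=3.0, bn=0.0)
assert abs((shifted - base) - 7.0) < 1e-9   # output shifts with the inputs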
To simulate the bounded output range of a sigmoidal activation function by spiking neurons with competitive temporal coding, one can add lateral excitation among all neurons on the same layer L. In this way the neurons in L are forced to fire within a rather small time window (corresponding to the bounded output range of a sigmoidal activation function). In other applications, it appears to be more advantageous to employ instead lateral inhibition. This also has the effect of preventing the value of t_v − min_{û∈L} t_û from becoming too large.
Remark 7. For the sake of simplicity we have considered in the preceding construction only feedforward computations on networks of spiking neurons. However, the same simulation method can be used to simulate recurrent sigmoidal neural nets by recurrent networks of spiking neurons. For example, in this way one gets a novel implementation of a Hopfield net (with synchronous updates). At each "round" of the computation, the output of a unit of the Hopfield net is encoded by the firing time of a corresponding spiking neuron relative to that of other neurons in the network. For example, one may assume that a neuron gives the highest possible output value if it is among the first neurons that fire at that round. Each spiking neuron simulates in temporal coding a sigmoidal unit whose inputs are the firing times of the other spiking neurons in the network at the previous round. One can employ here competitive temporal coding (see Remark 6); hence no reference times or external clock are needed. A stable state of the Hopfield net is reached in this implementation (with competitive temporal coding) if and only if all neurons fire "regularly" (i.e., at regular intervals with a common interspike interval) but generally with different phases. These phases encode the output values of the individual neurons, and together these phase differences represent a "recalled" stored pattern of the Hopfield net. Thus, each stored pattern of the Hopfield net is realized by a different assignment of phase differences in some stable oscillation.

This novel implementation of a Hopfield net with spiking neurons in temporal coding has, apart from its high computation speed, another feature that is possibly of biological interest: whereas the input to such a network of spiking neurons can be transient (encoded in the relative timing of the firing of each neuron in the network at the first round), its output is available over
a longer time period, since it is encoded in the phases of the neurons in a stable global oscillation of the network. Hence, this implementation makes it possible that even in the rapidly fluctuating environment of temporal coding with single spikes, the outputs from different neural subsystems (which may operate at different time scales) can be collected and integrated by a larger neural system.
3 Universal Approximation Properties of Networks of Noisy Spiking Neurons in Temporal Coding
One of the most interesting and useful features of sigmoidal neural nets N is the fact that if [0, γ] is the range of their activation functions, they can approximate, for any given natural numbers n and k, any given continuous function F from [0, γ]^n into [0, γ]^k within any given ε > 0 (with regard to uniform convergence, i.e., the L_∞ norm). Furthermore it suffices to consider for this purpose feedforward nets N with just one hidden layer of neurons and (roughly) any activation function σ that is not a polynomial (Leshno et al. 1993). In addition, many years of experiments with backpropagation and other learning rules for sigmoidal neural nets have shown that for most concrete application problems, sigmoidal neural nets with relatively few hidden units allow a satisfactory approximation of the underlying target function F.

The result from the preceding section allows us to transfer these results to networks of spiking neurons with temporal coding. If some feedforward sigmoidal neural net N approximates an arbitrary given continuous function F: [0, γ]^n → [0, γ]^k within ε (with regard to the L_∞ norm), then with probability ≥ 1 − δ the network N_{N,ε,δ} of spiking neurons that we constructed in Section 2 approximates the same F within 2ε (with regard to the L_∞ norm). Furthermore, if N has only a small number p of layers, the computation time of N_{N,ε,δ} can be bounded (for biologically reasonable choices of the parameters involved) by 10 · p ms. Thus if one neglects the fact that the fan-in of biological neurons is bounded by some fixed (although rather large) constant, the preceding theoretical results suggest that networks of biological neurons can in principle approximate arbitrary continuous functions F: [0, γ]^n → [0, γ]^k within any given ε with a computation time of not more than 20 ms.

Finally, we would like to point out that our approximation result holds not only for the particular way of encoding analog inputs and outputs by firing times that was considered in the previous section, but basically for any coding that is continuously related to it. More precisely, let n, ñ, k, k̃ be arbitrary natural numbers. Let Q: [0, γ]^ñ → [0, 1]^n be any continuous function that specifies a method of "decoding" n variables ranging over [0, 1] from the firing times of ñ neurons during some time window of length γ (whose
end point is marked by the firing time of the first one of these ñ neurons), and let R: [0, 1]^k → [0, γ]^k̃ be any continuous and invertible function that describes a method for "encoding" k output variables ranging over [0, 1] by the firing times of k̃ neurons during some time window of length γ (whose end point is marked by the firing time of the first one of these k̃ neurons). Then for any given continuous function F̃: [0, 1]^n → [0, 1]^k, the composition F := R ∘ F̃ ∘ Q of these three functions is a continuous function from [0, γ]^ñ into [0, γ]^k̃. According to our preceding argument, there exists a network N_{F,ε,δ} of noisy spiking neurons (with one "hidden layer") such that for any x ∈ [0, γ]^ñ one has for the output N_{F,ε,δ}(x) of this network that ‖F(x) − N_{F,ε,δ}(x)‖ ≤ ε with probability ≥ 1 − δ, where ‖ · ‖ can be any common norm. Hence N_{F,ε,δ} approximates for arbitrary inputs the given function F̃: [0, 1]^n → [0, 1]^k, for arbitrarily chosen continuous functions R, Q for coding and decoding of analog variables by firing times of spiking neurons, with a precision of at least sup{‖R⁻¹(y) − R⁻¹(y′)‖ : y, y′ ∈ [0, γ]^k̃ and ‖y − y′‖ ≤ ε}. Thus, if the inverse R⁻¹ of the function R is uniformly continuous, one can approximate F̃ with regard to neural coding and decoding described by R and Q with arbitrarily high precision by networks of noisy spiking neurons with just one hidden layer.
"
.
.,.
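The composition F = R ∘ F̃ ∘ Q can be made concrete with placeholder choices (all three functions below are illustrative assumptions of ours, not the paper's): Q decodes firing times within a window of length γ into [0, 1] variables, F̃ is the target function, and R encodes the result back into firing-time offsets.

gamma = 1.0

def Q(firing_times):
    # Decode: offsets from the earliest spike, scaled into [0, 1].
    t0 = min(firing_times)
    return [(t - t0) / gamma for t in firing_times]

def F_tilde(x):
    # Target function on [0, 1]^n; here simply the mean of the inputs.
    return [sum(x) / len(x)]

def R(y):
    # Encode: a value in [0, 1] becomes a firing-time offset in [0, gamma].
    return [gamma * yi for yi in y]

def F(firing_times):
    return R(F_tilde(Q(firing_times)))

print(F([2.0, 2.3, 2.9]))   # inputs spread over less than gamma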
4 Consequences for Learning
In the traditional interpretation of (unsupervised) Hebbian learning, a synapse is strengthened if both the presynaptic and the postsynaptic neurons are simultaneously "active" (i.e., both give high output values in terms of their current firing rates). In the implementation of a sigmoidal neural net N by a network of spiking neurons N_{N,ε,δ} in Section 2, the "weights" r_i of N are in fact modeled by the "strengths" w_i of corresponding synapses between spiking neurons. However, the information whether both the presynaptic and postsynaptic neurons give high output values in temporal coding can no longer be read off from their "activity" but only from the time difference T_i between their firing times. This observation gives rise to the question of whether there are biological mechanisms known that support a modulation of the efficacy (i.e., "strength") w_i of a synapse as a function of this time difference T_i.

If one works in the linear range of the simulation N_{G,ε} of a π_γ-gate G according to Section 2 (where G computes the function (s_1, ..., s_n) ↦ Σ_{i=1}^n r_i · s_i and h_i(t) describes an EPSP), then for Hebbian learning it would be desirable to increase w_i = λ · r_i if s_i is close to Σ_{j=1}^n r_j · s_j; that is, if the difference in firing times T_i := t_v − (T_in − s_i) = (T_out − T_in) + s_i − Σ_{j=1}^n r_j · s_j is close to T_out − T_in. On the other hand, one would like to decrease w_i if T_i is substantially smaller
or larger than T_out − T_in. Hence, a Hebbian-style unsupervised learning rule that increases w_i whenever T_i is close to T_out − T_in, and decreases w_i whenever T_i is substantially smaller or larger (for suitable parameters that control the tolerance and the learning rate), would be meaningful in this context.
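The exact form of the rule aside, the following is a minimal sketch of the qualitative behavior just described (tolerance vartheta and learning rate mu are our own illustrative choices, not parameters from the text):

def hebb_update(w_i, T_i, T_ref, vartheta=0.1, mu=0.01):
    # Increase w_i when T_i is within vartheta of T_ref = T_out - T_in;
    # decrease it when T_i lies substantially outside that band.
    return w_i + mu if abs(T_i - T_ref) <= vartheta else w_i - mu

w = 0.5
w = hebb_update(w, T_i=1.02, T_ref=1.0)   # near coincidence: potentiate
w = hebb_update(w, T_i=1.60, T_ref=1.0)   # large mismatch: depress
print(w)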
Recent results from neurobiology (Stuart and Sakmann 1994) show that action potentials in neocortical pyramidal cells are actively (i.e., supported by voltage-dependent channels) propagated backward from the soma into the dendrites (see also Jaffe et al. 1992). Hence the time difference T_i between the firing of the presynaptic and the postsynaptic neurons is in principle available to each synapse. Furthermore new experimental results (Markram 1995; Markram and Sakmann 1995) show that in vitro, the efficacy of synapses of neocortical pyramidal neurons is in fact modulated as a function of this time difference T_i.

There exists one interesting structural difference between this interpretation of Hebbian learning in temporal coding and its traditional interpretation: the time difference T_i provides a synapse with the full information about the correlation between the output values of the pre- and postsynaptic neurons in temporal coding, no matter whether both neurons give high or low output values. However, in the traditional interpretation of Hebbian learning in terms of firing rates, the efficacy of a synapse is increased only if both neurons give high output values (in frequency coding).

An implementation of Hebbian learning in the temporal domain is also appealing in the context of pulse stream VLSI (i.e., "silicon spiking neurons"). These artificial neural nets are much faster than biological spiking neurons; they can work with interspike intervals in the microsecond range. If, for a hardware implementation of a sigmoidal gate with pulse stream VLSI according to the construction of Section 2, a Hebbian learning rule can be applied in the temporal domain after each pulse, such a chip may be able to carry out so many learning steps per second that it could in principle (neglecting input-output constraints) overcome the main impediment of traditional artificial neural nets: their low learning speeds.

So far we have assumed in the construction of N_{N,ε,δ} in Section 2 that the time delays d_i between the presynaptic neurons a_i and the postsynaptic neuron v (i.e., the time needed until an action potential from a_i can influence the potential at the trigger zone of v) were the same for all neurons a_i. Differences among these delays have the effect of providing an additive correction to the variables that are communicated in temporal coding. Hence, they also have the ability to give different "weights" to different input variables. In a biological context, they appear to be useful for providing to the network a priori information about its computational
task, so that Hebbian learning can be viewed as "fine-tuning" on top of this preprogrammed information. If there exist biological mechanisms for modulating such delays (see Hopfield 1995; Kempter et al. 1996), they would provide, in addition to a short-term memory via synaptic modulation, a separate mechanism for storing and adapting long-term memory via differences in the delays.

5 Conclusions
We have shown in this article that there exists a rather simple way to compute linear functions and to simulate feedforward as well as recurrent sigmoidal neural nets in temporal coding by networks of noisy spiking neurons. In contrast to the traditionally considered implementation via frequency coding, the new approach yields a computation speed that is faster by several orders of magnitude. In fact, to the best of our knowledge, it provides the first theoretical model that is able to explain the experimentally observed speed of fast information processing in the cortex on the basis of relatively slow spiking neurons as computational units.

Further experiments will be needed to determine whether this theoretical model is biologically relevant. One problem is that we do not know in which way batch inputs (consisting of many analog variables in parallel) are encoded by biological neural systems. The existing results on neural coding (see Rieke et al. 1996) address only the coding of time series, that is, sequential analog inputs. However, if further experiments showed that the input-dependent firing times in visual cortex, as reported in Bair et al. (1994), vary in a continuous (i.e., piecewise continuous) manner in response to smooth changes of complex inputs, this would provide some support to the style of theoretical models for analog computation with spiking neurons that is considered here.

Furthermore, a biological realization of recurrent neural nets (e.g., Hopfield nets) in temporal coding with spiking neurons, as proposed in Remark 7 in Section 2, would predict the occurrence of oscillations in neural systems (especially systems involved in pattern recognition and working memory), where the phase of individual neurons (or of pools of neurons) with regard to this oscillation is input dependent.

A noteworthy feature of the constructions presented here is that the "weights" (efficacies) of synapses are in principle able to play in temporal coding the same role as in the more familiar context of firing rate coding. Hence, all theories and experimental results regarding adaptation of neural circuits via synaptic plasticity can in principle also be applied to such computations in temporal coding. Furthermore, a closer look shows that the networks that we have constructed for analog computations in temporal coding can compute the same analog function on a different time scale in firing rate coding. This possible dual role of the circuits of spiking neurons constructed here is of interest in
the context of biology. There exists empirical evidence that some neural systems carry out, in addition to a very fast preliminary computation, a more thorough subsequent computation in terms of firing rates, which has the ability to integrate relevant contributions from several neural subsystems whose computational organization can be understood as an implementation
of view of computational complexity theory and engineering applications. In addition, feedforward and recurrent sigmoidal neural nets are the only parallel computational models known that support self-organization (and hence do not require cumbersome and error-prone central "programming"). Hence, if the "hardware" of biological neural systems allows in principle an efficient implementation of sigmoidal neural nets, it is not unreasonable to assume that this possibility has been realized by at least some biological neural systems (in spite of our lack of knowledge about the concrete functions that they compute). Apart from their biological interpretation, the constructions of this article appear to be of interest also in the context of hardware implementations of neural nets via pulse-stream VLSI (see, e.g., Pratt 1989; Horiuchi et al. 1991; Murray and Tarassenko 1994; Jahnke et al. 1995). A simulation of a feedforward sigmoidal neural net by pulse stream VLSI along the lines of this construction offers the possibility of combining extremely high computation speed (simulating one parallel computation step of the sigmoidal neural net per pulse) with a noise-robust and fast transmission of internal analog variables via the relative timing of pulses from different gates. We have shown in this article that networks of spiking neurons with temporal coding have at least the same computational power as sigmoidal neural nets of roughly the same size and depth. Together with a recent separation result (Maass 1996b), this implies that networks of spiking neurons have in fact strictly more computational power.

Acknowledgments

I thank Martin Arndt, David Haussler, Geoffrey Hinton, John Hopfield, Berthold Ruf, and Gordon Shepherd for stimulating discussions related to the topic of this article. I also thank the anonymous referees for their helpful comments.
References

Abeles, M., Bergman, H., Margalit, E., and Vaadia, E. 1993. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology 70.
Bair, W., Koch, C., Newsome, W., and Britten, K. 1994. Reliable temporal modulation in cortical spike trains in the awake monkey. In Proc. Symposium on Dynamics of Neural Processing, Washington, DC.
Bialek, W., and Rieke, F. 1992. Reliability and information transmission in spiking neurons. Trends in Neuroscience 15, 428-434.
Bryant, H. L., and Segundo, J. P. 1976. J. Physiology 260, 279-314.
Ferster, D., and Spruston, N. 1995. Cracking the neuronal code. Science 270, 756-757.
Gerstner, W., and van Hemmen, J. L. 1994. How to describe neuronal activity: Spikes, rates, or assemblies? In Advances in Neural Information Processing Systems 6, pp. 463-470. Morgan Kaufmann, San Mateo, CA.
Heller, J., Hertz, J. A., Kjær, T. W., and Richmond, B. J. 1995. Information flow and temporal coding in primate pattern vision. J. Computational Neuroscience 2(3), 175-193.
Hopfield, J. J. 1991. Olfactory computation and object perception. Proc. Nat. Acad. Sci. USA 88, 6462-6466.
Hopfield, J. J. 1995. Pattern recognition computation using action potential timing for stimulus representations. Nature 376, 33-36.
Horiuchi, T., Lazzaro, J., Moore, A., and Koch, C. 1991. A delay-line based motion detection chip. In Advances in Neural Information Processing Systems 3, pp. 406-412. Morgan Kaufmann, San Mateo, CA.
Jaffe, D. B., Johnston, D., Lasser-Ross, N., Lisman, J. E., Miyakawa, H., and Ross, W. N. 1992. The spread of Na+ spikes determines the pattern of dendritic Ca2+ entry into hippocampal neurons. Nature 357, 244-246.
Jahnke, A., Roth, U., and Klar, H. 1995. Towards efficient hardware for spike-processing neural networks. In Proc. of the World Congress on Neural Networks, Washington, DC.
Johnson, D. S. 1990. A catalog of complexity classes. In Handbook of Theoretical Computer Science, J. v. Leeuwen, ed., vol. A, pp. 67-162. MIT Press, Cambridge, MA.
Kempter, R., Gerstner, W., van Hemmen, J. L., and Wagner, H. 1996. Temporal coding in the sub-millisecond range: Model of barn owl auditory pathway. In Advances in Neural Information Processing Systems 8, pp. 124-130. MIT Press, Cambridge, MA.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861-867.
Linsker, R. 1988. Self-organization in a perceptual network. Computer Magazine 21, 105-117.
Maass, W. 1995. On the computational power of networks of spiking neurons. In Advances in Neural Information Processing Systems 7, pp. 183-190. MIT Press, Cambridge, MA.
Maass, W. 1996a. On the computational power of noisy spiking neurons. In Advances in Neural Information Processing Systems 8, pp. 211-217. MIT Press, Cambridge, MA.
Maass, W. 1996b. Networks of spiking neurons: The third generation of neural network models. To appear in Neural Networks.
Mainen, Z. F., and Sejnowski, T. J. 1995. Reliability of spike timing in neocortical neurons. Science 268, 1503-1506.
Markram, H. 1995. Neocortical pyramidal neurons scale the efficacy of synaptic input according to arrival time: A proposed selection principle of the most appropriate synaptic information. In Proc. of Cortical Dynamics, pp. 10-11, Jerusalem.
Markram, H., and Sakmann, B. 1995. Action potentials propagating back into dendrites trigger changes in efficacy of single axon synapses between layer V pyramidal neurons. In Proc. of the Conference of the Society for Neuroscience.
Mel, B. W. 1993. Synaptic integration in an excitable dendritic tree. J. Neurophysiology 70, 1086-1101.
Murray, A., and Tarassenko, L. 1994. Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall, London.
Perrett, D. I., Rolls, E. T., and Caan, W. C. 1982. Visual neurons responsive to faces in the monkey temporal cortex. Experimental Brain Research 47, 329-342.
Pratt, G. A. 1989. Pulse Computation. Ph.D. thesis, MIT.
Rieke, F., Warland, D., Ruyter van Steveninck, R., and Bialek, W. 1996. SPIKES: Exploring the Neural Code. MIT Press, Cambridge, MA.
Rolls, E. T. 1994. Brain mechanisms for invariant visual recognition and learning. Behavioural Processes 33, 113-138.
Rolls, E. T., and Tovee, M. J. 1994. Processing speed in the cerebral cortex, and the neurophysiology of visual backward masking. Proc. Roy. Soc. B 257, 9-15.
Roychowdhury, V., Siu, K., and Orlitsky, A., eds. 1994. Theoretical Advances in Neural Computation and Learning. Kluwer, Boston.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Segundo, J. P. 1994. Noise and the neurosciences: A long history, a recent revival, and some theory. In Origins: Brain and Self-Organization, K. H. Pribram, ed. Lawrence Erlbaum Associates, Hillsdale, NJ.
Sejnowski, T. J. 1995. Time for a new neural code? Nature 376, 21-22.
Shepherd, G. M. 1994. Neurobiology. 3d ed. Oxford University Press, New York.
Siu, K., Roychowdhury, V., and Kailath, T. 1995. Discrete Neural Computation. Prentice Hall, Englewood Cliffs, NJ.
Softky, W. 1994. Sub-millisecond coincidence detection in active dendritic trees. Neuroscience 58, 13-41.
Stuart, G. J., and Sakmann, B. 1994. Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature 367, 69-72.
Thorpe, S. J., and Imbert, M. 1989. Biological constraints on connectionist modelling. In Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, and L. Steels, eds., pp. 63-92. Elsevier, New York.
von Neumann, J. 1956. Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, C. E. Shannon and J. McCarthy, eds. Princeton University Press, Princeton, NJ.
Communicated by William Softky
Computing with the Leaky Integrate-and-Fire Neuron: Logarithmic Computation and Multiplication

Doron Tal
Eric L. Schwartz
Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215 USA
The leaky integrate-and-fire (LIF) model of neuronal spiking (Stein 1967) provides an analytically tractable formalism of neuronal firing rate in terms of a neuron's membrane time constant, threshold, and refractory period. LIF neurons have mainly been used to model physiologically realistic spike trains, but little application of the LIF model appears to have been made in explicitly computational contexts. In this article, we show that the transfer function of a LIF neuron provides, over a wide parameter range, a compressive nonlinearity sufficiently close to that of the logarithm so that LIF neurons can be used to multiply neural signals by mere addition of their outputs, yielding the logarithm of the product. A simulation of the LIF multiplier shows that under a wide choice of parameters, a LIF neuron can log-multiply its inputs to within a 5% relative error.

1 Introduction

In his book on Mach bands, Ratliff (1965, p. 129) remarks that "often the relation between external stimulus and neural response is approximately logarithmic." As shown in Figure 1, such a relation is also observed in simulations based on the Hodgkin-Huxley equations (Hodgkin and Huxley 1952) where a step current of varying amplitudes is applied to a single neuron. A similar compressive response is seen, for example, in M-cells (of the magnocellular stream) of visual cortex. In psychophysics, logarithmic neural transduction has been hypothesized to account for Weber's law (Cornsweet 1970; Land 1977). In the neural modeling literature, a logarithmic transfer function is also commonly hypothesized; for example, Koch and Poggio (1992) describe a mechanism for multiplying with neurons whose transfer function is logarithmic, and Yeshurun and Schwartz (1989) show that an estimate of the power spectrum of the log power spectrum, called the cepstrum in signal processing, provides a simple model for estimating stereo disparity when applied to a columnar image data structure such as the ocular dominance column system of primate visual cortex. These references to logarithmic functionality in neurons lack a generic biophysical
Neural Computation 9, 305-318 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: (a) Current-frequency curve resulting from uniform depolarization of a membrane modeled by the Hodgkin-Huxley equations, using the giant squid axon parameters (Hodgkin and Huxley 1952). (b) Semilog plot of the curve in (a).
justification for this computation on the single cell level. In this article, the leaky integrate-and-fire (LIF) model is shown to provide such a justification. Specifically, we show that the ratio of refractory period (t0) duration to membrane time constant (τ) controls the degree of compressiveness of a neuron's transfer function: a cell with small t0/τ has a quasi-linear transfer function, larger t0/τ gives a quasi-logarithmic transfer function, and yet larger t0/τ values yield a transfer function more compressive than a logarithm. The article concludes with a discussion of precision and dynamic range of LIF and other single-cell computational models. Although the LIF model used in the simulations is a highly lumped model, its few parameters correspond to measurable physical quantities. In order to demonstrate how the LIF model's parameters affect its transfer function, the simplest (four-parameter) LIF model is used, ignoring spike-train variability and population codes. (See Softky and Koch 1993 and Softky 1995 on spike-train variability. Knight 1972 and Usher et al. 1993 discuss the improved temporal dynamics of population coding.) The work of Bugmann (1991) also demonstrates two computational regimes (multiplicative and linear) for an LIF neuron, but it uses a different LIF model from the four-parameter model below. For multiplication, a coincidence detector (Srinivasan and Bernard 1976) is used and the LIF neurons include a model of both spike-train variability and the relative refractory period. Despite the difficulties with the coincidence detection multiplier discussed below, Bugmann's simulations indirectly demonstrate the same effect of t0/τ on the cell's transfer function that is shown below. This suggests that simple integrate-and-fire properties of a cell, such as its refractory period duration and the membrane time constant, continue to affect the cell's transfer
function, even in the presence of spike-train variability and of a relative refractory period model. It is worth noting that logarithmic computation in the models below is not presumed to occur in sensory transduction neurons because of the complicated machinery associated with such neurons. Rather, the models suggest that quasi-logarithmic computations could occur in interneurons of the central nervous system. For this reason, parameter values are based on physiological measurements in cat and rat pyramidal cells (see Section 3.1).

2 Properties of the LIF Transfer Function

Agin (1964) appears to have been the first to relate Hodgkin-Huxley simulations of a uniformly polarized membrane to "the logarithmic law of sensory physiology" by fitting the curve in Figure 1a to the equation f = 27 log(I + 1), where f is firing frequency and I is applied current. Although the Hodgkin-Huxley model is sufficient to produce a loglike transfer function, as is evident in Figure 1, its complex dynamics are not necessary for explaining the nature of the compressiveness. The much simpler LIF model is all that is required to account for the transfer function shown in Figure 1. As shown below, a neuron's membrane resistance and capacitance and the duration of its absolute refractory period are sufficient to account for the loglike transfer function shown in Figure 1, without a need to model the detailed dynamics of action potential generation.

2.1 The LIF Model. In the LIF model, a steady current source, I, uniformly depolarizes the membrane, causing an exponential increase in its potential, V. When the membrane potential reaches a threshold, VTh, a spike occurs (at this point we shall assume the spike to take an infinitesimal amount of time). At the spike onset, the membrane capacitance discharges instantaneously, and the membrane potential is reset to the resting potential. This firing process repeats for as long as the input current is on (see Fig. 2b). A resistor and a capacitor in parallel model the membrane's charging process (see Fig. 2a). The RC integrator of the LIF model relates membrane potential to input current via the usual exponential charging relationship (Horowitz and Hill 1989):

$$V = IR\left(1 - e^{-t/\tau}\right), \qquad (2.1)$$

where τ = RC is the membrane time constant. We now outline Stein's analysis of the leaky integrate-and-fire model. The time tisi (the time it takes the membrane to reach VTh), also called the interspike interval, can be calculated by setting V = VTh, t = tisi in equation 2.1
Figure 2: (a) RC circuit used to model charging of a neuron’s membrane from its resting potential to VTh . C corresponds to the membrane capacitance, R corresponds to the membrane resistance, and I corresponds to an excitatory input such as a current injection or a steady current produced by a train of excitatory postsynaptic potentials. t0 is the duration of the absolute refractory period. (b) Membrane voltage versus time in an RC circuit LIF model, in response to a step current. Only the suprathreshold spikes are produced in the output.
and solving for tisi:

$$t_{isi} = -\tau \ln\left(1 - \frac{V_{Th}}{IR}\right). \qquad (2.2)$$
The rheobasic current, Irh, is defined as the smallest value of current that can drive the membrane potential to VTh:

$$I_{rh} = \frac{V_{Th}}{R}. \qquad (2.3)$$
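To make equations 2.1 through 2.3 concrete, here is a minimal numerical sketch (not code from the article); the threshold and capacitance follow Table 1, while the value of τ is an arbitrary illustrative choice.

```python
import math

# Illustrative parameters in the spirit of Table 1 (tau is an assumed value).
V_th = 0.015        # membrane threshold voltage (V)
C = 6e-11           # soma capacitance (F)
tau = 0.007         # membrane time constant tau = R*C (s)
R = tau / C         # membrane resistance (ohms), as in Table 1

def v_membrane(I, t):
    """Equation 2.1: membrane potential at time t under a step current I."""
    return I * R * (1.0 - math.exp(-t / tau))

I_rh = V_th / R     # equation 2.3: rheobasic current (A)

def t_isi(I):
    """Equation 2.2: interspike interval for a suprathreshold current I."""
    if I <= I_rh:
        return float("inf")  # membrane never reaches threshold
    return -tau * math.log(1.0 - V_th / (I * R))

print(I_rh)               # about 1.3e-10 A
print(t_isi(3.0 * I_rh))  # about 2.8 ms at three rheobases
```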
Substituting equation 2.3 for I in equation 2.2, tisi can be rewritten as

$$t_{isi} = -\tau \ln\left(1 - \frac{I_{rh}}{I}\right). \qquad (2.4)$$

The cell's firing frequency f is the reciprocal of the period of each action potential. Since an action potential is considered to have infinitesimally small duration, the period of each action potential is just the interspike interval (from equation 2.4). The firing frequency as a function of current is therefore

$$f = \frac{1}{-\tau \ln(1 - I_{rh}/I)}. \qquad (2.5)$$
The LIF model described so far does not produce a compressive or quasilogarithmic current frequency relation. As current is increased in equation 2.5, the current frequency relation becomes linear. Stein has shown that equation 2.5 becomes linear for large I by expansion of its power series:

$$f = \frac{1}{\tau\left[(I_{rh}/I) + \frac{1}{2}(I_{rh}/I)^2 + \frac{1}{3}(I_{rh}/I)^3 + \cdots\right]} = \frac{1}{\tau}\left[(I/I_{rh}) - \frac{1}{2} - \frac{1}{12}(I/I_{rh})^{-1} - \cdots\right]. \qquad (2.6)$$
From equation 2.6, we see that as I grows larger than Irh, the frequency approaches the line with slope 1/τ (with I measured in units of the rheobasic current) intersecting the abscissa at 1/2. The frequency, or firing rate, is thus quasi-linear in the approximation considered here, which ignores the effect of refractory period(s) (see Fig. 3). The quasi-linear output of the simple LIF model therefore does not provide a compressive transduction function. However, the analysis of the effects of refractory behavior (both relative and absolute refractory period), which was included in the LIF analysis of Stein (1967), changes the output function of the LIF model from linear to quasi-logarithmic, in suitable parameter ranges, as follows: Let the absolute refractory period take t0 time¹; the interspike interval in equation 2.2 is then tisi + t0, and the frequency is

$$f = \frac{1}{t_0 + t_{isi}} = \frac{1}{t_0 - \tau \ln(1 - I_{rh}/I)}. \qquad (2.7)$$
1 Note that the relative refractory period—the time period after a spike during which it is difficult but not impossible to generate another action potential—is not taken into account in this model, but Stein (1967) has shown that the effect of including a relative refractory period modeled by rising and falling exponentials is qualitatively the same as that of increasing the value of parameter t0 in the current model; that is, as either the absolute refractory period or the relative refractory period is increased, the current frequency relation becomes more compressive. The parameter t0 can therefore be thought of as lumping the effects of both the absolute and relative refractory periods.
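The two regimes discussed around equations 2.5 through 2.7 can be checked directly. The sketch below is an illustration (with currents expressed in units of the rheobasic current, so Irh = 1, and the probed currents arbitrary choices): with t0 = 0 the frequency tracks the quasi-linear asymptote of equation 2.6, while a nonzero t0 bounds f at 1/t0 and compresses large inputs.

```python
import math

tau = 0.007   # membrane time constant (s), illustrative
t0 = 0.002    # absolute refractory period (s), from Table 1

def f_lif(I, t0, tau):
    """Equation 2.7 with I in units of the rheobasic current (I_rh = 1)."""
    if I <= 1.0:
        return 0.0  # subthreshold current: the neuron never fires
    return 1.0 / (t0 - tau * math.log(1.0 - 1.0 / I))

# Quasi-linear regime (t0 = 0): compare with the asymptote (I - 0.5)/tau.
for I in (4.0, 8.0, 12.0):
    print(I, f_lif(I, 0.0, tau), (I - 0.5) / tau)

# Compressive regime (t0 > 0): f is bounded by 1/t0 = 500 Hz, so doubling
# the current adds less and less to the rate, a log-like response.
print([round(f_lif(I, t0, tau)) for I in (2.0, 4.0, 8.0, 16.0)])

# Shape invariance: t0 * f depends on I only through the ratio t0/tau, so
# curves with equal t0/tau coincide after rescaling the rate axis.
print(t0 * f_lif(5.0, t0, tau), 2 * t0 * f_lif(5.0, 2 * t0, 2 * tau))
```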
Figure 3: Plot of the integrate-and-fire transfer function (solid line) with no absolute refractory period (equation 2.5 or equation 2.7 with t0 = 0). This transfer function approaches the line (I − 0.5)/τ (I is shown in units of the rheobasic current). Thus if the time constant of a neuron is much larger than the refractory period duration, the transfer function of the cell is quasi-linear. τ is 0.007; all other parameter values are shown in Table 1.
Table 1: Parameter Values Used in LIF Multiplication and in Figures 3 and 4.

Parameter   Value            Units
t0          0.002            s
τ           varies           s
Vth         0.015            V
C           6 · 10^-11       F
R           τ/C, varies      Ω
Irh         Vth/R, varies    A

t0 = absolute refractory period; τ = time constant (t0/τ is varied in the simulations by keeping t0 constant and varying τ); Vth = membrane threshold voltage; C = soma capacitance; R = soma membrane resistance; Irh = rheobasic current.
Note that as I grows arbitrarily, f in equation 2.5 grows without bound, while in equation 2.7, f remains bounded at 1/t0 . This boundedness compresses high I values and yields a log-like response. The larger t0 is in equation 2.7, the more f is compressed. The current frequency relation expressed by equation 2.7 maintains the same shape when the ratio t0 /τ remains constant. Figure 4a shows plots of equation 2.7 for different ratios t0 /τ . Of the curves in Figure 4b, we see that the one whose corresponding t0 /τ is between 0.1 and 0.2 best approximates a log, since it produces the most linear curve on a semilog plot. One critical issue is the goodness of fit of the LIF transfer function to the logarithm. We have taken the approach of defining goodness of fit in terms of the operation of multiplication. In other words, if the compressive nonlinear transduction function supports an accurate multiplication model, then we deem it to be a good quasi-logarithmic function. In the following section, an LIF neuron with biologically reasonable parameter values is quantitatively shown to multiply its inputs to within a 5% relative error. In summary, we conclude that the log-like response of neurons depends on two properties of the LIF model: 1. The neuron’s time constant (affecting integrative behavior). 2. The existence of a refractory period (affecting rate saturation). We can now see why the Hodgkin-Huxley equations produce a log-like behavior, according to Agin’s original observation. The underlying passive circuit in the Hodgkin-Huxley model is an RC circuit with an active component—voltage-dependent conductances—that corresponds to the LIF model’s firing mechanism. Although it is difficult to make general statements about the detailed behavior of the Hodgkin-Huxley dynamics (due to its mathematical complexity), it appears that the logarithmic behavior Agin (1964) observed, and replicated in Figure 1, is accounted for by the passive integrative component alone; it is captured by the LIF model. In any event, the LIF model is sufficient to account for Agin’s observation of the logarithmic behavior of the Hodgkin-Huxley model. 3 LIF Multiplication Many arguments have been proposed for the usefulness of multiplication in the nervous system. Functionally the product of two signals produces an analog measure of their correlation or, in boolean terms, their conjunction, implemented by the AND operator. Barlow (1969) argues that selectivity and generalization in the visual system require multiplication and summation operations. The Reichardt motion detector (Reichardt et al. 1989) relies on a multiplicative operation. (For a review of neural uses and neural models of multiplication see Koch and Poggio 1992.) A simple scheme for neural multiplication mentioned by Koch and Poggio (1992) involves neurons with
Figure 4: (a) Plots of equation 2.7 for different values of t0 /τ (key shows t0 /τ ). (b) Semilog plot of the curves in panel (a). Parameter values are shown in Table 1.
a logarithmic transfer function that excite or inhibit each other additively or subtractively. Under these assumptions, if x1 is the input to neuron A and x2 is the input to neuron B and the outputs of both neurons A and B are added, then the sum is

$$\log(x_1) + \log(x_2) = \log(x_1 x_2), \qquad (3.1)$$
which is a multiplicative relation. Thus, multiplication and division can be implemented by adding and subtracting outputs of neurons with a logarithmic transfer function. In order to evaluate to what extent, and over what parameter ranges, the transfer function of a neuron is logarithmic, the multiplication model of equation 3.1 was simulated using a single LIF neuron receiving two simultaneous current injections arriving from two LIF cells. A log product provides the functionality of multiplication or correlation simply because it is a monotonic function of the product. The analysis below uses the multiplication model to demonstrate that a quasi-logarithmic computation in neurons can occur accurately across a biologically reasonable t0 /τ range. 3.1 Parameter Choices. Since the interspike interval does not vary in this model and inputs are held constant through time, the equilibrium solution (see equation 2.7) is used. To solve equation 2.7, four parameters need to be specified: t0 , τ , Vth , and C. Parameter choices are based on physiological values observed in cortical pyramidal cells of the rat and the cat, summarized and derived in Bugmann (1991). The input injection current, Iinj , varies over a realistic range of current injection: [1Irh , 13Irh ] amperes, with threshold potential, Vth = 15 mV (see Wolfson et al. 1989). Vth lies between the axon hillock threshold and the soma fast sodium conductance threshold (Schwindt and Crill 1982). The capacitance, C = 6 · 10−11 , agrees with published values of the neural membrane capacitance (1-2µF cm−2 ) (Jack 1979; Kawato et al. 1984) and surface area of the soma of pyramidal neurons (200–700µ m2 ) (Hillman 1979). The refractory period, t0 = 2 ms, is comparable to that used in other biologically motivated simulations of LIF neurons (see Harmon 1961; Amit et al. 1990). Table 1 summarizes the parameter values and derived constants in the model. 3.2 Precision and Dynamic Range. The firing rate of neurons typically extends from a few hertz to a few hundred hertz. The utility of any numerical function dependent on firing rate is thus limited to this range. We have examined the precision of a multiplier that is constructed from two summed quasi-logarithmic LIF neurons. We used the LIF transfer function (cf. equation 2.7) for a range of different t0 /τ (as plotted in Figure 4) and defined the error by “multiplying” several thousand combinations of input currents. This was done by summing the corresponding “quasi-log” firing rates using the LIF curve and comparing this method of multiplication with
the correct numerical answer. We defined error, δ, by the average relative error, that is, as

$$\delta = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| L(a_i b_i) - a_i b_i \right|}{a_i b_i}, \qquad (3.2)$$
where ai and bi are two "input" currents to be multiplied. The estimate of their product, ai bi, is obtained by summation of the "quasi-logarithmic" LIF transfer function as follows: Let fLIF(I) be the frequency-current relation expressed by equation 2.7. We first generated a number of random current value pairs {ai, bi}. Multiplication of each pair was performed by the equation

$$f_{LIF}^{-1}\left(f_{LIF}(a) + f_{LIF}(b)\right), \qquad (3.3)$$
where fLIF is equation 2.7 and fLIF^−1 is its inverse. Using the inverse function permits the unit of the multiplier (current squared) to be the same unit as the actual product of currents, ai bi, and thus the degree of correlation between the two products can be determined. The products obtained by equation 3.3 are then fitted using linear interpolation to the numerically correct products, ai bi, obtaining an equation for a straight line, L(I).2 A new set of LIF products is then generated, and the LIF products are interpolated via L(ai bi) and then subtracted from ai bi, as in equation 3.2 (a code sketch of this procedure is given at the end of this section). Equation 3.2 provides a measure of the average deviation from a perfect multiplication, as a fraction of the product itself. Figure 5a shows the average relative error of the LIF products as a function of t0/τ. This error, expressed as a percentage, is around 5% for 0.13 < t0/τ < 0.23, indicating that in this input range, the "quasi-logarithmic" LIF transfer function is useful as an approximate four-bit multiplier. Figure 5b shows the data points used for the error estimate for the case of t0/τ = 0.2 in Figure 5a. The timing constraints of the LIF model above can be considered by measuring the time to equilibrium of the LIF differential equation as it is iterated in time. As shown in Figure 6, the voltage of a LIF neuron reaches equilibrium (that is, the membrane capacitance saturates) after about 25 ms, across a range of biologically reasonable injection current levels. This means that the multiplication model above can work only when the input to a cell
2 Note that this linear regression is used simply to rescale the output in order to compare it to mathematically correct multiplication. In neuronal terms, this implies a gain and offset at the summing neuron that would need to be specified if actual numerical multiplication were desired in a particular model. However, without this, the simple model of summing of the output of two “logarithmic” neurons provides a result proportional (with an offset) to the desired numerical (log) product over the entire dynamic range of the neuron.
Figure 5: (a) A plot of equation 3.2, the average relative error of the LIF products as a function of t0/τ. N, the number of samples used for each point, is 10,000 (see text for details of this simulation). (b) A sample of 100 data points used to obtain the sixth point from the left in panel (a). t0/τ for this point is 0.2, and the relative error in multiplication is approximately 5%. Both ordinate and abscissa in (b) have units of I²rh.
Figure 6: An integrate-and-fire neuron’s response over time to varying constant input values. Equilibrium is reached in approximately 25 ms, meaning that for logarithmic computation to occur to within a 5% accuracy, the input’s frequency cannot exceed 40 Hz, a reasonable requirement for cortical neurons.
varies temporally with an average period on the order of 25 ms. This time resolution corresponds to a maximum input frequency of 40 Hz that can be resolved by the LIF model presented above. A known model of multiplication that takes into account spike timing is the coincidence detector (Srinivasan and Bernard 1976) (this model is also the multiplicative mechanism used in Bugmann's 1991 LIF model). One difficulty with coincidence detection is that for low spike rates modeled by a Poisson distribution, its accuracy grows as the logarithm of the time window used to count coincidences. Thus, the activity of a single-cell coincidence detector whose inputs spike at a mean rate of 20 Hz would have to be integrated for approximately 20 seconds in order to achieve a 5% accuracy in its multiplication estimate. It has been argued, however, that the coincidence detector could be functioning on the faster temporal scale of dendritic summation (Softky 1995). (See also the reply by Shadlen and Newsome (1995) in the same issue, which analyzes the temporal summation properties of LIF coincidence detectors.)
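The error-measurement procedure of equations 3.2 and 3.3 can be sketched as follows. This is an independent reconstruction, not the authors' code: it sums two quasi-log rates from equation 2.7, applies the closed-form inverse of the transfer function in place of fLIF^−1, rescales with a least-squares line playing the role of L (footnote 2), and evaluates the average relative error on a fresh sample. One caveat of the sketch: the summed rate must stay below the saturation rate 1/t0 for the inverse to exist, so the input range here is restricted to [1.5, 4] rheobases rather than the paper's [1, 13].

```python
import math
import random

tau, t0 = 0.01, 0.002   # t0/tau = 0.2, inside the best quasi-log range

def f_lif(I):
    """Equation 2.7, with currents in units of the rheobase (I > 1)."""
    return 1.0 / (t0 - tau * math.log(1.0 - 1.0 / I))

def f_lif_inv(y):
    """Closed-form inverse of equation 2.7, valid for 0 < y < 1/t0."""
    return 1.0 / (1.0 - math.exp((t0 - 1.0 / y) / tau))

def sample(n):
    # Inputs restricted so that f(a) + f(b) < 1/t0 (see the caveat above).
    return [(random.uniform(1.5, 4.0), random.uniform(1.5, 4.0))
            for _ in range(n)]

def lif_product(a, b):
    """Equation 3.3: the LIF estimate of the product a*b."""
    return f_lif_inv(f_lif(a) + f_lif(b))

random.seed(0)

# Fit the straight line L that maps LIF products onto true products
# (the gain and offset of footnote 2).
fit = sample(2000)
xs = [lif_product(a, b) for a, b in fit]
ys = [a * b for a, b in fit]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
offset = my - slope * mx

# Equation 3.2: average relative error on a new set of pairs.
test = sample(2000)
delta = sum(abs(slope * lif_product(a, b) + offset - a * b) / (a * b)
            for a, b in test) / len(test)
print(delta)  # a few percent, consistent with a quasi-log transfer function
```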
4 Conclusion This work examined the nature of the observed compressive neuronal transfer function. A simple LIF account with four physical parameters was shown to account for a logarithmic current frequency relation. The single dimensionless parameter t0 /τ was shown to control the shape of a LIF neuron’s current frequency transfer function, changing it from linear (for small t0 /τ ) to quasi-logarithmic (for larger t0 /τ ) under biologically reasonable parameter ranges. The degree to which a cell’s transfer function approximates a logarithm was measured at different values of t0 /τ by computing the deviation of LIF−1 {LIF(A) + LIF(B)} from the product AB, using equation 3.2. The transfer function was shown to be approximately logarithmic (to within 5%) over a biologically reasonable and robust t0 /τ range. Acknowledgments The work for this article was supported by NIMH 5R01MH45969-04 and ONR N00014-95-1-0409. References Agin, D. 1964. Hodgkin-Huxley equations: Logarithmic relation between membrane current and frequency of repetitive activity. Nature 201, 625–626. Amit, D. J., Evans, M. R., and Abeles, M. 1990. Attractor neural networks with biological probe records. Network 1, 381–405. Barlow, H. B. 1969. Trigger features, adaptation and economy of impulses. In Information Processing in the Nervous System, K. N. Leibovic, ed., pp. 209–230. Springer-Verlag, Berlin. Bugmann, G. 1991. Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network 2, 489–509. Cornsweet, T. N. 1970. Visual Perception. Academic Press, New York. Harmon, L. D. 1961. Studies with artificial neurons, I: Properties and functions of an artificial neuron. Kybernetik 3, 89–101. Hillman, D. E. 1979. Neural shape parameters and substructures as a basis of neuronal form. In The Neurosciences (4th Study Program), F. O. Schmitt and F. G. Worden, eds., pp. 432–437. MIT Press, Cambridge, MA. Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology 117, 500–544. Horowitz, P., and Hill, W. 1989. The Art of Electronics. Cambridge University Press, Cambridge. Jack, J. 1979. An introduction to linear cable theory. In The Neurosciences (4th Study Program), F. O. Schmitt and F. G. Worden, eds., pp. 423–437. MIT Press, Cambridge, MA. Kawato, M., Hamaguchi, T., Murakami, F., and Tsukahara, N. 1984. Quantitative analysis of electrical properties of dendritic spines. Biol. Cybern. 50, 447–454.
Knight, B. W. 1972. Dynamics of encoding in a population of neurons. Journal of General Physiology 59, 734–766. Koch, C., and Poggio, T. 1992. Multiplying with synapses and neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 315– 345. Harcourt Brace Jovanovich, San Diego, CA. Land, E. H. 1977. The retinex theory of color vision. Scientific American 237, 108–128. Ratliff, F. 1965. Mach Bands: Quantitative Studies on Neural Networks in the Retina. Holden-Day, New York. Reichardt, W., Egelhaaf, M., and Ai-ke, G. 1989. Processing of figure and background motion in the visual system of the fly. Biol. Cybern. 61, 327–345. Schwindt, P. C., and Crill, W. E. 1982. Factors influencing motorneuron rhythmic firing: Results from a voltage clamp study. J. Neurophysiol. 64, 206–224. Shadlen, M. N., and Newsome, W. T. 1995. Is there a signal in the noise? Current Opinion in Neurobiology 5, 248–250. Softky, W. R. 1995. Simple codes versus efficient codes. Current Opinion in Neurobiology 5, 239–247. Softky, W. R., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience 13, 334–350. Srinivasan, M. V., and Bernard, G. D. 1976. A proposed mechanism for multiplication of neural signals. Biol. Cybernetics 21, 227–236. Stein, R. B. 1967. The frequency of nerve action potentials generated by applied currents. Proceedings of the Royal Society B 167, 64–86. Usher, M., Schuster, H. G., and Niebur, E. 1993. Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Computation 5, 570–586. Wolfson, B., Gutnick, M. J., and Baldino, F. J. 1989. Electrophysiological characteristics of neurons in neocortical explant cultures. Exp. Brain Res. 76, 122–130. Yeshurun, Y., and Schwartz, E. L. 1989. Cepstral filtering on a columnar image architecture: A fast algorithm for binocular stereo segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 11(7), 759–767.
Received August 1, 1995; accepted May 2, 1996.
Communicated by Wolfram Gerstner
Effect of Delay on the Boundary of the Basin of Attraction in a Self-Excited Single Graded-Response Neuron

K. Pakdaman
INSERM U444 ISARS, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France
C. P. Malta
Instituto de Física, Universidade de São Paulo, CP 66318, 05389-970 São Paulo, Brazil
C. Grotta-Ragazzo
Instituto de Matemática e Estatística, Universidade de São Paulo, CP 66281, 05389-970 São Paulo, Brazil
J.-F. Vibert
INSERM U444 ISARS, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France
Little attention has been paid in the past to the effects of interunit transmission delays (representing axonal and synaptic delays) on the boundary of the basin of attraction of stable equilibrium points in neural networks. As a first step toward a better understanding of the influence of delay, we study the dynamics of a single graded-response neuron with a delayed excitatory self-connection. The behavior of this system is representative of that of a family of networks composed of graded-response neurons in which most trajectories converge to stable equilibrium points for any delay value. It is shown that changing the delay modifies the “location” of the boundary of the basin of attraction of the stable equilibrium points without affecting the stability of the equilibria. The dynamics of trajectories on the boundary are also delay dependent and influence the transient regime of trajectories within the adjacent basins. Our results suggest that when dealing with networks with delay, it is important to study not only the effect of the delay on the asymptotic convergence of the system but also on the boundary of the basins of attraction of the equilibria. 1 Introduction The time it takes for a signal to be transmitted from one neuron to another, referred to here as delay, can influence the behavior of neural network models. For example, in spiking neuron models, the firing pattern of single units as well as the coherence between the neurons’ discharge may drastically change as the delay is varied across critical values (an der Heiden 1981; Plant 1981; Vibert et al. 1994; Gerstner 1995; Pakdaman et al. 1996). In artificial Neural Computation 9, 319–336 (1997)
© 1997 Massachusetts Institute of Technology
neural network (ANN) applications, it has been shown that implementing delays in discrete-time networks made of binary units enables them to store time-varying sequences (Sompolinsky and Kanter 1986; Herz et al. 1988). In some ANN applications using analog graded-response neuron (GRN) models, delays arising in hardware implementation may cause the network performance to deteriorate. For example, in a content-addressable memory network, stable equilibrium points are used for storing information (Hopfield 1984; Hirsch 1989), and delays due to hardware implementation may render a locally stable equilibrium point unstable, thus causing loss of the corresponding stored information (Marcus and Westervelt 1989). Such considerations have motivated a number of studies on the asymptotic behavior of GRN networks with delay (Babcock and Westervelt 1987; Marcus and Westervelt 1989; Burton 1991, 1993; Marcus et al. 1991; Roska and Chua 1992; Roska et al. 1992, 1993; Bélair 1993; Civalleri et al. 1993; Gilli 1993; Gopalsamy and He 1994a, 1994b; Ye et al. 1994, 1995; Finnochiaro and Perfetti 1995; Gilli 1995; Gopalsamy and Leung 1996). Studies of the local stability of the equilibria have developed criteria to avoid delay-induced instabilities in some networks (Marcus and Westervelt 1989; Bélair 1993). Constraints on network parameters have also been provided for all, or almost all, trajectories to converge to stable equilibrium points in networks with delay (Burton 1991, 1993; Roska and Chua 1992; Roska et al. 1992, 1993; Bélair 1993; Civalleri et al. 1993; Gopalsamy and He 1994a, 1994b; Ye et al. 1994, 1995). One of the purposes of such studies is to ensure that the asymptotic behavior of the network with delay is similar to that of the corresponding network without delay. However, even in such (almost) convergent networks, important features in the system's dynamics may still depend on the delay value. For instance, changing the delay can alter the boundary of the basins of attraction of the stable equilibrium points. This can be of prime importance in a content-addressable memory network: the position of the basin boundaries determines in which basin given information falls. Thus, changing the shape of the basin boundaries alters the classification of initial conditions performed by the network. We study how the delay may alter the basin boundary in a network constituted by a single GRN with a delayed excitatory self-connection. This system has been chosen because:
• It is simple enough so that thorough theoretical and numerical analysis of its dynamics is possible.
• The asymptotic behavior of the system is similar to that of the corresponding system without delay in the sense that the local asymptotic stability of the equilibria is not affected by the delay and most trajectories converge to stable equilibrium points for any value of the delay, allowing a focus on the effect of delay on the basin boundary.
• The results are representative of the influence of delay on basin boundaries in larger networks.
In Section 2, we present the neuron model. The stationary regime is described in Section 3. The boundary between the basins of attraction of two locally asymptotically stable equilibrium points is estimated for different delay values in Section 4. The behavior of solutions on the boundary and their influence on the transient regime of the system are presented in Section 5. Generalizations of the results are discussed in Section 6. Mathematical aspects are left to the appendices.

2 The Graded-Response Neuron Model

The GRN is described by its activation at time t, denoted a(t), and a sigmoidal output function σ(a). A decay rate γ is also implemented in the model (Cowan and Ermentrout 1978; Hopfield 1984; Pasemann 1993). We consider a neuron that has a delayed self-connection with strictly positive weight W and delay A. The neuron receives a constant input K. The neuron activation evolves according to the following delay differential equation (DDE):

$$\frac{da}{dt}(t) = -\gamma\, a(t) + K + W\sigma(a(t - A)), \qquad (2.1)$$

where the neuron transfer function σ is defined by

$$\sigma(a) = \frac{1}{1 + e^{-a}}. \qquad (2.2)$$
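For readers who want to reproduce the trajectories, DDE 2.1 can be integrated with a basic fixed-step Euler scheme in which the delayed term is read from a history buffer. This is an illustrative sketch, not the authors' code; the step size and time horizon are arbitrary choices, and the parameters are those used in the figures (γ = 1, W = 6, K = −3).

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def integrate_dde(phi, A, T, gamma=1.0, W=6.0, K=-3.0, dt=0.001):
    """Euler integration of da/dt = -gamma*a + K + W*sigma(a(t - A)).

    phi: initial function on [-A, 0]; returns the trajectory on [0, T].
    """
    n_delay = int(round(A / dt))
    # History buffer holding a(t) sampled every dt over the last A seconds.
    buf = [phi(-A + k * dt) for k in range(n_delay + 1)]
    traj = [buf[-1]]
    for _ in range(int(T / dt)):
        a_now = buf[-1]
        a_delayed = buf[-(n_delay + 1)]     # value at t - A
        a_next = a_now + dt * (-gamma * a_now + K + W * sigma(a_delayed))
        buf.append(a_next)
        buf.pop(0)                          # slide the window forward
        traj.append(a_next)
    return traj

# Figure 2a's experiment: the same function of rescaled time t/A, solved for
# two delays, may settle near different equilibria (x1 and x3 are
# approximately -2.58 and 2.58 for these parameters; x2 = 0 is unstable).
short = integrate_dde(lambda t: math.sin(4 * math.pi * t) - 0.02, A=0.5, T=40.0)
long_ = integrate_dde(lambda t: math.sin(math.pi * t) - 0.02, A=2.0, T=40.0)
print(short[-1], long_[-1])
```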
Here, the initial condition of the system is constituted by the history of the neuron activation during a time interval corresponding to the delay. Thus, initial conditions for DDE 2.1 are the continuous functions φ defined on the interval [−A, 0] of length equal to the delay. Since changing the delay alters this interval, we need to specify how to compare initial conditions for a given delay with those for another value. We present two methods. Using the first method, an initial condition defined for a given delay is restricted to a shorter interval and thus defines an initial condition for a shorter delay. The second method consists of rescaling the time unit to the delay, so that for all delays, initial conditions are defined on the same interval of unit length [−1, 0]. Thus DDE 2.1 is transformed into:

$$\frac{1}{\tau}\frac{d\hat{a}}{dt'}(t') = -\hat{a}(t') + K' + W'\,\sigma(\hat{a}(t' - 1)), \qquad (2.3)$$
where t' = t/A, â(t') = a(t), and the parameters have been renamed to τ = γA, K' = K/γ, and W' = W/γ.

3 The Stationary Regime

The asymptotic behavior of DDE 2.1 is analyzed in Appendix A. The results can be summarized as follows: We define the gain as g = W/4 and the
dissipation as γ, the activation decay rate. For g < γ, there is one globally asymptotically stable point, noted x0. For g > γ, there are two input values K− and K+ defined as:

$$K_-(\gamma, W) = \gamma\,\mathrm{Log}\!\left(\frac{W - 2\gamma + \sqrt{W(W - 4\gamma)}}{2\gamma}\right) - \frac{W + \sqrt{W(W - 4\gamma)}}{2},$$

$$K_+(\gamma, W) = -\gamma\,\mathrm{Log}\!\left(\frac{W - 2\gamma + \sqrt{W(W - 4\gamma)}}{2\gamma}\right) - \frac{W - \sqrt{W(W - 4\gamma)}}{2}, \qquad (3.1)$$
on two parameters (φα,ω (t)) for which the basin boundary β(α, ω) = b1 (φα,ω ) is a two-dimensional surface in the (α, ω, c) parameter space. The following two examples illustrate the method. In Figure 1, numerical and theoretical estimates of the boundary between the two basins for special families of initial conditions are presented. Figure 1a shows estimates of the basin boundary for affine functions φα (t) = αt, for −1 ≤ t ≤ 0. For a given delay A, we solve the rescaled equation (2.3) with the initial condition φα +c. We note βr (α) the value of β obtained for this family of initial conditions. It has been estimated for two different delays: A = 0.5 (dashed lines), A = 15 (solid lines). The thick lines were obtained by solving numerically the DDE, and the thin lines result from the theoretical approximation (see equation A.3). This approximation of βr (α) is a linear function. It fits the numerical result over some interval of α around 0. However, the length of this interval shrinks as the delay is decreased. For α = 0, the initial condition is constant, and all four graphs of β go through βr (0) = x2 as expected. DDE 2.1 preserves the order of initial conditions, so that βr (α) is an increasing function of α, and its graph has a positive slope. For delays close to zero, the slope tends to zero, and the graph of βr (α) is close to the straight horizontal line going through x2 = 0. This slope increases with the delay. The basin boundary can also be estimated for a restricted (if A < 1) or extended (if A > 1) initial condition defined by φα (t) = αt for −A ≤ t ≤ 0. We denote by βe (α) the value obtained in this way. Then we have: βe (Aα) = βr (α).
(4.1)
This relation allows us to obtain an estimate of the basin boundary for the first method of comparison—restriction—from the results presented in the previous paragraph. Figures 1b and 1c show estimates of the basin boundary for sine initial conditions (φα,ω (t) = α sin(ωt), −1 ≤ t ≤ 0). For each value of (α, ω), βr (α, ω) is shown for delays A = 0.5 (Fig. 1b1) and A = 5 (Fig. 1b2). Both theoretical approximations and the results of the numerical resolution of the DDE are represented for each delay. In both figures, the theoretical estimation matches the numerical one for low-amplitude sine functions (α close to zero). For a fixed ω, βr (α, ω) is a monotonous function of α that either increases or decreases from zero as α increases from zero. For a fixed α, the boundary displays oscillations that are damped as ω increases (Fig. 1c). The main difference due to the change in the delay is in the amplitude of the “waves” of the boundary surface. For the sake of brevity, only basin boundaries for rescaled initial conditions were presented. For restricted or extended initial conditions, the results are similar. In fact there is a relation similar to equation 4.1 linking the basin’s boundary of rescaled sine initial conditions to that of extended or restricted sine initial conditions.
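Two pieces of this and the previous section lend themselves to a direct numerical check: the bistable input window of equation 3.1, and the bisection on the bias c that locates b1(φ) for a family member, since trajectories of φ + c converge to x1 for c < b1(φ) and to x3 for c > b1(φ). The sketch below is an illustration, not the authors' code: it reuses the integrate_dde helper defined after equation 2.2, and the bracket, horizon, and iteration counts are arbitrary choices.

```python
import math

def K_window(gamma, W):
    """Equation 3.1: the input window (K-, K+) of bistability (W/4 > gamma)."""
    D = math.sqrt(W * (W - 4.0 * gamma))
    log_term = gamma * math.log((W - 2.0 * gamma + D) / (2.0 * gamma))
    return log_term - (W + D) / 2.0, -log_term - (W - D) / 2.0

K_minus, K_plus = K_window(1.0, 6.0)
print(K_minus, K_plus)   # about -3.415 and -2.585: K = -3 is in the window

def boundary_bias(phi, A, c_lo=-5.0, c_hi=5.0, T=60.0, iters=40):
    """Bisect on the bias c to estimate b1(phi) for gamma=1, W=6, K=-3.

    Off the boundary, the trajectory of phi + c ends near x1 (about -2.58)
    or x3 (about 2.58); the unstable point x2 = 0 separates the outcomes.
    """
    def reaches_x3(c):
        return integrate_dde(lambda t: phi(t) + c, A, T)[-1] > 0.0
    for _ in range(iters):
        c_mid = 0.5 * (c_lo + c_hi)
        if reaches_x3(c_mid):
            c_hi = c_mid
        else:
            c_lo = c_mid   # still in the basin of x1: raise the bias
    return 0.5 * (c_lo + c_hi)

# beta_e(alpha) for the extended affine family phi_alpha(t) = alpha * t on
# [-A, 0]; equation 4.1 relates this to the rescaled boundary beta_r.
print(boundary_bias(lambda t: 2.0 * t, A=0.5))
```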
5 Oscillations on the Boundary The boundary separating the two basins of attraction is formed by the solutions that oscillate indefinitely around x2 ; that is, φ belongs to the boundary if and only if for all t ≥ 0, there is θ with −A ≤ θ ≤ 0 such that a(t + θ ) = x2 .
For small delays, these oscillatory solutions are damped to x2; for large delays, they are either damped to x2 or asymptotically periodic. Such oscillations are due to the delay. Although they are unstable, they have an important influence on the behavior of the system. Solutions close to the boundary display transient oscillations before converging to one of the two stable equilibria x1 or x3 (Fig. 2a). Such transient oscillations are observed for the solution of any initial condition φ, which crosses x2 (i.e., the function φ − x2 changes sign on [−A, 0]). As the delay is increased, the duration of the transient oscillations increases exponentially with the delay (compare Fig. 2a with Fig. 2b). The transient oscillations and the lengthening of their duration with the delay are due to the fact that for large delays, solutions of DDE 2.1 behave transiently as the corresponding solutions of the following difference equation (Sharkovsky et al. 1993):

$$a(t + A) = \frac{W}{\gamma}\,\sigma(a(t)) + \frac{K}{\gamma}. \qquad (5.1)$$
In contrast with DDE 2.1, the difference equation 5.1 has (infinitely many) nonconstant periodic solutions with nonnegligible basins of attraction for W > 4γ and K− < K < K+. Thus for large enough delays, solutions of DDE 2.1, corresponding to initial conditions lying in the basin of attraction of a periodic solution of the difference equation 5.1, are transiently attracted to this nonconstant periodic solution before escaping toward one of the two stable equilibria.
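The mechanism can be previewed by iterating equation 5.1 directly: for the continuous-time difference equation, every point of an interval of length A evolves independently under the map a ↦ (W/γ)σ(a) + K/γ, so an initial profile that crosses x2 settles onto a nonconstant, A-periodic square-wave-like profile alternating between x1 and x3. A minimal sketch (not the authors' code) with the figures' parameters:

```python
import math

def g(a, gamma=1.0, W=6.0, K=-3.0):
    """The map of equation 5.1: a(t + A) = (W/gamma)*sigma(a(t)) + K/gamma."""
    return (W / gamma) * (1.0 / (1.0 + math.exp(-a))) + K / gamma

# Iterate one interval [0, A) of an initial profile pointwise: each sample
# evolves independently under g, so the profile tends to a square wave that
# alternates between the stable equilibria x1 and x3 (about -2.58 and 2.58).
profile = [math.sin(2.0 * math.pi * k / 100.0) - 0.02 for k in range(100)]
for _ in range(20):
    profile = [g(a) for a in profile]
print(min(profile), max(profile))  # close to x1 and x3: a nonconstant
                                   # periodic solution of equation 5.1
```

6 Generalization

Networks in which there is at least one directed path connecting any given neuron to any other one are referred to as irreducible. Understanding the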
Figure 1: Facing page. The boundary separating the basins of attraction. (a) The boundary βr(α) of the basin of attraction for affine initial conditions with slope α and bias c, φα(t) = αt + c (−1 ≤ t ≤ 0), is shown for A = 0.5 (dashed line) and A = 15 (solid line). The thick lines correspond to the numerical result, and the thin lines correspond to the theoretical approximation (equation A.3). Abscissae: slope α, ordinates: bias c. (b) The boundary βr(α, ω) of the basin of attraction for sine initial conditions with amplitude α, angular velocity ω, and bias c: φα,ω(t) = α sin(ωt) + c (−1 ≤ t ≤ 0) is shown for A = 0.5 (1) and A = 5 (2). (c) Cross sections at α = 5 of β(α, ω). The thick lines correspond to the numerical result, and the thin lines correspond to the theoretical approximation (equation A.3). Solid lines for the long delay (A = 5) and dashed lines for the short delay (A = 0.5). Abscissae: angular velocity ω, ordinates: bias c. Parameters used: γ = 1, W = 6, K = −3.
Figure 2: Examples of trajectories. (a) Temporal evolution of the solution going through the initial condition: φ(t) = sin(4πt) − 0.02 for −0.5 ≤ t ≤ 0 (thin line) for a delay A = 0.5 and ψ(t) = sin(π t) − 0.02 for −2 ≤ t ≤ 0 (thick line) for a delay A = 2. Both initial conditions correspond to the same function sin(2π t) − 0.02 when the time unit is rescaled to the delay. The solutions tend to different equilibrium points depending on the delay value. Abscissa: time t; ordinates: activation a. (b) Temporal evolution of the solution going through the initial condition φ(t) = sin(0.2π t)−0.02 for −10 ≤ t ≤ 0, for a delay A = 10. This solution exhibits long-lasting transient oscillations before stabilizing close to the stable equilibrium point. Abscissa: time t (note the difference in the abscissa scale between A and B); ordinates: activation a. Parameters used: γ = 1, W = 6, and K = −3.
dynamics of such networks is important because they form the building blocks of other networks, such as cascades, and under appropriate conditions, the asymptotic behavior of a network can be derived from that of its constituting irreducible networks in cascade (Hirsch 1989). The results presented in Section 3 for the single self-exciting GRN can be generalized to excitatory irreducible GRN networks with delay. Details of the generalization are presented in Pakdaman et al. (1995c). The dynamics of an n neuron network evolve according to the following equations:
$$\frac{da_i}{dt}(t) = -\gamma\, a_i(t) + K_i + \sum_{j=1}^{n} W_{ij}\,\sigma_j(a_j(t - A_{ij})), \qquad 1 \le i \le n. \qquad (6.1)$$
The neuron transfer functions are taken to be sigmoidal, that is, σi is a smooth, bounded, strictly increasing function, such that there is a unique point pi such that σi''(pi) = 0. The connection matrix W = [Wij] is assumed positive (Wij ≥ 0 for all i, j) and irreducible; it does not leave invariant any proper nontrivial subspace generated by a subset of the standard basis vectors for Rn. We define the gain g as the largest (real) eigenvalue of the n × n matrix [Wij σj'(pj)] and the dissipation γ in the same way as for the single neuron. When g < γ, the network is globally asymptotically stable. When both g > γ and the absolute value of the input received by each unit in the network is large enough, the network is globally asymptotically stable. When g > γ, depending on the input, there can be several equilibrium points. In such cases, the basin boundaries are a finite number of hypersurfaces (codimension one manifolds) that partition the phase space into disjoint regions corresponding to the basins of attraction of the locally stable equilibria. The boundaries form the "upper" or "lower" bounds of a given basin of attraction; in the same way as for the single self-exciting neuron, the boundary is the upper bound of the basin of attraction of x1 and the lower bound of the basin of attraction of x3. The boundaries contain necessarily unstable points, and the trajectories that oscillate around them. Similar results are also expected when the input K is no longer constant and is allowed to vary with time, as argued in Section B.1. The influence of delay on network dynamics has also been investigated in networks composed of a different version of the GRN, usually referred to as the "shunting model" (Destexhe and Gaspard 1993; Destexhe 1994; Houweling 1994; Lourenço and Babloyantz 1994). The results presented in this article can be adapted to the dynamics of a single shunting model neuron with a delayed excitatory self-connection receiving a constant input (see Section B.2).
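As an illustration of the gain condition (not code from the article), the gain g of a small excitatory network can be computed as the largest real eigenvalue of the matrix [Wij σj'(pj)]; for the logistic σ of equation 2.2, the inflection point is pj = 0 with σ'(0) = 1/4. The example connection matrix below is an arbitrary choice.

```python
import numpy as np

def network_gain(W, sigma_prime_at_p):
    """Largest real eigenvalue of [W_ij * sigma_j'(p_j)], i.e., the gain g."""
    J = W * sigma_prime_at_p[np.newaxis, :]  # scale column j by sigma_j'(p_j)
    return max(ev.real for ev in np.linalg.eigvals(J))

# Example: a 3-neuron irreducible excitatory ring with logistic transfer
# functions, for which sigma_j'(p_j) = 1/4 at the inflection point p_j = 0.
W = np.array([[0.0, 6.0, 0.0],
              [0.0, 0.0, 6.0],
              [6.0, 0.0, 0.0]])
g = network_gain(W, np.full(3, 0.25))
print(g)   # 1.5 here; with gamma = 1 the condition g > gamma allows
           # multiple equilibria, as for the single self-exciting neuron
```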
7 Discussion and Conclusion

Dynamical systems with delayed feedback evolve in infinite-dimensional phase spaces and therefore may display complex behavior that has no counterpart in corresponding systems without delay. For example, scalar systems with delayed feedback can display complex dynamics depending on the nature of the feedback, which may be positive, negative, or mixed (an der Heiden and Mackey 1982; Malta and Grotta-Ragazzo 1991), and on the number of feedback loops (Glass and Malta 1990; Grotta-Ragazzo and Malta 1992). Moreover, the basin boundary in such systems can have an intricate structure (Losson et al. 1993). GRN neural networks with delayed interaction are also known to display delay-induced complex chaotic behavior (Marcus et al. 1991; Gilli 1993, 1995). In order to avoid such phenomena, the network parameters, such as the neurons' gain, the connection weights, and the delays, can be restricted to ranges in which the asymptotic behavior of the network with delay is similar to that of the corresponding network without delay for almost all initial conditions.

We studied the dynamics of a single GRN with a delayed self-exciting connection, which satisfies the above condition. For this system, we derived theoretical and numerical descriptions of the boundary separating the basins of attraction of two locally stable equilibrium points. The description of the asymptotic behavior of the system is sketched in Figure 3 and can be summarized as follows: The boundary separating the basins of attraction of the two stable equilibria divides the phase space into two regions, in the same way a plane divides three-dimensional space (gray surface in Fig. 3). Initial conditions on each side of the boundary form the basin of attraction of one of the two stable equilibria. Moreover, solutions on the boundary oscillate around the unstable point (closed trajectory in Fig. 3).

Figure 3: Schematic representation of the basin boundary. The boundary separating the basins of attraction of the two stable equilibria x1 and x3 is schematically represented by a surface going through the unstable point x2 and a periodic solution (the closed trajectory). Solutions below (respectively above) the boundary converge to x1 (respectively x3). However, some are transiently attracted to the oscillatory solutions on the boundary before eventually approaching one of the two stable equilibrium points, as exemplified in the figure.

Two effects of the delay on the system behavior were then presented:

1. The location of the basin boundary changes with the delay. Although the system with delay has exactly the same stable equilibria as the corresponding system without delay, and almost all solutions tend to these stable equilibria, the solution of a given initial condition may tend to either of the two stable equilibria, or even remain on the boundary, depending on the value of the delay (Fig. 2a). In this respect, the delay should be counted among the important parameters shaping the basins of attraction of the equilibria. For a content-addressable memory network, this means that the information retrieved for an initial condition may change if the delay is modified. The asymptotic behavior of the single neuron with a delayed self-exciting connection is representative of excitatory irreducible networks. Moreover, as indicated by Hirsch (1989), other networks, including those with negative connection weights, can also be studied in the same way, for example, after appropriate changes of variables. This possibility is discussed at length in Hirsch (1989), and we refer to that paper for general results as well as specific examples.

2. Delay-induced transient oscillations. Due to the presence of sustained oscillations on the boundary, and attracting undamped oscillations in the limit of infinite delays, solutions of some initial conditions display transient oscillations before converging to one of the two stable equilibria (Fig. 2b). As the delay is increased, more and more solutions display such behavior, and the duration of the transient oscillations increases considerably. In many ANN applications, information processing by the network terminates when the system stabilizes in one of its equilibria (Hopfield 1984; Hirsch 1989). Thus, delay-induced transient oscillations may considerably slow information processing by lengthening the duration of the transient regime.

Larger networks also display delay-induced transient oscillations. These were reported in two-neuron networks from numerical investigations (Babcock and Westervelt 1987) and from both numerical and analytical studies (Pakdaman et al. 1995b). The long-lasting transient oscillations are due to the presence of attracting nonconstant periodic solutions of the discrete-time network associated with the continuous-time network. Discrete-time neural networks and, in general, discrete-time dynamical systems are more prone to display periodic (or oscillatory) solutions than corresponding continuous-time systems; therefore, delay-induced transient oscillations can be expected in many neural networks with delay.

In conclusion, for neural networks designed to converge to equilibria, such as content-addressable memories, the fraction of errors in retrieving the relevant information depends on the shape of the basins of attraction of the equilibrium points. We have shown that even in an almost convergent system, the boundary of the basins may be modified by the presence of delays. This can cause performance to deteriorate if the delay is not under control.

Appendix A: Asymptotic Behavior

The neuron activation evolves according to the following DDE:

da/dt (t) = −γ a(t) + K + W σ(a(t − A)) ,   (A.1)
where σ is the sigmoidal function defined in equation 2.2 and W and A are strictly positive real numbers. This system is similar to the positive feedback loop with a piecewise constant transfer function studied by an der Heiden and Mackey (1982), as well as several models analyzed by Sharkovsky and collaborators (1993).

Let C[−A, 0] be the space of continuous real functions on the interval [−A, 0]. For φ in C[−A, 0], there exists a unique real function a(t, φ) on the interval [−A, +∞) such that a(t, φ) = φ(t) for −A ≤ t ≤ 0 and a(t, φ) satisfies equation A.1 for t ≥ 0 (Hale and Verduyn Lunel 1993). For such a solution of the DDE, we denote by a_t(φ) the element of C[−A, 0] defined by a_t(φ)(θ) = a(t + θ, φ), for −A ≤ θ ≤ 0. Let φ0 and φ1 be two elements of C[−A, 0]; we say that φ0 is larger (respectively strictly larger) than φ1, written φ0 ≥ φ1 (respectively φ0 ≫ φ1), if for all θ in [−A, 0] we have φ0(θ) ≥ φ1(θ) (respectively φ0(θ) > φ1(θ)). DDE A.1 generates an order-preserving semiflow, as for φ0 and φ1 in C[−A, 0] we have: if φ0 ≥ φ1 and φ0 ≠ φ1, then for t > 2A,

a_t(φ0) ≫ a_t(φ1) .   (A.2)
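For illustration only, the following Python sketch (ours, not part of the original analysis) integrates DDE A.1 by forward Euler with the history stored so that the delayed term can be looked up. The logistic form σ(a) = 1/(1 + e^{−a}) and all parameter values are assumptions, chosen so that the system is bistable with the unstable point at 0; constant initial functions on either side of the basin boundary settle on different stable equilibria, and rerunning with a different delay A shifts the boundary's location, as described in point 1 of the Discussion.

```python
import math

def run(phi, A=2.0, gamma=1.0, K=-2.5, W=5.0, dt=0.005, T=100.0):
    """Integrate da/dt = -gamma*a(t) + K + W*sigma(a(t - A)) from history phi."""
    sigma = lambda a: 1.0 / (1.0 + math.exp(-a))  # assumed form of eq. 2.2
    n_delay = int(round(A / dt))
    hist = [phi(-A + i * dt) for i in range(n_delay + 1)]  # a on [-A, 0]
    for _ in range(int(T / dt)):
        a_now, a_del = hist[-1], hist[-1 - n_delay]
        hist.append(a_now + dt * (-gamma * a_now + K + W * sigma(a_del)))
    return hist[-1]

# With gamma = 1, K = -2.5, W = 5, the equilibrium condition a = K + W*sigma(a)
# has three solutions x1 < x2 = 0 < x3; initial functions below (above) the
# boundary approach x1 (x3), and the middle value probes the boundary itself.
for c in (-1.0, 0.05, 1.0):
    print(c, run(lambda t, c=c: c))
```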
The fact that DDE A.1 generates an order-preserving semiflow strongly limits the possible asymptotic behaviors of the solutions (Smith 1987; Roska et al. 1992). The number of equilibrium points of DDE A.1, as well as their stability, is deduced from the analyses presented in Cowan and Ermentrout (1978) and Pasemann (1993) and from the fact that the local stability of the stable equilibria in DDEs generating strongly order-preserving semiflows is not affected by the delay (Smith 1987).

• For g = W/4 < γ, all solutions converge uniformly asymptotically to x0.

• For g = W/4 > γ:

1. For either K < K− or K > K+, all solutions converge uniformly asymptotically to x0.

2. For K− < K < K+, the union of the basins of attraction of x1 and x3 is a dense open subset of C[−A, 0]. Let u be a strictly positive function defined on the interval [−A, 0]. For φ in C[−A, 0], there is a unique real number bu(φ) such that a_t(φ + c·u) tends asymptotically to x1 (respectively x3) for all c < bu(φ) (respectively c > bu(φ)), where for a real number c we denote by φ + c·u the element of C[−A, 0] defined by (φ + c·u)(t) = φ(t) + c·u(t), for −A ≤ t ≤ 0. The solution a(t, φ + bu(φ)·u) oscillates around x2, in the sense that the function a(t, φ + bu(φ)·u) − x2 has at least one zero on each interval kA < t < (k + 1)A, for k ≥ −1. The boundary of the basins of attraction of the two locally stable equilibrium points is formed by such oscillating solutions. For small delays these oscillations are damped to x2; for large delays they are either damped to x2 or asymptotically periodic (Arino 1993; Mallet-Paret and Sell 1996). bu is a strictly decreasing continuous map from C[−A, 0] to the real line. The boundary of the basins of attraction of the two locally stable equilibria is the closed set {φ such that bu(φ) = 0}. More precisely, the boundary is a codimension-one, unordered, Lipschitz manifold (Takáč 1991).

Let ν = ν_A be the real solution of the characteristic equation associated with DDE A.1 at x2, and let u(t) = e^{νt}, for −A ≤ t ≤ 0, be the most unstable eigendirection of the linearized equation at x2. We approximate bu(φ) by pu(φ), where −pu(φ) is the projection of φ onto the line directed by u along the eigenspace of all the other solutions of the characteristic equation of DDE A.1 (Hale and Verduyn Lunel 1993):

pu(φ) = x2 − φ(0) + W σ′(x2) e^{−νA} ∫_{−A}^{0} (x2 − φ(u)) e^{−νu} du .   (A.3)
This approximation is satisfactory near the unstable equilibrium. When the characteristic equation has a single solution with a positive real part, the basin boundary coincides, locally near x2, with the stable manifold of the unstable equilibrium point x2. In this case, the zero set of pu represents the tangent space to the stable manifold at x2.

Appendix B: Generalization

B.1 Time-Varying Inputs. Neural networks may be used for processing time-varying inputs. In this case, the input K varies with time. The nonautonomous system thus obtained is also order-preserving. For periodic input signals of period T, the time-T map associated with the system is a strongly order-preserving discrete-time map, and there are a number of results on the fixed points of such maps, and on the basins of attraction of the stable fixed points, that generalize those presented in the previous sections (Takáč 1992; Hirsch 1995). The behavior of a single GRN neuron with a delayed self-connection in the presence of slow periodic inputs has been investigated numerically (Pakdaman et al. 1995a). The general study of the behavior of GRN networks with delay in response to periodic inputs will be presented elsewhere.

B.2 The Shunting Model. In the shunting model, the neuron activation evolves according to the following DDE:

da/dt (t) = −γ a(t) + K + W (E − a(t)) σ(a(t − A)) ,   (B.1)
where E > 0 is the positive reversal potential. Let K0 = K/γ; we further suppose E > K0. For any function φ in C[−A, 0], there is a unique solution a(t, φ) of DDE B.1 going through φ, and there is a time T ≥ −A such that the activation is below the reversal potential after T; that is, for all t ≥ T, a(t, φ) < E. The restriction of the set of initial conditions of DDE B.1 to continuous functions on the interval [−A, 0] that are smaller than E generates a strictly order-preserving semiflow. Based on the above considerations and the study of the local stability of the equilibria of DDE B.1, we can state the following. Let K̄0 be the unique solution of the following equation such that K̄0 < E:

(e^{K̄0 + 2} − 1)(E − K̄0) + 4 = 0 .   (B.2)
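Taking equation B.2 at face value, the threshold K̄0 can be located numerically by bisection. The short Python sketch below is ours, with an arbitrary illustrative value of E; it relies only on the facts that the left-hand side is 4 at K̄0 = −2 and tends to −∞ as K̄0 → −∞.

```python
from math import exp

def f(k, E=2.0):
    """Left-hand side of equation B.2, with an illustrative reversal potential E."""
    return (exp(k + 2.0) - 1.0) * (E - k) + 4.0

# f(-2) = 4 > 0 while f -> -infinity as k -> -infinity, so a root lies in
# (-50, -2); repeated bisection pins it down to machine precision.
lo, hi = -50.0, -2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
print("K0_bar ~", 0.5 * (lo + hi))
```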
For K0 < K̄0, there is one globally asymptotically stable point, denoted x0, with K0 < x0 < E. For K0 > K̄0, there are two positive weight values W− and W+ such that when either W < W− or W > W+, there is one globally asymptotically stable point, also denoted x0, with K0 < x0 < E; and for W− < W < W+, the system is bistable: there are two locally asymptotically stable equilibria, denoted x1 and x3, and one unstable equilibrium point, denoted x2, with K0 < x1 < x2 < x3 < E. For the bistable system, most trajectories converge to the stable equilibrium points, in the sense that the union of the basins of attraction of the two stable equilibrium points is a dense open set. The boundary separating the two basins of attraction can be characterized in the same way as for the GRN.

Acknowledgments

We thank E. Av-Ron and Prof. O. Arino for helpful comments. This work was partially supported by USP/COFECUB under project U/C 9/94. One of us (CPM) is also partially supported by CNPq (the Brazilian Research Council).

References

Arino, O. 1993. A note on "the discrete Lyapunov function . . . ". Journal of Differential Equations 104, 169–181.
Babcock, K. L., and Westervelt, R. M. 1987. Dynamics of simple electronic neural networks. Physica D 28, 305–316.
Bélair, J. 1993. Stability in a model of a delayed neural network. Journal of Dynamics and Differential Equations 5, 607–623.
Burton, T. A. 1991. Neural networks with memory. Journal of Applied Mathematics and Stochastic Analysis 4, 313–332.
Burton, T. A. 1993. Averaged neural networks. Neural Networks 6, 677–680.
Civalleri, P. P., Gilli, M., and Pandolfi, L. 1993. On stability of cellular neural networks with delay. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 157–165.
Cowan, J., and Ermentrout, G. B. 1978. Some aspects of the "eigenbehavior" of neural nets. In Studies in Mathematical Biology, part I, S. A. Levin, ed., 67–117. Mathematical Association of America, Providence, RI.
Destexhe, A. 1994. Oscillations, complex spatiotemporal behavior, and information transport in networks of excitatory and inhibitory neurons. Physical Review E 50, 1594–1606.
Destexhe, A., and Gaspard, P. 1993. Bursting oscillations from a homoclinic tangency in a time delay system. Physics Letters A 173, 386–391.
Finnochiaro, M., and Perfetti, R. 1995. Relation between template spectrum and stability of cellular neural networks with delay. Electronics Letters 31, 2024–2026.
Gerstner, W. 1995. Time structure of the activity in neural network models. Physical Review E 51, 738–758.
Gilli, M. 1993. Strange attractors in delayed cellular neural networks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 849–853.
Gilli, M. 1995. A spectral approach for chaos prediction in delayed cellular neural networks. International Journal of Bifurcation and Chaos 5, 869–875.
Glass, L., and Malta, C. P. 1990. Chaos in multi-looped negative feedback systems. Journal of Theoretical Biology 145, 217–223.
Gopalsamy, K., and He, X.-Z. 1994a. Stability in asymmetric Hopfield nets with transmission delays. Physica D 76, 344–358.
Gopalsamy, K., and He, X.-Z. 1994b. Delay-independent stability in bidirectional associative memory networks. IEEE Transactions on Neural Networks 5, 998–1002.
Gopalsamy, K., and Leung, I. 1996. Delay-induced periodicity in a neural netlet of excitation and inhibition. Physica D 89, 395–426.
Grotta-Ragazzo, C., and Malta, C. P. 1992. Singularity structure of the Hopf bifurcation surface of a differential equation with two delays. Journal of Dynamics and Differential Equations 4, 617–650.
Hale, J. K., and Verduyn Lunel, S. M. 1993. Introduction to Functional Differential Equations. Applied Mathematical Sciences vol. 99. Springer-Verlag, New York.
an der Heiden, U. 1981. Analysis of Neural Networks. Lecture Notes in Biomathematics vol. 35. Springer-Verlag, New York.
an der Heiden, U., and Mackey, M. C. 1982. The dynamics of production and destruction: Analytic insight into complex behavior. Journal of Mathematical Biology 16, 75–101.
Herz, A., Sulzer, B., Kühn, R., and Van Hemmen, J. L. 1988. The Hebb rule: Storing static and dynamic objects in an associative neural network. Europhysics Letters 7, 663–669.
Hirsch, M. W. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2, 331–350.
Hirsch, M. W. 1995. Fixed points of monotone maps. Journal of Differential Equations 123, 171–179.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA 81, 3088–3092.
Houweling, A. R. 1994. The effects of interneural time delay in a simple two neuron shunting model. Personal communication.
Losson, J., Mackey, M. C., and Longtin, A. 1993. Solution multistability in first-order nonlinear differential delay equations. Chaos 3, 167–176.
Lourenço, C., and Babloyantz, A. 1994. Control of chaos in networks with delay: A model for synchronization of cortical tissue. Neural Computation 6, 1141–1154.
Mallet-Paret, J., and Sell, G. R. 1996. The Poincaré–Bendixson theorem for monotone cyclic feedback systems with delay. Journal of Differential Equations 125, 441–489.
Malta, C. P., and Grotta-Ragazzo, C. 1991. Bifurcation structure of scalar delayed equations. International Journal of Bifurcation and Chaos 1, 657–665.
Marcus, C. M., Waugh, F. R., and Westervelt, R. M. 1991. Nonlinear dynamics and stability of analog neural networks. Physica D 51, 234–247.
Marcus, C. M., and Westervelt, R. M. 1989. Stability of analog neural networks with delay. Physical Review A 39, 347–359.
Pakdaman, K., Alvarez, F., Diez-Martínez, O., and Vibert, J.-F. 1995a. Single neuron with recurrent connection: Response to slow periodic modulation. Biosystems. In press.
Pakdaman, K., Grotta-Ragazzo, C., Malta, C. P., and Vibert, J.-F. 1995b. Delay-induced transient oscillations in a two-neuron network. Tech. Rep., Publicações IFUSP/P-1183, Universidade de São Paulo, Brazil.
Pakdaman, K., Malta, C. P., Grotta-Ragazzo, C., and Vibert, J.-F. 1995c. Asymptotic behavior of excitatory irreducible networks of graded-response neurons. Tech. Rep., Publicações IFUSP/P-1200, Universidade de São Paulo, Brazil.
Pakdaman, K., Vibert, J.-F., Boussard, E., and Azmy, N. 1996. Single neuron with recurrent excitation: Effect of the transmission delay. Neural Networks 9, 797–818.
Pasemann, F. 1993. Dynamics of a single neuron model. International Journal of Bifurcation and Chaos 3, 271–278.
Plant, R. E. 1981. A FitzHugh differential-difference equation modeling recurrent neural feedback. SIAM Journal of Applied Mathematics 40, 150–162.
Roska, T., and Chua, L. O. 1992. Cellular neural networks with non-linear and delay-type template elements and non-uniform grids. International Journal of Circuit Theory and Applications 20, 469–481.
Roska, T., Wu, C. F., Balsi, M., and Chua, L. O. 1992. Stability and dynamics of delay-type general and cellular neural networks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 39, 487–490.
Roska, T., Wu, C. F., and Chua, L. O. 1993. Stability of cellular neural networks with dominant nonlinear and delay-type templates. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 270–272.
Sharkovsky, A. N., Maistrenko, Yu. L., and Romanenko, E. Yu. 1993. Difference Equations and Their Applications. Kluwer, Dordrecht.
Smith, H. 1987. Monotone semiflows generated by functional differential equations. Journal of Differential Equations 66, 420–442.
Sompolinsky, H., and Kanter, I. 1986. Temporal association in asymmetric neural networks. Physical Review Letters 57, 2861–2864.
Takáč, P. 1991. Domains of attraction of generic omega-limit sets for strongly monotone semi-flows. Zeitschrift für Analysis und ihre Anwendungen 10, 275–317.
Takáč, P. 1992. Domains of attraction of generic ω-limit sets for strongly monotone discrete-time semigroups. Journal für die reine und angewandte Mathematik 423, 101–173.
Vibert, J.-F., Pakdaman, K., and Azmy, N. 1994. Inter-neural delay modification synchronizes biologically plausible neurons. Neural Networks 7, 589–607.
Ye, H., Michel, A. N., and Wang, K. 1994. Global stability and local stability of Hopfield neural networks with delays. Physical Review E 50, 4206–4213.
Ye, H., Michel, A. N., and Wang, K. 1995. Qualitative analysis of Cohen-Grossberg neural networks with multiple delays. Physical Review E 51, 2611–2618.
Received June 16, 1995; accepted May 6, 1996.
Communicated by Terrence Fine
Shattering All Sets of k Points in “General Position” Requires (k − 1)/2 Parameters Eduardo D. Sontag Department of Mathematics, Rutgers University, New Brunswick, NJ 08903 USA
For classes of concepts defined by certain classes of analytic functions depending on n parameters, there are nonempty open sets of samples of length 2n + 2 that cannot be shattered. A slightly weaker result is also proved for piecewise-analytic functions. The special case of neural networks is discussed.
1 Introduction

The generalization capabilities of neural networks are often quantified in terms of the maximal number of possible binary classifications that could be obtained, by means of weight assignments, on any given set of input patterns. Obviously, the larger the number of such potential classifications, the lower the predictability that is possible on the basis of a partial assignment ("loading of training data") already achieved. Thus, it is of interest to obtain useful upper bounds on this number and in particular to study the Vapnik-Chervonenkis (VC) dimension, which is the size of the largest set of inputs that can be shattered (arbitrary binary labeling is possible). Recent results (cf. Koiran and Sontag in press and also Maass 1994 for closely related work and a survey) show that the VC dimension grows at least as fast as the square n^2 of the number of adjustable weights n in the net, and this number might grow as fast as n^4 (Karpinski and Macintyre in press). These results are quite pessimistic, since they imply that the number of samples required for reliable generalization, in the sense of PAC learning, is very high. On the other hand, it is conceivable that those sets of input patterns that can be shattered are all in some sense "special" and that if we ask instead, as done in the classical literature in pattern recognition, for the shattering of all sets in "general position" (e.g., Cover's work on the capacity of perceptrons, Cover 1988), then an upper bound of O(n) might hold. Strong evidence for this possibility was provided by Adam Kowalczyk (in press), who showed that this indeed happens for the (very special) case of hard-threshold neural networks with a fully connected first layer. In this article, I establish a linear upper bound for arbitrary sigmoidal (as well as threshold) neural nets, proving that in that sense, the perceptron results can be recovered in a strong sense (up to a factor of two).

Neural Computation 9, 337–348 (1997)
© 1997 Massachusetts Institute of Technology
There is potential relevance of the results to variations of PAC learning when one weakens the requirement that generalization capabilities must hold with respect to all possible input distributions; this will be the subject of future work. The estimate is also useful in the very different context of understanding computational abilities; as an illustration, note that the main result is being employed by Maass (in press) for contrasting the computational power of spiking neurons (Maass 1996) with that of sigmoidal neural networks.

1.1 Parametric Classes, Shattering. Assume given two positive integers n and m and a function

β : R^n × R^m → R .

In this context we will call n the number of parameters, elements of R^n parameter vectors (or weights), m the input dimension, elements of R^m input vectors, and β a response function. We think of β(x, u) as a function of the inputs u for each parameter vector x.

For each integer k, a k-tuple of the form u⃗ = (u1, . . . , uk) ∈ (R^m)^k = R^{mk} will be called an (ordered) sample of length k. We want to measure how rich the set of functions {β(x, ·), x ∈ R^n} is for binary classification purposes. Consider a sample u⃗ = (u1, . . . , uk) of length k. A sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k will be said to be assignable to u⃗ (with respect to the given response function) if there exists some parameter vector x so that sign(β(x, ui)) = εi, i = 1, . . . , k, where we define sign(y) = 1 if y > 0 and −1 otherwise. We will say here that a sample u⃗ of length k is ±-shattered (by β) if, for each ε⃗ ∈ {−1, 1}^k, either ε⃗ or −ε⃗ is assignable to u⃗. For each k, let Sk be the subset of (R^m)^k = R^{mk}, possibly empty, consisting of those samples that are ±-shattered. We define

µ = µβ := sup { k ≥ 1 | Sk is a dense subset of R^{mk} } .

Our goal is to show that under reasonable assumptions on β, necessarily µ ≤ 2n + 1.

The organization of this article is as follows. Section 2 discusses the precise statement of the result, for functions β that are analytic and "definable," as well as the fact that this bound is optimal. Section 3 provides a proof of the result, which is based on material on analytic functions and especially on new results from the theory of exponentials, which are reviewed briefly in the Appendix. Section 4 shows how to generalize the result to the piecewise-analytic definable case, which allows the consideration of neural networks with discontinuous activation functions.

2 Precise Statements and Remarks

Note that 2n + 1 is the best possible upper bound, assuming that one wants to prove a result that includes at least all polynomially parameterized classes. To show this, we must exhibit for each integer n > 0 a polynomial response function β : R^n × R^{m_n} → R with the property that S_{2n+1} is dense in R^{2n+1}. In fact, we give, for each n, a β with m_n = 1 so that every sample of length 2n + 1 consisting of distinct vectors can be ±-shattered:

β((x1, . . . , xn), u) := (u − x1)^2 (u − x2)^2 · · · (u − xn)^2 .

Take any u⃗ ∈ R^{2n+1} with all ui distinct. The claim is that this is ±-shattered. Indeed, pick any ε⃗ = (ε1, . . . , ε_{2n+1}) ∈ {−1, 1}^{2n+1}. Without loss of generality, we assume that there are s ≤ n elements "−1" in the sequence ε⃗ (otherwise, we assign −ε⃗ instead) and that these are ε_{i1}, . . . , ε_{is}. Let xj := u_{ij} for j = 1, . . . , s, and pick x_{s+1}, . . . , xn to be any elements not in the set {u1, . . . , u_{2n+1}}. We have that β(x, u) = 0 for each u = u_{ij}, j = 1, . . . , s, and β(x, u) > 0 for all other u ∈ R. Thus sign(β(x, ui)) = εi for all i.

The question we address next is what are "reasonable" assumptions on the class β so that an upper bound like 2n + 1 holds. It is not difficult to see that merely asking β to be an analytic function is not sufficient even to guarantee finiteness. This is illustrated trivially by the example β(x, u) = sin(xu), which has n = m = 1 but for which µβ = +∞ (this is a consequence of the fact that if the numbers u1, . . . , uk, π are linearly independent over the rationals, then the set of vectors (sin(ℓu1), . . . , sin(ℓuk)), as ℓ ranges over the positive integers, is a dense subset of [−1, 1]^k). In fact, it is possible to define a fixed analytic function β, with m = 1, for which Sk consists of all sequences of k distinct elements of R, not merely a dense subset of R^{km}, for all k (see Sontag 1992). Thus, analyticity is in itself not enough to obtain nontrivial bounds.

We will assume that β is analytic and definable. The Appendix recalls the definition and basic facts about ("exp-RA") definable functions. Informally, these are functions that can be defined in terms of any first-order logic sentence built out of the standard propositional connectives, existential and universal quantification, and which involve rational operations and exponentiation. (Also allowed in the formulas are certain restricted analytic functions, for instance, arctan(x), but sin(x) is not included.) In particular (and this is of interest in the context of applications to "artificial neural networks"), any response of a "multilayer sigmoidal network" with "activation function" 1/(1 + e^{−x}) is definable in this sense, because it is obtained by iterative compositions and polynomial combinations involving the activation function, which in turn is constructed by means of rational operations involving exponentiation.
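Before turning to the main theorem, the polynomial construction above is easy to check mechanically. The following Python sketch is ours (the choice n = 3 and the sample values are arbitrary); it verifies that for every sign sequence, either it or its negation is realized by placing a root of β at each point that must receive sign −1 and parking the spare roots away from the sample:

```python
import itertools

def beta(roots, u):
    prod = 1.0
    for x in roots:
        prod *= (u - x) ** 2
    return prod

sign = lambda y: 1 if y > 0 else -1   # the convention used in the text

n = 3
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # 2n + 1 distinct inputs

def assignable(eps):
    neg = [u for u, e in zip(points, eps) if e == -1]
    if len(neg) > n:
        return False
    # one root at each point needing sign -1; spare roots far from the sample
    roots = neg + [max(points) + j + 1.0 for j in range(n - len(neg))]
    return all(sign(beta(roots, u)) == e for u, e in zip(points, eps))

assert all(assignable(eps) or assignable(tuple(-e for e in eps))
           for eps in itertools.product((-1, 1), repeat=2 * n + 1))
print("every sign pattern is realized, up to a global flip")
```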
Theorem 1. If β : R^n × R^m → R is an analytic definable response, then µβ ≤ 2n + 1.
This result is proved in the next section.

Remark 2.1. In computational learning theory, one defines shattering of a sample u⃗ by the requirement that each ε⃗ be assignable, and studies the set S_k^+ consisting of all those samples that are shattered. For our results, it seems more natural to look at ±-shattering because we then obtain a best-possible bound. In any case, the two concepts are almost identical. Since S_k^+ ⊆ Sk for all k, the upper bound 2n + 1 will also apply in particular to sup { k ≥ 1 | S_k^+ is a dense subset of R^{km} }. (Conversely, if u⃗ can be ±-shattered, then the new response function with n + 1 parameters, defined for parameter vectors (x0, x1, . . . , xn) ∈ R^{n+1} by β0((x0, x), u) := 1 − x0 β(x, u), shatters the same sample u⃗.) In learning theory and statistical estimation, the Vapnik-Chervonenkis (VC) dimension is defined as the supremum of the integers k such that S_k^+ is nonempty; here we are asking questions regarding the shattering of arbitrary sets of points in "general position," a concept that appears in Cover's and in other classical pattern recognition work. For the best-known (obviously larger) upper bounds for the VC dimension, for closely related classes of responses, see Karpinski and Macintyre (in press), and see also Maass (1994) and Koiran and Sontag (1996) for superlinear lower bounds.

3 Proof of Main Result

We need to show that if k ≥ 2n + 2, then Sk is not dense; that is, there is some nonempty open subset Q ⊆ R^{mk} so that Q ∩ Sk = ∅. It will be enough to show this when k = 2n + 2. In fact, we will establish a much stronger fact: for each nonempty open subset W0 ⊆ R^{m(n+1)}, there is a nonempty open subset Q ⊆ W0^2 such that Q ∩ S_{2n+2} = ∅.

For any integer ℓ, we introduce the following subset of R^{mℓ}:

A_ℓ := { (u1, . . . , uℓ) | ∃ x s.t. β(x, ·) ≢ 0, β(x, u1) = · · · = β(x, uℓ) = 0 } .

By corollary A.1, it follows that A_ℓ has dimension strictly less than mℓ whenever ℓ ≥ n + 1. The content of that lemma is totally intuitive: for each fixed x, except those for which the map β(x, u) is identically zero, the set of u1's so that β(x, u1) = 0 is of dimension at most m − 1 (since this set is the set of zeros of a nontrivial analytic function); and similarly for each of u2, . . . , uℓ. Thus, for each such x, the Cartesian product of these sets, namely, the set of vectors (u1, . . . , uℓ) so that β(x, u1) = · · · = β(x, uℓ) = 0, has dimension at most ℓ(m − 1). If we now let x range over the whole parameter space R^n (but not including those x for which β(x, ·) ≡ 0), the set A_ℓ is obtained as a union of an n-parameter family of sets, each of which is of dimension ≤ ℓ(m − 1), from which it follows that A_ℓ has dimension at most
n + ℓ(m − 1), which is less than mℓ provided ℓ ≥ n + 1. The technicalities have to do only with defining precisely the meaning of "dimension" (this will mean the maximum possible dimension of a submanifold of the set) and establishing the obvious formulas, such as the fact that the union of an n-parameter family of sets of dimension ≤ s has dimension ≤ n + s.

Let W0 be any open subset of R^{m(n+1)}. The fact that β is definable implies, by lemma A.2, that there are only two possibilities for each set A_ℓ: either it contains some open set, or it is nowhere dense. Since when ℓ = n + 1 this set does not have full dimension, as shown in the previous paragraph, the first alternative cannot hold, so it must be nowhere dense. That is, its closure has an empty interior, which implies that W0 \ clos A_{n+1} is a nonempty open set. Thus there is some open subset V of W0 ⊆ R^{m(n+1)} of the special product form V = V1 × · · · × V_{n+1}, for some nonempty connected open subsets V1, . . . , V_{n+1} of R^m, so that V ∩ A_{n+1} = ∅. Now let Q ⊆ W0^2 ⊆ R^{2m(n+1)} be the open set defined as follows:

Q := V × V .

The claim is that Q ∩ S_{2n+2} = ∅. To establish this fact, we take an arbitrary element

(u1, . . . , u_{n+1}, v1, . . . , v_{n+1})

of Q and show that it cannot be ±-shattered; that is, there is some sequence ε⃗ = (ε1, . . . , ε_{2n+2}) ∈ {−1, 1}^{2n+2} for which neither ε⃗ nor −ε⃗ is assignable. Consider the following sequence:

ε⃗ := (1, . . . , 1, −1, . . . , −1) ,

with n + 1 entries equal to 1 followed by n + 1 entries equal to −1.
Assume that there would exist some parameter vector x so that this vector is assigned:

β(x, ui) > 0, i = 1, . . . , n + 1,
β(x, vi) ≤ 0, i = 1, . . . , n + 1.

Observe that, in particular, it holds that β(x, ·) ≢ 0 for this x. For each i = 1, . . . , n + 1, let γi : [0, 1] → Vi ⊆ R^m be a continuous function so that γi(0) = ui and γi(1) = vi. (Recall that Vi is connected.) Since β(x, γi(t)) is a continuous function of t, we conclude that for each i there is some ti so that β(x, γi(ti)) = 0. Writing wi := γi(ti), this says that w⃗ := (w1, . . . , w_{n+1}) ∈ A_{n+1} and that w⃗ ∈ V1 × · · · × V_{n+1} = V, which contradicts the fact that V was picked so that it is disjoint from A_{n+1}. Thus ε⃗ is not assignable. If −ε⃗ were assignable, we would
have an x such that β(x, ui) ≤ 0 for i = 1, . . . , n + 1 and β(x, vi) > 0 for i = 1, . . . , n + 1, and the same argument leads once more to a contradiction.

Remark 3.1. Observe that the definability property is used when proving that A_{n+1} cannot be dense. If instead one would pick, for instance, β(x, u) = sin(xu), the property is false. For this example, A2 is the union of all the lines in R^2 that pass through (0, 0) and have a rational slope (plus the y-axis), so A2 in this case is a dense set with an empty interior.

Remark 3.2. As remarked during the proof, a stronger result is established. In particular, for each nonempty open subset W ⊆ R^m, there is a nonempty open subset Q ⊆ W^{2n+2} such that Q ∩ S_{2n+2} = ∅. (Apply with W0 := W^{n+1}.)

4 Discontinuous Responses

The main theorem requires that the response function β be analytic as well as definable. It is possible, however, to extend its validity to some other classes of responses, including those obtained by considering neural networks that use "heaviside" activation functions

H(v) = 0 if v ≤ 0,   H(v) = 1 if v > 0,

as well as sigmoidal activations. Such responses are a particular case of a piecewise-analytic definable response, meaning a response that is described by a (possibly different) definable analytic function in each piece of a partition of the space of parameters and inputs. The partition in turn is assumed to be determined by the signs of a finite number of definable analytic functions. We now make this concept precise.

The map β : R^n × R^m → R is a piecewise-analytic definable response if there is some integer p and there exist p + 2^p analytic definable functions

φ1, . . . , φp : R^n × R^m → R

and

{ ψα : R^n × R^m → R, α ∈ {0, 1}^p }

such that the following property holds: for each (x, u) ∈ R^n × R^m, letting

α(x, u) := ( H(φ1(x, u)), . . . , H(φp(x, u)) ) ,

then β(x, u) = ψ_{α(x,u)}(x, u). (That is, in each of the components of the partition of the (x, u) space determined by the possible signs of the functions φi, β is expressed by an appropriate ψα.)
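As a concrete, hypothetical instance of this definition, a single hard-threshold unit with an affine argument fits the template with p = 1: one partition function φ1 and one (here constant) analytic piece per index α. A minimal Python sketch of ours:

```python
def H(v):                        # the "heaviside" activation from the text
    return 1 if v > 0 else 0

def phi1(x, u):                  # the single partition function (p = 1),
    return x[0] * u + x[1]       # affine in u, as for a threshold unit

psi = {(0,): lambda x, u: -0.5,  # one definable analytic piece per alpha;
       (1,): lambda x, u: 0.5}   # constants here, for simplicity

def beta(x, u):
    alpha = (H(phi1(x, u)),)     # alpha(x, u), as in the definition
    return psi[alpha](x, u)

print(beta((1.0, -2.0), 3.0), beta((1.0, -2.0), 1.0))   # 0.5 -0.5
```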
Corollary 4.1. If β is piecewise analytic definable, then µβ ≤ 2n + 3.

The proof will be based on the construction of an analytic definable response β′ : R^{n+1} × R^m → R with the following property: if a given sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k is assignable to a sample u⃗ = (u1, . . . , uk) with respect to the response β, then it is also assignable to the same sample with respect to the response β′. This means that if a sample u⃗ is ±-shattered with respect to β, then it is also ±-shattered with respect to β′, so we have µβ ≤ µ_{β′}. By the main theorem we know that µ_{β′} ≤ 2(n + 1) + 1, so the desired conclusion follows.

We now show how to construct β′. As a preliminary step, we establish a simple fact regarding the approximation of piecewise constant functions by sigmoidal combinations. We first introduce some additional notation. Given any vector r = (r1, . . . , rp) ∈ R^p, we let γ(r) := (H(r1), . . . , H(rp)). Consider the function

θ : R^p × R^{2^p} → R : (r, s) ↦ s_{γ(r)}

(where we are indexing the coordinates of s ∈ R^{2^p} by vectors α = (α1, . . . , αp) ∈ {0, 1}^p). Now let σ : R → R be the function σ(v) = (1 + e^{−v})^{−1} and, denoting σ1 := σ, σ0 := 1 − σ, consider, for each positive real number ρ, the (definable and analytic) function

χρ : R^p × R^{2^p} → R : (r, s) ↦ Σ_{α ∈ {0,1}^p} σ_{α1}(ρ^2 r1 − ρ) · · · σ_{αp}(ρ^2 rp − ρ) s_α .   (4.1)
Lemma 4.1. For each pair (r, s) ∈ R^p × R^{2^p},

lim_{ρ→∞} χρ(r, s) = θ(r, s)   (4.2)

and

θ(r, s) = 0 ⇒ lim_{ρ→∞} (1 + ρ^2) χρ(r, s) = 0 .   (4.3)
Proof. Observe that we can rewrite θ as follows:

θ(r, s) = Σ_{α ∈ {0,1}^p} H_{α1}(r1) · · · H_{αp}(rp) s_α ,

where H1 := H, H0 := 1 − H. Thus the first conclusion of the lemma is an immediate consequence of the fact that lim_{ρ→∞} σ(ρ^2 r − ρ) = H(r) for each
r ∈ R. Now assume that (r, s) is such that θ(r, s) = s_{γ(r)} = 0. This means that the term corresponding to the index α = γ(r) does not appear in the sum (cf. equation 4.1) defining χρ(r, s). Fix any other index α; we claim that

lim_{ρ→∞} (1 + ρ^2) σ_{α1}(ρ^2 r1 − ρ) · · · σ_{αp}(ρ^2 rp − ρ) = 0
(from which the second conclusion will follow). Indeed, since α ≠ γ(r), there must exist some j ∈ {1, . . . , p} for which αj ≠ H(rj); fix one such j. If αj = 1, then H(rj) = 0 means that rj ≤ 0, and therefore (1 + ρ^2) σ(ρ^2 rj − ρ) → 0 as ρ → ∞; since the other terms are bounded, it follows that their product converges to zero. If instead αj = 0, then H(rj) = 1 means that rj > 0, and thus (1 + ρ^2)(1 − σ(ρ^2 rj − ρ)) → 0 as ρ → ∞; again since the other terms are bounded, the product goes to zero.

Equation 4.3 implies in particular that when θ(r, s) = 0,

χρ(r, s) − 1/(1 + ρ^2) < 0

for all ρ large enough. Together with equation 4.2, this means that

sign ( χρ(r, s) − 1/(1 + ρ^2) ) = sign θ(r, s)
as ρ → ∞, for each (r, s) ∈ R^p × R^{2^p}.

We now prove corollary 4.1. Let β, φ1, . . . , φp, and the ψα's be as in the definition of piecewise-analytic definable response. Denote Φ(x, u) := (φ1, . . . , φp) and let Ψ(x, u) ∈ R^{2^p} be the vector whose αth component is ψα(x, u). Then β(x, u) = θ(Φ(x, u), Ψ(x, u)). We define

β′((ρ, x), u) := χρ(Φ(x, u), Ψ(x, u)) − 1/(1 + ρ^2) .
Thus sign β′((ρ, x), u) = sign β(x, u) as ρ → ∞, for each (x, u) ∈ R^n × R^m. Now assume that the sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k is assignable to the sample u⃗ = (u1, . . . , uk) with respect to the response β. This means that there is some parameter vector x ∈ R^n so that sign(β(x, ui)) = εi for each i = 1, . . . , k. For large enough ρ, then, also sign(β′((ρ, x), ui)) = εi for all i, so we conclude that ε⃗ is also assignable to the same sample with respect to the response β′.

Remark 4.1. The bound 2n + 3 can be reduced to 2n + 1 provided that the classes of functions φi(x, ·) and ψα(x, ·) are closed under scaling and translation (as is the case commonly when dealing with artificial neural networks, where these are affine functions and the parameters x correspond to the coefficients of linear combinations [weights] and constant term [bias]). In that case, the parameter ρ can be absorbed into the remaining parameters.
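A short numerical check of lemma 4.1 and of the sign relation used in the proof can be written in a few lines of Python; the choice p = 2, the vector r, and the table s below are ours and arbitrary (s is chosen so that θ(r, s) = 0), and, as the text states, the agreement in sign sets in only for sufficiently large ρ:

```python
import itertools, math

H = lambda v: 1 if v > 0 else 0
sigma = lambda v: 1.0 / (1.0 + math.exp(-v))
sign = lambda y: 1 if y > 0 else -1

def theta(r, s):
    return s[tuple(H(ri) for ri in r)]

def chi(rho, r, s):                       # equation 4.1
    total = 0.0
    for alpha in itertools.product((0, 1), repeat=len(r)):
        term = s[alpha]
        for a, ri in zip(alpha, r):
            f = sigma(rho * rho * ri - rho)
            term *= f if a == 1 else 1.0 - f
        total += term
    return total

r = (0.3, -0.7)
s = {(0, 0): 1.0, (0, 1): -2.0, (1, 0): 0.0, (1, 1): 3.0}  # theta(r, s) = 0
for rho in (2.0, 5.0, 20.0):
    c = chi(rho, r, s)
    print(rho, c, sign(c - 1.0 / (1.0 + rho * rho)), sign(theta(r, s)))
```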
Appendix

We first review some elementary facts regarding real-analytic geometry. If M is any (analytic, second-countable) manifold, by a σ-analytic subset Z of M we mean one that can be decomposed into a countable union of embedded analytic submanifolds; for such a set Z, we define dim Z as the largest dimension of a submanifold appearing in such a union. (The dimension is well defined, in the sense that it does not depend on the particular decomposition.) Consider now any analytic map π : M → N between two manifolds. For each y ∈ N, the "fiber" M_y = {x ∈ M | π(x) = y} provides an example of a σ-analytic subset (and if M is connected, either M_y = M or dim M_y < dim M). Now take any σ-analytic subset Z of M. Then the image π(Z) is a σ-analytic subset of N, having dim π(Z) ≤ dim Z, and the following inequality holds:

dim Z ≤ dim π(Z) + max_{y ∈ π(Z)} dim ( M_y ∩ Z ) .   (A.1)
(This inequality is a simple consequence of stratification theory; it is proved in the appendix of Sontag 1996. Note, incidentally, that a strict inequality may hold, as illustrated by the case M = R^2, N = R, π = projection on the first factor, and Z = the union of the x and y axes.)

We prove the following fact, which is basically theorem 1 in Sontag (1996). (That theorem could also be applied more directly, but the slight generalization given here should be useful for other purposes.)

Lemma A.1. Let X, U1, . . . , Uℓ be (analytic, second countable) manifolds, with dimensions n, m1, . . . , mℓ respectively, and each Ui connected. Assume given ℓ analytic maps

βi : X × U1 × · · · × Ui → R, i = 1, . . . , ℓ,

so that the following property holds for each i = 1, . . . , ℓ:

∀ x ∈ X, ∀ u1 ∈ U1, . . . , ∀ u_{i−1} ∈ U_{i−1}, ∃ u ∈ Ui so that βi(x, u1, . . . , u_{i−1}, u) ≠ 0 .   (A.2)

Let

Gℓ := { (x, u1, . . . , uℓ) ∈ X × U1 × · · · × Uℓ | β1(x, u1) = · · · = βℓ(x, u1, . . . , uℓ) = 0 } .
Then this is a σ-analytic set with

dim Gℓ ≤ n + Σ_{j=1}^{ℓ} mj − ℓ .
Proof. Define

Gi := { (x, u1, . . . , ui) ∈ X × U1 × · · · × Ui | β1(x, u1) = · · · = βi(x, u1, . . . , ui) = 0 }

for each i = 1, . . . , ℓ, and G0 := X; we prove by induction on i that dim Gi ≤ n + Σ_{j=1}^{i} mj − i. For i = 0 the result holds, so we now assume it has been proved for i and show the case i + 1. Let π : X × U1 × · · · × U_{i+1} → X × U1 × · · · × Ui be the projection on the first 1 + i factors. Write Z := G_{i+1}. Note that π(Z) ⊆ Gi by definition of these sets, so in particular it holds by inductive hypothesis that

dim π(Z) ≤ n + Σ_{j=1}^{i} mj − i .   (A.3)

For each fixed y = (x, u1, . . . , ui) ∈ π(Z), π^{−1}(y) ∩ Z = {x} × {u1} × · · · × {ui} × Q, where

Q_{(x,u1,...,ui)} := { u ∈ U_{i+1} | β_{i+1}(x, u1, . . . , ui, u) = 0 } .

Property A.2 implies that dim Q_{(x,u1,...,ui)} ≤ m_{i+1} − 1. Thus from equation A.1 and using equation A.3, we conclude that

dim Z ≤ n + Σ_{j=1}^{i} mj − i + m_{i+1} − 1 = n + Σ_{j=1}^{i+1} mj − (i + 1) ,

and the induction is complete.

Corollary A.1. Let β : R^n × R^m → R be analytic, and consider for any integer ℓ the set

A_ℓ := { (u1, . . . , uℓ) | ∃ x s.t. β(x, ·) ≢ 0, β(x, u1) = · · · = β(x, uℓ) = 0 } .

Then this is a σ-analytic set of dimension at most n + ℓ(m − 1).

Proof. Let X = {x ∈ R^n | β(x, ·) ≢ 0}; this is an open subset of R^n and hence a submanifold. (If X = ∅, there is nothing to prove.) Apply lemma A.1 with βi(x, u1, . . . , ui) := β(x, ui) (and all Ui = U); observe that property A.2 holds by definition of X. Then dim Gℓ ≤ n + ℓ(m − 1), and the same bound applies to its projection A_ℓ on the U components.
Finally, we review several facts about exp-RA definable functions. This discussion is based on recent work in model theory due to Gabrielov, Van den Dries, Wilkie, and many others; see Macintyre and Sontag (1993) and Sontag (1996) for more details and extensive bibliographical references.

A restricted analytic (RA) function R^q → R is any function that is real analytic on some compact subset C of R^q and is zero outside C (examples: the restricted versions of sin and cos, which equal sin and cos on [−π/2, π/2] and are zero outside this interval). Formally, consider the structure

L = (R, +, ·, <, 0, 1, exp, { f, f ∈ RA }) ,

and the corresponding language for the real numbers with addition, multiplication, and order, as well as one function symbol for real exponentiation and one for each restricted analytic function. An (exp-RA) definable set is any subset of R^k defined by a first-order formula over L with k free variables. A definable function is a function β : M → N whose graph is a definable set in this sense, and where N and M are definable subsets of two spaces R^{l1} and R^{l2}, respectively. For example, any function obtained by compositions of rational operations and taking exponentials is definable, such as 1/(1 + e^{−x}); also definable is the function arctan(x), because its graph is characterized by the formula "y = arctan(x) iff −π/2 < y < π/2 and sin(y) = x cos(y)." Notice that any set obtained from a function β by logical operations (such as the set A_ℓ in corollary A.1) is definable provided that β is definable.

The following fact is a nontrivial consequence of the work in logic cited above; see precise references in Sontag (1996):

Lemma A.2. Let S be a definable subset of R^q for some q. Then either S contains an open subset, or it is a finite union of connected embedded submanifolds of R^q (and in particular is nowhere dense).

Acknowledgments

This work was supported in part by US Air Force grant AFOSR-94-0293.

References

Cover, T. M. 1988. Capacity problems for linear machines. In Pattern Recognition, L. Kanal, ed., pp. 283–289. Thompson Book Co.
Karpinski, M., and Macintyre, A. In press. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. J. Computer Systems Sciences.
Koiran, P., and Sontag, E. D. In press. Neural networks with quadratic VC dimension. J. Computer Systems Sciences.
Kowalczyk, A. In press. Estimates of storage capacity of multilayer perceptron with threshold logic units. Neural Networks.
Maass, W. 1994. Perspectives of current research about the complexity of learning in neural nets. In Theoretical Advances in Neural Computation and Learning,
V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 295–336. Kluwer, Boston.
Maass, W. 1996. Lower bounds for the computational power of networks of spiking neurons. Neural Computation 8, 1–40.
Maass, W. In press. The third generation of neural network models. In Proc. 7th Australian Conference on Neural Networks 1996, Canberra, Australia.
Macintyre, A., and Sontag, E. D. 1993. Finiteness results for sigmoidal "neural" networks. In Proc. 25th Annual Symp. Theory of Computing, pp. 325–334, San Diego, CA.
Sontag, E. D. 1992. Feedforward nets for interpolation and classification. J. Computer Systems Sciences 45, 20–48.
Sontag, E. D. 1996. Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal nets. Advances in Computational Mathematics 5, 245–268.
Received March 18, 1996; accepted May 21, 1996.
Communicated by Shun-ichi Amari
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions Vijay Balasubramanian Department of Physics, Princeton University, Princeton, NJ 08544 USA
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of low-temperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam's razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters, and I derive and discuss an interpretation of Jeffreys' prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.

Neural Computation 9, 349–368 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

In recent years increasingly precise experiments have directed the interest of biophysicists toward learning in simple neural systems. The typical context of such learning involves an estimation of some behaviorally relevant information from a statistically varying environment. For example, the experiments of de Ruyter and collaborators have provided detailed measurements of the adaptive encoding of wide-field horizontal motion by the H1 neuron of the blowfly (de Ruyter et al. 1995). Under many circumstances the associated problems of statistical estimation can be fruitfully cast in the language of statistical mechanics, and the powerful techniques developed in that discipline can be brought to bear on questions regarding learning (Potters and Bialek 1994).

This article concerns a problem that arises frequently in the context of biophysical and computational learning: the estimation of parametric models of some true distribution t based on a collection of data drawn from t. If we are given a particular family of parametric models (gaussians, for example), the task of modeling t is reduced to parameter estimation, which is a relatively well-understood, though difficult, problem. Much less is known about the task of model family selection. For example, how do we choose between a family of gaussians and a family of 50 exponentials as a model for t based on the available data? In this article we will be concerned with the latter problem, on which considerable ink has already been expended in the literature (Rissanen 1984, 1986; Barron 1985; Clarke and Barron 1990; Barron and Cover 1991; Wallace and Freeman 1987; Yamanishi 1995; MacKay 1992a, 1992b; Moody 1992; Murata et al. 1994).

The first contribution is to cast Bayesian model family selection more clearly as a statistical mechanics on the space of probability distributions, in the hope of making this important problem more accessible to physicists. In this language, a finite-dimensional parametric model family is viewed as a manifold embedded in the space of probability distributions. The probability of the model family given the data can be identified with a partition function associated with a particular energy functional. The formalism bears a resemblance to the description of a disordered system in which the number of data points plays the role of the inverse temperature and in which the data play the role of the disordering medium. Exploiting the techniques of low-temperature expansions in statistical mechanics, it is easy to extend existing results that use gaussian approximations to the Bayesian posterior probability of a model family to find "Occam factors" penalizing complex models (Clarke and Barron 1990; MacKay 1992a, 1992b). We find a systematic expansion in powers of 1/N, where N is the number of data points, and identify terms that encode accuracy, model dimensionality, and robustness, as well as higher-order measures of simplicity. The subleading terms, which can be important when the number of data points is small, represent a limited attempt to move analysis of Bayesian statistics away from asymptotics toward the regime of small N that is often biologically relevant. The results presented here do not require the true distribution to be a member of the parametric family under consideration, and the model degeneracies that can threaten analysis in such cases are dealt with by the method of collective coordinates from statistical mechanics. Some connections with the minimum description length principle and stochastic complexity are mentioned (Barron and Cover 1991; Clarke and Barron 1990; Rissanen 1984, 1986; Wallace and Freeman 1987). The relationship to the network information criterion of Murata et al. (1994) and the "effective dimension" of Moody (1992) is discussed.

In order to perform Bayesian model selection, it is necessary to have a prior distribution on the space of parameters of a model. Equivalently, we require the correct measure on the phase space defined by the parameter manifold in the analog statistical mechanical problem considered in this article. In the absence of well-founded reasons to pick a particular prior distribution, the usual prescription is to pick an unbiased prior density that weights all parameters equally. However, this prescription is not invariant under reparametrization, and I will argue that the correct prior should give equal weight to all distributions indexed by the parameters. Requiring
all distributions to be a priori equally likely yields Jeffreys' prior on the parameter manifold, giving a new interpretation of this choice of prior density (Jeffreys 1961). Finally, consideration of the large N limit of the asymptotic expansion of the Bayesian posterior probability leads us to define the razor of a model: a theoretical index of the complexity of a parametric family relative to a true distribution. In statistical mechanical terms, the razor is the quenched approximation of the disordered system studied in Bayesian statistics. Analysis of the razor using the techniques of statistical mechanics can give insights into the types of phenomena that can be expected in systems that perform Bayesian statistical inference. These phenomena include phase transitions in learning and adaptation to changing environments. In view of the length of this article, applications of the general framework developed here to specific models relevant to biophysics will be left to future publications.

2 Statistical Inference and Statistical Mechanics

Suppose we are given a collection of outcomes E = {e1, . . . , eN}, ei ∈ X, drawn independently from a density t. Suppose also that we are given two parametric families of distributions A and B, and we wish to pick one of them as the model family that we will use. The Bayesian approach to this problem consists of computing the posterior conditional probabilities Pr(A|E) and Pr(B|E) and picking the family with the higher probability. Let A be parameterized by a set of real parameters Θ = {θ1, . . . , θd}. Then Bayes' rule tells us that

Pr(A|E) = ( Pr(A) / Pr(E) ) ∫ d^d Θ w(Θ) Pr(E|Θ) .   (2.1)
In this expression Pr(A) is the prior probability of the model family, w(Θ) is a prior density on the parameter space, and Pr(E) = Pr(A) Pr(E|A) + Pr(B) Pr(E|B) is a density on the N-outcome sample space. The measure induced by the parameterization of the d-dimensional parameter manifold is denoted d^d Θ. Since we are interested in comparing Pr(A|E) with Pr(B|E), the density Pr(E) is a common factor that we may omit, and for lack of any better choice we take the prior probabilities of A and B to be equal and omit them.¹ Finally, throughout this article we will assume that the model families of interest have compact parameter spaces. This condition is easily
¹ Bayesian inference is sometimes carried out by looking for the best-fit parameter that maximizes w(Θ) Pr(E|Θ) rather than by averaging over the Θ as in equation 2.1. However, the average over model parameters is important because it encodes Occam's razor, or a preference for simple models. Also see Balasubramanian (1996).
relaxed by placing regulators on noncompact parameter spaces, but we will not concern ourselves with this detail here. 2.1 Natural Priors or Measures on Phase Space. In order to make further progress, we must identify the prior density w(2). In the absence of a well-motivated prior, a common prescription is to use the uniform distribution on the parameter space since this is deemed to reflect complete ignorance (MacKay 1992a). In fact, this choice suffers from the serious deficiency that the uniform priors relative to different parameterizations can assign different probability masses to the same subset of parameters (Jeffreys 1961; Lee 1989). Consequently, if w(2) was uniform in the parameters, the probability of a model family would depend on the arbitrary parameterization. The problem can be solved by making the much more reasonable requirement that all distributions rather than all parameters are equally likely.2 In order to implement this requirement, we should give equal weight to all distinguishable distributions on a model manifold. However, nearby parameters index very similar distributions. So let us ask: How do we count the number of distinct distributions in the neighborhood of a point on a parameter manifold? Essentially this is a question about the embedding of the parameter manifold in the space of distributions. Points that are distinguishable as elements of Rn may be mapped to indistinguishable points (in some suitable sense) of the embedding space. To answer the question, let 2p and 2q index two distributions in a parametric family, and let E = {e1 , . . . , eN } be drawn independently from one of 2p or 2q . In the context of model estimation, a suitable measure of distinguishability can be derived by asking how well we can guess which of 2p or 2q produced E. Let αN be the probability that 2q is mistaken for ² be the 2p , and let βN be the probability that 2p is mistaken for 2q . Let βN smallest possible βN given that αN < ². Then Stein’sR lemma tells us that ² = D(2p k2q ), where D(pkq) = dx p(x) ln(p(x)/q(x)) limN→∞ (−1/N) ln βN is the relative entropy between the densities p and q (Cover 1991). As shown in the Appendix, the proof of Stein’s lemma shows that the ² exceeds a fixed β ∗ in the region where κ/N ≥ D(2p k2q ) minimum error βN ∗ with κ ≡ − ln β + ln(1 − ²).3 By taking β ∗ close to 1 we can identify the region around 2p where the distributions are not very distinguishable from the one indexed by 2p . As N grows large for fixed κ, any 2q in this region is necessarily close to 2p since D(2p k2q ) attains a minimum of zero when 2p = 2q . Therefore, setting 12 = 2q − 2p , Taylor expansion gives D(2p k2q ) ≈ (1/2)Jij (2p )12i 12 j + O(123 ) where
2
This applies the principle of maximum entropy on the invariant space of distributions rather than the arbitrary space of parameters. 3 This assertion is not strictly true. See the Appendix for more details.
Statistical Inference, Occam’s Razor, and Statistical Mechanics
353
J_ij = ∇_{φ_i} ∇_{φ_j} D(Θ_p‖Θ_p + Φ)|_{Φ=0} is the Fisher information.⁴ (We use the convention that repeated indices are summed over.)

In a certain sense, the relative entropy D(Θ_p‖Θ_q) appearing in this problem is the natural distance between probability distributions in the context of model selection. Although it does not itself define a metric, the Taylor expansion locally yields a quadratic form with the Fisher information acting as the metric. If we accept J_ij as the natural metric, differential geometry immediately tells us that the reparameterization-invariant measure on the parameter manifold is d^dΘ √(det J) (Amari 1985; Amari et al. 1987). Normalizing this measure by dividing by ∫ d^dΘ √(det J) gives the so-called Jeffreys' prior on the parameters.

A more satisfying explanation of the choice of prior proceeds by directly counting the number of distinguishable distributions in the neighborhood of a point on a parameter manifold. Define the volume of indistinguishability at levels ε, β*, and N to be the volume of the region around Θ_p where κ/N ≥ D(Θ_p‖Θ_q), so that the probability of error in distinguishing Θ_q from Θ_p is high. We find to leading order:

    V_{\epsilon,\beta^*,N} = \left( \frac{2\pi\kappa}{N} \right)^{d/2} \frac{1}{\Gamma(d/2+1)} \frac{1}{\sqrt{\det J_{ij}(\Theta_p)}} .    (2.2)
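To make equation 2.2 concrete, here is a minimal numerical sketch (mine, not the author's) for a one-dimensional gaussian location family with unit variance, where J(θ) = 1 and D(θ_p‖θ_q) = (θ_q − θ_p)²/2; in this case the leading-order volume can be checked against the exact length of the interval defined by κ/N ≥ D.

```python
# Sketch (illustrative only): check equation 2.2 for a 1-D gaussian
# location family with unit variance, where J = 1 and
# D(theta_p || theta_q) = (theta_q - theta_p)^2 / 2.
import numpy as np
from scipy.special import gamma

def volume_of_indistinguishability(kappa, N, d=1, det_J=1.0):
    """Leading-order volume from equation 2.2."""
    return (2 * np.pi * kappa / N) ** (d / 2) / (gamma(d / 2 + 1) * np.sqrt(det_J))

beta_star, eps = 0.9, 0.01
kappa = -np.log(beta_star) + np.log(1 - eps)   # kappa = -ln(beta*) + ln(1 - eps)

for N in [100, 1000, 10000]:
    v_formula = volume_of_indistinguishability(kappa, N)
    # Exact region {dtheta : dtheta^2 / 2 <= kappa / N} is an interval of
    # length 2 * sqrt(2 * kappa / N), which equation 2.2 reproduces for d = 1.
    v_exact = 2 * np.sqrt(2 * kappa / N)
    print(N, v_formula, v_exact)
```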
If β* is very close to one, the distributions inside V_{ε,β*,N} are not very distinguishable, and the Bayesian prior should not treat them as separate distributions. We wish to construct a measure on the parameter manifold that reflects this indistinguishability. We also assume a principle of translation invariance by supposing that volumes of indistinguishability at given values of N, β*, and ε should have the same measure regardless of where in the space of distributions they are centered.

An integration measure reflecting these principles of indistinguishability and translation invariance can be defined at each level β*, ε, and N by covering the parameter manifold economically with volumes of indistinguishability and placing a delta function in the center of each element of the cover. This definition reflects indistinguishability by ignoring variations on a scale smaller than the covering volumes and reflects translation invariance by giving each covering volume equal weight in integrals over the parameter manifold. The measure can be normalized by an integral over the entire parameter manifold to give a prior distribution. The continuum limit of this discretized measure is obtained by taking the limits β* → 1, ε → 0, and N → ∞. In this limit, the measure counts distributions that are completely indistinguishable (β* = 1) even in the presence of an infinite amount of data (N = ∞).⁵

4 I have assumed that the derivatives with respect to Θ commute with expectations taken in the distribution Θ_p to identify the Fisher information with the matrix of second derivatives of the relative entropy.
5 The α and β errors can be treated more symmetrically using the Chernoff bound instead of Stein's lemma, but I will not do that here.
To see the effect of the above procedure, imagine a parameter manifold that can be partitioned into k regions in each of which the Fisher information is constant. Let J_i, U_i, and V_i, respectively, be the Fisher information, parametric volume, and volume of indistinguishability in the ith region. Then the prior assigned to the ith volume by the procedure will be P_i = (U_i/V_i) / Σ_{j=1}^k (U_j/V_j) = U_i √(det J_i) / Σ_{j=1}^k U_j √(det J_j). Since all the β*, ε, and N dependencies cancel, we are now free to take the continuum limit of P_i. This suggests that the prior density induced by the prescription described in the previous paragraph is:

    w(\Theta) = \frac{\sqrt{\det J(\Theta)}}{\int d^d\Theta \, \sqrt{\det J(\Theta)}} .    (2.3)
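As a concrete instance (my illustration, using standard results rather than anything computed in this article), the Bernoulli family has Fisher information J(θ) = 1/[θ(1 − θ)], and equation 2.3 then reproduces the familiar Beta(1/2, 1/2) form of Jeffreys' prior:

```python
# Sketch: Jeffreys' prior (equation 2.3) for the Bernoulli family,
# whose Fisher information is J(theta) = 1 / (theta * (1 - theta)).
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 10001)
sqrt_det_J = 1.0 / np.sqrt(theta * (1 - theta))

# Normalize by the integral in the denominator of equation 2.3.
w = sqrt_det_J / np.trapz(sqrt_det_J, theta)

print(np.trapz(w, theta))                   # ~ 1.0: w is a density
print(np.trapz(sqrt_det_J, theta), np.pi)   # the normalizer is ~ pi
```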
By paying careful attention to technical difficulties involving sets of measure zero and certain sphere-packing problems, it can be rigorously shown that the normalized continuum measure on a parameter manifold that reflects indistinguishability and translation invariance is w(Θ), or Jeffreys' prior (Balasubramanian 1996). In essence, the heuristic argument and the derivation in Balasubramanian (1996) show how to "divide out" the volume of indistinguishable distributions on a parameter manifold and hence give equal weight to equally distinguishable volumes of distributions. In this sense, Jeffreys' prior is seen to be a uniform prior on the distributions indexed by a parametric family.

2.2 Connection with Statistical Mechanics. Putting everything together, we get the following expression for the Bayesian posterior probability of a parametric family in the absence of any prior knowledge about the relative likelihood of the distributions indexed by the family:

    \Pr(A|E) = \frac{\int d^d\Theta \, \sqrt{\det J} \, \exp\left[ -N \left( -\frac{\ln \Pr(E|\Theta)}{N} \right) \right]}{\int d^d\Theta \, \sqrt{\det J}} .    (2.4)
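For a one-parameter example, the following sketch (mine; the Bernoulli family and the data are chosen only for illustration) evaluates the logarithm of equation 2.4's ratio by quadrature, given N coin flips with k heads. The overall factor Pr(E), common to all families, is omitted since it cancels in comparisons between families.

```python
# Sketch: log of the (unnormalized) posterior weight of equation 2.4
# for a Bernoulli family under Jeffreys' prior, given k heads in N flips.
import numpy as np

def log_family_weight(k, N, n_grid=20001):
    theta = np.linspace(1e-6, 1 - 1e-6, n_grid)
    sqrt_J = 1.0 / np.sqrt(theta * (1 - theta))
    log_like = k * np.log(theta) + (N - k) * np.log(1 - theta)
    # exp[-N(-(1/N) ln Pr(E|theta))] is simply the likelihood Pr(E|theta).
    integrand = sqrt_J * np.exp(log_like - log_like.max())
    log_numerator = np.log(np.trapz(integrand, theta)) + log_like.max()
    log_volume = np.log(np.trapz(sqrt_J, theta))   # ln of the denominator
    return log_numerator - log_volume

print(log_family_weight(60, 100))
```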
This equation resembles a partition function with a temperature 1/N and an energy function (−1/N) ln Pr(E|Θ). The dependence on the data E is similar to the dependence of a disordered partition function on the specific set of defects introduced into the system.

The analogy can be made stronger, since the strong law of large numbers says that (−1/N) ln Pr(E|Θ) = (−1/N) Σ_{i=1}^N ln Pr(e_i|Θ) converges in the almost sure sense to:

    E_t\left[ -\ln \Pr(e_i|\Theta) \right] = \int dx \, t(x) \ln\left[ \frac{t(x)}{\Pr(x|\Theta)} \right] - \int dx \, t(x) \ln t(x) = D(t\|\Theta) + h(t) .    (2.5)
Here D(t‖Θ) is the relative entropy between the true distribution and the distribution indexed by Θ, and h(t) is the differential entropy of the true distribution, which is presumed to be finite. With this large-N limit in mind, we rewrite the posterior probability in equation 2.4 as the following partition function:

    \Pr(A|E) = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N(H_0 + H_d)}}{\int d^d\Theta \, \sqrt{J}} ,    (2.6)

where H_0(Θ) = D(t‖Θ) and H_d(E, Θ) = (−1/N) ln Pr(E|Θ) − D(t‖Θ) − h(t). (Equation 2.6 differs from equation 2.4 by an irrelevant factor of exp[−N h(t)].) H_0 can be regarded as the "energy" of the "state" Θ, while H_d is the additional contribution that arises by interaction with the "defects" represented by the data.

It is instructive to examine the quenched approximation to this disordered partition function. (See Ma 1985 for a discussion of quenching in statistical mechanical systems.) Quenching is carried out by taking the expectation value of the energy of a state in the distribution generating the defects. In the above system E_t[H_d] = 0, giving the quenched posterior probability

    \Pr(A|E)_Q = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N D(t\|\Theta)}}{\int d^d\Theta \, \sqrt{J}} .    (2.7)
In Section 4 we will see that the logarithm of the posterior probability converges to the logarithm of the quenched probability in a certain sense. This will lead us to regard the quenched probability as a sort of theoretical index of the complexity of a parametric family relative to a given true distribution.

3 Asymptotic Analysis or Low-Temperature Expansion

Equation 2.4 represents the full content of Bayesian model selection. However, in order to extract some insight, it is necessary to examine special cases. Let −ln Pr(E|Θ) be a smooth function of Θ that attains a global minimum at Θ̂, and assume that J_ij(Θ) is a smooth function of Θ that is positive definite at Θ̂. Finally, suppose that Θ̂ lies in the interior of the compact parameter space and that the values of local minima are bounded away from the global minimum by some b.⁶ For any given b, for sufficiently large N, the Bayesian posterior probability will then be dominated by the neighborhood of Θ̂, and we can carry out a low-temperature expansion around the saddlepoint at Θ̂. We take the metric on the parameter manifold to be the Fisher information, since the Jeffreys' prior has the form of a measure derived from such a metric.

6 In Section 3.2 we will discuss how to relax these conditions.
This choice of metric also follows the work described in Amari (1985) and Amari et al. (1987). We will use ∇_µ to indicate the covariant derivative with respect to Θ^µ, with a flat connection for the Fisher information metric.⁷ Readers who are unfamiliar with covariant derivatives may read ∇_µ as the partial derivative with respect to Θ^µ, since I will not be emphasizing the geometric content of the covariant derivative.

Let Ĩ_{µ1,...,µi} = (−1/N) ∇_{µ1} ⋯ ∇_{µi} ln Pr(E|Θ)|_{Θ̂} and F_{µ1,...,µi} = ∇_{µ1} ⋯ ∇_{µi} Tr ln J_ij|_{Θ̂}, where Tr represents the trace of a matrix. Writing (det J)^{1/2} as exp[(1/2) Tr ln J], we Taylor expand the exponent in the integrand of the Bayesian posterior around Θ̂ and rescale the integration variable to Φ = N^{1/2}(Θ − Θ̂) to arrive at:

    \Pr(A|E) = \frac{ e^{-\left[ -\ln \Pr(E|\hat\Theta) - \frac{1}{2} \mathrm{Tr} \ln J(\hat\Theta) \right]} \, N^{-d/2} \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + G(\Phi) \right)} }{ \int d^d\Theta \, \sqrt{\det J_{ij}} } .    (3.1)

Here G(Φ) collects the terms that are suppressed by powers of N:

    G(\Phi) = \sum_{i=1}^{\infty} \frac{1}{N^{i/2}} \left[ \frac{1}{(i+2)!} \tilde I_{\mu_1 \ldots \mu_{i+2}} \phi^{\mu_1} \cdots \phi^{\mu_{i+2}} - \frac{1}{2 \, i!} F_{\mu_1 \ldots \mu_i} \phi^{\mu_1} \cdots \phi^{\mu_i} \right]
            = \frac{1}{\sqrt{N}} \left[ \frac{1}{3!} \tilde I_{\mu_1\mu_2\mu_3} \phi^{\mu_1}\phi^{\mu_2}\phi^{\mu_3} - \frac{1}{2} F_{\mu_1} \phi^{\mu_1} \right] + \frac{1}{N} \left[ \frac{1}{4!} \tilde I_{\mu_1 \ldots \mu_4} \phi^{\mu_1} \cdots \phi^{\mu_4} - \frac{1}{2 \cdot 2!} F_{\mu_1\mu_2} \phi^{\mu_1}\phi^{\mu_2} \right] + O\left( \frac{1}{N^{3/2}} \right) .    (3.2)

As before, repeated indices are summed over. The integral in equation 3.1 may now be evaluated in a series expansion using a standard trick from statistical mechanics (Itzykson and Drouffe 1989). Define a "source" h = {h_1, ..., h_d} as an auxiliary variable. Then it is easy to verify that:

    \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + G(\Phi) \right)} = e^{-G(\nabla_h)} \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + h_\mu \phi^\mu \right)} ,    (3.3)
where the argument of G, Φ = (φ^1, ..., φ^d), has been replaced by ∇_h = {∂_{h_1}, ..., ∂_{h_d}}, and we assume that the derivatives commute with the integral. The remaining obstruction is the compactness of the parameter space. We make the final assumption that the bounds of the integration can be extended to infinity with negligible error, since Θ̂ is sufficiently in the interior or because N is sufficiently large. Performing the gaussian integral in equation 3.3 and applying the differential operator exp[−G(∇_h)], we find an asymptotic series in powers of 1/N.

7 See Amari (1985) and Amari et al. (1987) for discussions of differential geometry in a statistical setting.
It turns out to be most useful to examine χ_E(A) ≡ −ln Pr(A|E). Defining V = ∫ d^dΘ √(det J(Θ)), we find to O(1/N):

    \chi_E(A) = N \left( \frac{-\ln \Pr(E|\hat\Theta)}{N} \right) + \frac{d}{2} \ln N - \frac{1}{2} \ln\left[ \frac{\det J_{ij}(\hat\Theta)}{\det \tilde I_{\mu\nu}(\hat\Theta)} \right] - \ln\left[ \frac{(2\pi)^{d/2}}{V} \right]
        + \frac{1}{N} \Bigg\{ \frac{\tilde I_{\mu_1\mu_2\mu_3\mu_4}}{4!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\mu_4} + \cdots \Big]
        - \frac{F_{\mu_1\mu_2}}{2 \cdot 2!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} + (\tilde I^{-1})^{\mu_2\mu_1} \Big]
        - \frac{\tilde I_{\mu_1\mu_2\mu_3} \tilde I_{\nu_1\nu_2\nu_3}}{2! \, 3! \, 3!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\nu_1} (\tilde I^{-1})^{\nu_2\nu_3} + \cdots \Big]
        - \frac{F_{\mu_1} F_{\mu_2}}{2! \cdot 4 \cdot 2! \cdot 2!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} + \cdots \Big]
        + \frac{F_{\mu_1} \tilde I_{\mu_2\mu_3\mu_4}}{2! \cdot 2 \cdot 2! \cdot 3!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\mu_4} + \cdots \Big] \Bigg\} .    (3.4)
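The first three terms of this expansion can be checked numerically in one dimension. The sketch below (my construction, for illustration only) evaluates them for a Bernoulli family, where for one datum J(θ) = 1/[θ(1 − θ)], the observed information per datum at the maximum likelihood point happens to equal J there, and V = ∫₀¹ √J dθ = π; the result is compared with χ_E computed exactly by quadrature.

```python
# Sketch: the O(N), O(ln N), and O(1) terms of equation 3.4 for a
# Bernoulli family, compared against chi_E computed by quadrature.
import numpy as np

k, N = 60, 100
theta_hat = k / N                                  # maximum likelihood point
log_like_hat = k * np.log(theta_hat) + (N - k) * np.log(1 - theta_hat)

det_J = 1.0 / (theta_hat * (1 - theta_hat))  # Fisher information, one datum
det_I = det_J   # for Bernoulli the observed information per datum equals J
V = np.pi       # integral of sqrt(J) over [0, 1]

chi_approx = (-log_like_hat                        # O(N) accuracy term
              + 0.5 * np.log(N)                    # (d/2) ln N, with d = 1
              - 0.5 * np.log(det_J / det_I)        # O(1) robustness term (= 0 here)
              - np.log(np.sqrt(2 * np.pi) / V))    # ln[(2 pi)^{d/2} / V]

theta = np.linspace(1e-6, 1 - 1e-6, 200001)
integrand = np.exp(k * np.log(theta) + (N - k) * np.log(1 - theta)) \
            / np.sqrt(theta * (1 - theta))
chi_exact = -np.log(np.trapz(integrand, theta) / V)
print(chi_approx, chi_exact)   # should agree up to O(1/N) corrections
```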
Terms of higher orders in 1/N are easily evaluated with a little labor, and systematic diagrammatic expansions can be developed (Itzykson and Drouffe 1989). In the next section I discuss the meaning of equation 3.4.

3.1 Meaning of the Asymptotic Expansion. I will show that the Bayesian posterior measures simplicity and accuracy of a parametric family by examining equation 3.4 and noting that models with larger Pr(A|E), and hence smaller χ_E(A), are better. Since the expansion in equation 3.4 is only asymptotic and not convergent, the number of terms that can be reliably kept will depend on the specific model family A under consideration, so we will simply analyze the general behavior of the first few terms. Furthermore, χ_E(A)/N is a random variable whose standard deviation can be expected to be O(1/√N). Although these fluctuations threaten to overwhelm subleading terms in the expansion, it is sensible to consider the meaning of these terms because they can be important when N is not asymptotically large.

The O(N) term, N(−ln Pr(E|Θ̂)/N), which dominates asymptotically, is the negative log likelihood of the data evaluated at the maximum likelihood point.⁸ It measures the accuracy with which the parametric family can describe the available data.

8 This term is O(N), not O(1), because (1/N) ln Pr(E|Θ̂) approaches a finite limit at large N.
We will see in Section 4 that for sufficiently large N, model families with the smallest relative entropy distance to the true distribution are favored by this term. The term of O(N) arises from the saddlepoint value of the integrand in equation 2.4 and represents the Landau approximation to the partition function.

The term of O(ln N) penalizes models with many degrees of freedom and is a measure of simplicity. This term arises physically from the statistical fluctuations around the saddlepoint configuration. These fluctuations cause the partition function in equation 2.4 to scale as N^{−d/2}, leading to the logarithmic term in χ_E. Note that the terms of O(N) and O(ln N) have appeared together in the literature as the stochastic complexity of a parametric family relative to a collection of data (Rissanen 1984, 1986). This definition is justified by arguing that a family with the lowest stochastic complexity provides the shortest codes for the data in the limit that N → ∞. Our results suggest that stochastic complexity is merely a truncation of the logarithm of the posterior probability of a model family and that adding the subleading terms in χ_E to the definition of stochastic complexity would yield shorter codes for finite N.

The O(1) term, which arises from the determinant of quadratic fluctuations around the saddlepoint, is even more interesting. The determinant of Ĩ^{−1} is proportional to the volume of the ellipsoid in parameter space around Θ̂ where the value of the integrand of the Bayesian posterior is significant.⁹ The scale for determining whether det Ĩ^{−1} is large or small is set by the Fisher information on the surface whose determinant defines the volume element. Consequently, the term ln(det J/det Ĩ)^{1/2} can be understood as measuring the robustness of the model in the sense that it measures the relative volume of the parameter space that provides good models of the data. More robust models in this sense will be less sensitive to the precise choice of parameters. We also observe from the discussion regarding Jeffreys' prior that the volume of indistinguishability around Θ̂ is proportional to (det J)^{−1/2}. So the quantity (det J/det Ĩ)^{1/2} is essentially proportional to the ratio V_large/V_indist, the ratio of the volume where the integrand of the Bayesian posterior is large to the volume of indistinguishability introduced earlier. Essentially, a model family is better (more natural or robust) if it contains many distinguishable distributions that are close to the true distribution. Related observations have been made before (MacKay 1992a, 1992b; Clarke and Barron 1990) but without the interpretation in terms of the robustness of a model family.

The term ln[V/(2π)^{d/2}] can be understood as a preference for models that have a smaller invariant volume in the space of distributions and hence are more constrained. It may be argued that this term is a bit artificial, since it can be ill-defined for models with noncompact parameter spaces. In such situations, a regulator would have to be placed on the parameter space to yield a finite volume.

9 If we fix a fraction f < 1 where f is close to 1, the integrand of the Bayesian posterior will be greater than f times the peak value in an elliptical region around the maximum.
The terms proportional to 1/N are less easy to interpret. They involve higher derivatives of the metric on the parameter manifold and of the relative entropy distances between points on the manifold and the true distribution. This suggests that these terms essentially penalize high curvatures of the model manifold, but it is hard to extract such an interpretation in terms of components of the curvature tensor on the manifold. It is worth noting that while terms of O(1) and larger in χ_E(A) depend at most on the measure (prior distribution) assigned to the parameter manifold, the terms of O(1/N) depend on the geometry via the connection coefficients in the covariant derivatives. For this reason, the O(1/N) terms are the leading probes of the effect that the geometry of the space of distributions has on statistical inference in a Bayesian setting, and so it would be very interesting to analyze them.

In summary, Bayesian model family inference embodies Occam's razor because, for small N, the subleading terms that measure simplicity and robustness will be important, while for large N, the accuracy of the model family dominates.

3.2 Analysis of More General Situations. The asymptotic expansion in equation 3.4 and the subsequent analysis were carried out for the special case of a posterior probability with a single global maximum in the integrand that lay in the interior of the parameter space. Nevertheless, the basic insights are applicable far more generally. First, if the global maximum lies on the boundary of the parameter space, we can account for the portion of the peak that is cut off by the boundary and reach essentially the same conclusions. Second, if there are multiple discrete global maxima, each contributes separately to the asymptotic expansion, and the contributions can be added to reach the same conclusions.

The most important difficulty arises when the global maximum is degenerate, so that the matrix Ĩ in equation 3.4 has zero eigenvalues. The eigenvectors corresponding to these zeroes are tangent to directions in parameter space in which the value of the maximum is unchanged up to second order in perturbations around the maximum. These sorts of degeneracies are particularly likely to arise when the true distribution is not a member of the family under consideration and can be dealt with by the method of collective coordinates. Essentially, we would choose new parameters for the model, a subset of which parameterize the degenerate subspace. The integral over the degenerate subspace then factors out of the integral in equation 3.3 and essentially contributes a factor of the volume of the degenerate subspace times terms arising from the action of the differential operator exp[−G(∇_h)]. The evaluation of specific examples of this method in the context of statistical inference will be left to future publications.

There are situations in which the perturbative expansion in powers of 1/N is invalid. For example, the partition function in equation 2.6, regarded as a function of N, may have singularities.
These singularities and the associated breakdown of the perturbative analysis of this section would be of the utmost interest, since they would be signatures of "phase transitions" in the process of statistical inference. This point will be discussed further in Section 5.

4 The Razor of a Model Family

The large-N limit of the partition function in equation 2.4 suggests the definition of an ideal theoretical index of the complexity of a parametric family relative to a given true distribution. We know from equation 2.5 that (−1/N) ln[Pr(E|Θ)] → D(t‖Θ) + h(t) as N grows large.

Now assume that the maximum likelihood estimator is consistent in the sense that Θ̂ = arg max_Θ ln Pr(E|Θ) converges in probability to Θ* = arg min_Θ D(t‖Θ) as N grows large.¹⁰ Also suppose that the log likelihood of a single outcome, ln Pr(e_i|Θ), considered as a family of functions of Θ indexed by e_i, is equicontinuous at Θ*.¹¹ Finally, suppose that all derivatives of ln Pr(e_i|Θ) with respect to Θ are also equicontinuous at Θ*.

Subject to the assumptions in the previous paragraph, it is easily shown that (−1/N) ln Pr(E|Θ̂) → D(t‖Θ*) + h(t) as N grows large. Next, using the covariant derivative with respect to Θ defined in Section 3, let J̃_{µ1,...,µi} = ∇_{µ1} ⋯ ∇_{µi} D(t‖Θ)|_{Θ*}. It also follows that Ĩ_{µ1,...,µi} → J̃_{µ1,...,µi} (Balasubramanian 1996). Since the terms in the asymptotic expansion of (1/N)(χ_E − N h(t)) (equation 3.4) are continuous functions of ln Pr(E|Θ) and its derivatives, they individually converge to limits obtained by replacing each Ĩ by J̃ and (−1/N) ln Pr(E|Θ̂) by D(t‖Θ*) + h(t). Define (−1/N) ln R_N(A) to be the sum of the series of limits of the individual terms in the asymptotic expansion of (1/N)(χ_E − N h(t)):

    \frac{-\ln R_N(A)}{N} = D(t\|\Theta^*) + \frac{d}{2N} \ln N - \frac{1}{2N} \ln\left[ \frac{\det J_{ij}(\Theta^*)}{\det \tilde J_{\mu\nu}(\Theta^*)} \right] - \frac{1}{N} \ln\left[ \frac{(2\pi)^{d/2}}{V} \right] + O\left( \frac{1}{N^2} \right) .    (4.1)

10 In other words, assume that given any neighborhood of Θ*, Θ̂ falls in that neighborhood with high probability for sufficiently large N. If the maximum likelihood estimator is inconsistent, statistics has very little to say about the inference of probability densities.
11 In other words, given any ε > 0, there is a neighborhood M of Θ* such that for every e_i and Θ ∈ M, |ln Pr(e_i|Θ) − ln Pr(e_i|Θ*)| < ε.
This formal series of limits can be resummed to obtain

    R_N(A) = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N D(t\|\Theta)}}{\int d^d\Theta \, \sqrt{J}} .    (4.2)
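As a sanity check on equations 4.1 and 4.2 (my example, not the author's), the razor can be computed in closed form for a gaussian location family {N(θ, 1) : θ ∈ [−a, a]} with true distribution t = N(0, 1), where J(θ) = 1 and D(t‖θ) = θ²/2:

```python
# Sketch: the razor (equation 4.2) for a gaussian location family with
# flat Fisher metric J = 1 on [-a, a] and true distribution N(0, 1),
# so that D(t || theta) = theta^2 / 2.  Compared with equation 4.1.
import numpy as np

def log_razor(N, a=5.0, n_grid=100001):
    theta = np.linspace(-a, a, n_grid)
    num = np.trapz(np.exp(-N * theta ** 2 / 2), theta)   # sqrt(J) = 1
    den = 2 * a                                          # integral of sqrt(J)
    return np.log(num / den)

a = 5.0
for N in [10, 100, 1000]:
    exact = log_razor(N, a)
    # Equation 4.1 with D(t||theta*) = 0, d = 1, det J = det J~, V = 2a:
    approx = -0.5 * np.log(N) + np.log(np.sqrt(2 * np.pi) / (2 * a))
    print(N, exact, approx)   # the (d/2) ln N simplicity penalty is visible
```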
We encountered R_N(A) in Section 2.2 as the quenched approximation to the partition function in equation 2.4. R_N(A) will be called the razor of the model family A.

The razor, R_N(A), is a theoretical index of the complexity of the model family A relative to the true distribution t given N data points. In a certain sense, the razor is the ideal quantity that Bayesian methods seek to estimate from the data available in a given realization of the model inference problem. Indeed, the quenched approximation to the Bayesian partition function consists precisely of averaging over the data in different realizations. The terms in the expansion of the log razor in equation 4.1 are the ideal analogs of the terms in χ_E since they arise from derivatives of the relative entropy distance between distributions indexed by the model family and the true distribution. The leading term tells us that for sufficiently large N, Bayesian inference picks the model family that comes closest to the true distribution in relative entropy. The subleading terms have the same interpretations as the terms in χ_E discussed in the previous section, except that they are the ideal quantities to which the corresponding terms in χ_E tend when enough data are available.

The razor is useful when we know the true distribution as well as the model families being used by a particular system and we wish to analyze the expected behavior of Bayesian inference. It is also potentially useful as a tool for modeling and analysis of the general types of phenomena that can occur in Bayesian inference: different relative entropy distances D(t‖Θ) can yield radically different learning behaviors, as discussed in the next section. The razor is considerably easier to analyze than the full Bayesian posterior probability since the quenched approximation to equation 2.4 given in equation 4.2 defines a statistical mechanics on the space of distributions in which the "disorder" has been averaged out. The tools of statistical mechanics can then be straightforwardly applied to a system with temperature 1/N and energy function D(t‖Θ).

4.1 Comparison with Other Model Selection Procedures. The first two terms in χ_E(A) have appeared together as the stochastic complexity of a model family relative to the true distribution t (Rissanen 1984, 1986). Equivalently, the first two terms of the razor are the expectation value in t of the stochastic complexity of a model family given N data points. The minimum description length (MDL) approach to model selection consists of picking the model that minimizes the stochastic complexity and hence agrees in leading order with the Bayesian approach of this article.
In the context of neural networks, other model selection criteria, such as the network information criterion (NIC) of Murata et al. (1994) and the related effective dimension of Moody (1992), have been used.¹² Specializing these criteria to the case of density estimation with a logarithmic loss function, we find the general form NIC = −(1/N) ln Pr(E|Θ̂) + k/N, where Θ̂ is the maximum likelihood parameter and k is an "effective dimension" computed as described in Moody (1992) and Murata et al. (1994).¹³ The second term, which is the penalty for complexity, is essentially the expected value of the difference between the log likelihoods of the test set and the training set. In other words, the term arises because on general grounds high-dimensional models are expected to generalize poorly from a small training set.

Although there are superficial similarities between χ_E(A) and the NIC, they arise from fundamentally different analyses. When the true distribution t lies within the parametric family, k equals d, the dimension of the parameter space (Murata et al. 1994). So in this special case, much as χ_E(A) converges to the razor, the two terms in the NIC converge in probability to NIC − h(t) → S = D(t‖Θ*) + d/N. In the same special case, J(Θ*) = J̃(Θ*), so that to O(1/N) the razor is given by (−1/N) ln R_N(A) = D(t‖Θ*) + d(ln N)/2N, which differs significantly from S. Since χ_E(A) agrees in leading order with the MDL criterion, models selected using the Bayesian procedure are expected to yield minimal codes for data generated from the true distribution. It is not clear that the NIC has this property. Further analysis is required to understand the advantages of the different criteria.

5 Biophysical Relevance and Some Open Questions

The general framework described here is relevant to biophysics if we believe that neural systems optimize their accumulation of information from a statistically varying environment. This is likely to be true in at least some circumstances, since an organism derives clear advantages from rapid and efficient detection and encoding of information. For example, see the discussions of Bialek (1991) and Atick (1991) of neural signal processing systems that approach physical and information theoretic limits. A creature such as a fly is faced with the problem of estimating the statistical profile of its environment from the small amount of data available at its photodetectors. The general formalism presented in this article applies to such problems, and an optimally designed fly would implement the formalism subject to the constraints of its biological hardware. In this section we examine several interesting questions in the theory of learning that can be discussed effectively in the statistical mechanical language introduced here.

12 I thank an anonymous referee for suggesting the comparison.
13 Murata et al. (1994) consider a situation in which the learning algorithm finds an estimator of Θ̂ rather than Θ̂ itself, but we will not consider this more general scenario here.
First, consider the possibility of phase transitions in the disordered partition function that describes the Bayesian posterior probability or in the quenched approximation defining the razor. Phase transitions arise from a competition between entropy and energy, which, in the present context, is a competition between simplicity and accuracy. We should expect the existence of systems in which inference at small N is dominated by "simpler" and more "robust" saddlepoints, whereas at large N more "accurate" saddlepoints are favored. As discussed in Section 3.1, the distributions in the neighborhood of simpler and more robust saddlepoints are more concentrated near the true distribution.¹⁴ The transitions between regimes dominated by these different saddlepoints would manifest themselves as singularities in the perturbative methods that led to the asymptotic expansions for χ_E and ln R_N(A).

The phase transitions are interesting even when the task at hand is not the comparison of model families but merely the selection of parameters for a given family. In Section 3.1 we interpreted the terms of O(1) in χ_E as measurements of the robustness or naturalness of a model. These robustness terms can be evaluated at different saddlepoints of a given model, and a more robust point may be preferable at small N since the parameter estimation would then be less sensitive to fluctuations in the data.

So far we have concentrated on the behavior of the Bayesian posterior and the razor as a function of the number of data points. Instead we could ask how they behave when the true distribution is changed. For example, this can happen in a biophysical context if the environment sensed by a fly changes when it suddenly finds itself indoors. In statistical mechanical terms, we wish to know what happens when the energy of a system is time dependent. If the change is abrupt, the system will dynamically move between equilibria defined by the energy functions before and after the change. If the change is very slow, we would expect adaptation that proceeds gradually. In the language of statistical inference, these adaptive processes correspond to the learning of changes in the true distribution.

A final question that has been touched on but not analyzed in this article is the influence of the geometry of parameter manifolds on statistical inference. As discussed in Section 3.1, terms of O(1/N) and smaller in the asymptotic expansions of the log Bayesian posterior and the log razor depend on details of the geometry of the parameter manifold. It would be very interesting to understand the precise meaning of this dependence.

14 In Section 3.1 I discussed how Bayesian inference embodies Occam's razor by penalizing complex families until the data justify their choice. Here we are discussing Occam's razor for choice of saddlepoints within a given family.
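A toy model (entirely my construction, not analyzed in this article) makes the competition concrete: give D(t‖θ) a broad, shallow minimum (robust but less accurate) and a narrow, deep one (accurate but fragile), and watch the quenched posterior mass of equation 2.7 switch basins as N grows.

```python
# Toy sketch: a "phase transition" in the quenched partition function
# (equation 2.7).  The energy D(t||theta) has two minima; the dominant
# saddlepoint switches as the temperature 1/N is lowered.
import numpy as np

theta = np.linspace(-4, 4, 400001)
broad = 0.10 + 0.5 * (theta + 2.0) ** 2      # robust but less accurate
narrow = 0.02 + 500.0 * (theta - 2.0) ** 2   # accurate but fragile
D = np.minimum(broad, narrow)                # relative entropy landscape

for N in [10, 30, 50, 100, 300]:
    weights = np.exp(-N * D)
    weights /= np.trapz(weights, theta)
    # Posterior mass in the broad basin (theta < 0):
    mask = theta < 0
    mass_broad = np.trapz(weights[mask], theta[mask])
    print(N, round(mass_broad, 3))   # drops from ~1 to ~0 near N ~ 45
```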
6 Conclusion

In this paper I have cast parametric model selection as a disordered statistical mechanics on the space of probability distributions. A low-temperature expansion was used to develop the asymptotics of Bayesian methods beyond the analyses available in the literature, and it was shown that Bayesian methods for model family inference embody Occam's razor. While reaching these results, I derived and discussed a novel interpretation of Jeffreys' prior density as the uniform prior on the probability distributions indexed by a parametric family. By considering the large-N limit and the quenched approximation of the disordered system implemented by Bayesian inference, we derived the razor, a theoretical index of the complexity of a parametric family relative to a true distribution. Finally, in view of the analog statistical mechanical interpretation, I discussed various interesting phenomena that should be present in systems that perform Bayesian learning. It is easy to create models that display these phenomena simply by considering families of distributions for which D(t‖Θ) has the right structure. It would be interesting to examine models of known biophysical relevance to see if they exhibit such effects, so that experiments could be carried out to verify their presence or absence in the real world. This project will be left to a future publication.

Appendix: Measuring Indistinguishability of Distributions

Let us take Θ_p and Θ_q to be points on a parameter manifold. Since we are working in the context of density estimation, a suitable measure of the distinguishability of Θ_p and Θ_q should be derived by taking N data points drawn from either p or q and asking how well we can guess which distribution produced the data. If p and q do not give very distinguishable distributions, they should not be counted separately, since that would count the same distribution twice.

Precisely this question of distinguishability is addressed in the classical theory of hypothesis testing. Suppose {e_1, ..., e_N} ∈ E^N are drawn iid from one of f_1 and f_2 with D(f_1‖f_2) < ∞. Let A_N ⊆ E^N be the acceptance region for the hypothesis that the distribution is f_1, and define the error probabilities α_N = f_1^N(A_N^C) and β_N = f_2^N(A_N). (A_N^C is the complement of A_N in E^N, and f^N denotes the product distribution on E^N describing N iid outcomes drawn from f.) In these definitions, α_N is the probability that f_1 was mistaken for f_2, and β_N is the probability of the opposite error. Stein's lemma tells us how low we can make β_N given a particular value of α_N. Indeed, let us define β_N^ε = min_{A_N ⊆ E^N, α_N ≤ ε} β_N. Then Stein's lemma tells us that

    \lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \ln \beta_N^{\epsilon} = -D(f_1 \| f_2) .    (A.1)
By examining the proof of Stein's lemma (Cover and Thomas 1991), we find that for fixed ε and sufficiently large N, the optimal choice of decision region places the following bound on β_N^ε:

    -D(f_1\|f_2) - \delta_N + \frac{\ln(1-\alpha_N)}{N} \le \frac{\ln \beta_N^{\epsilon}}{N} \le -D(f_1\|f_2) + \delta_N + \frac{\ln(1-\alpha_N)}{N} ,    (A.2)
where α_N < ε for sufficiently large N. The δ_N are any sequence of positive constants that satisfy the property that

    \alpha_N = f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln \frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta_N \right) \le \epsilon    (A.3)
for all sufficiently large N. Now (1/N) Σ_{i=1}^N ln[f_1(e_i)/f_2(e_i)] converges to D(f_1‖f_2) by the law of large numbers, since D(f_1‖f_2) = E_{f_1}[ln(f_1(e_i)/f_2(e_i))]. So for any fixed δ we have:

    f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln \frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta \right) < \epsilon    (A.4)
for all sufficiently large N. For a fixed ε and a fixed N, let Δ_{ε,N} be the collection of δ > 0 that satisfy equation A.4. Let δ_{εN} be the infimum of the set Δ_{ε,N}. Equation A.4 guarantees that for any δ > 0, for any sufficiently large N, 0 < δ_{εN} < δ. We conclude that δ_{εN} chosen in this way is a sequence that converges to zero as N → ∞ while satisfying the condition in equation A.3 that is necessary for proving Stein's lemma.

We will now apply these facts to the problem of distinguishability of points on a parameter manifold. Let Θ_p and Θ_q index two distributions on a parameter manifold, and suppose that we are given N outcomes generated independently from one of them. We are interested in using Stein's lemma to determine how distinguishable Θ_p and Θ_q are. By Stein's lemma,

    -D(\Theta_p\|\Theta_q) - \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} \le \frac{\ln \beta_N^{\epsilon}(\Theta_q)}{N} \le -D(\Theta_p\|\Theta_q) + \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} ,    (A.5)
where we have written δ_{εN}(Θ_q) and β_N^ε(Θ_q) to emphasize that these quantities are functions of Θ_q for a fixed Θ_p.
Let A = −D(Θ_p‖Θ_q) + (1/N) ln(1 − α_N) be the average of the upper and lower bounds in equation A.5. Then A ≥ −D(Θ_p‖Θ_q) + (1/N) ln(1 − ε) because the δ_{εN}(Θ_q) have been chosen to satisfy equation A.5. We now define the set of distributions U_N = {Θ_q : −D(Θ_p‖Θ_q) + (1/N) ln(1 − ε) ≥ (1/N) ln β*}, where 1 > β* > 0 is some fixed constant. Note that as N → ∞, D(Θ_p‖Θ_q) → 0 for Θ_q ∈ U_N. We want to show that U_N is a set of distributions that cannot be well distinguished from Θ_p.

The first way to see this is to observe that the average of the upper and lower bounds on ln β_N^ε is greater than or equal to ln β* for Θ_q ∈ U_N. So in this loose, average sense, the error probability β_N^ε exceeds β* for Θ_q ∈ U_N. More carefully, note that (1/N) ln(1 − α_N) ≥ (1/N) ln(1 − ε) by choice of the δ_{εN}(Θ_q). So using equation A.5, we see that (1/N) ln β_N^ε(Θ_q) ≥ (1/N) ln β* − δ_{εN}(Θ_q). Exponentiating this inequality, we find that

    1 \ge \left[ \beta_N^{\epsilon}(\Theta_q) \right]^{1/N} \ge (\beta^*)^{1/N} e^{-\delta_{\epsilon N}(\Theta_q)} .    (A.6)
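A small Monte Carlo sketch (mine; the gaussian family, δ, and sample sizes are illustrative choices) shows the behavior described here for a fixed alternative: using the relative-entropy typical-set test of equation A.3 for θ_p = N(0, 1) against θ_q = N(1, 1), the quantity (β_N)^{1/N} stays bounded away from zero and below the e^{−(D−δ)} bound.

```python
# Sketch: Monte Carlo type-II error for the typical-set test (eq. A.3)
# between theta_p = N(0,1) and theta_q = N(1,1), for which
# D(theta_p || theta_q) = mu^2 / 2 with mu = 1.  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu, delta, trials = 1.0, 0.25, 200000
D = mu ** 2 / 2

for N in [2, 5, 10, 20]:
    x = rng.normal(mu, 1.0, size=(trials, N))     # data drawn from theta_q
    # Per-outcome log-likelihood ratio ln(p/q) = mu^2/2 - mu*x.
    llr = np.mean(mu ** 2 / 2 - mu * x, axis=1)
    beta = np.mean(np.abs(llr - D) < delta)       # theta_p wrongly accepted
    print(N, beta ** (1.0 / N), np.exp(-(D - delta)))   # stays below the bound
```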
The significance of equation A.6 is best understood by considering parametric families in which, for every Θ_q, X_q(e_i) = ln[Θ_p(e_i)/Θ_q(e_i)] is a random variable with finite mean and bounded variance in the distribution indexed by Θ_p. In that case, taking b to be the bound on the variances, Chebyshev's inequality says that

    \Theta_p^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} X_q(e_i) - D(\Theta_p\|\Theta_q) \right| > \delta \right) \le \frac{\mathrm{Var}(X)}{\delta^2 N} \le \frac{b}{\delta^2 N} .    (A.7)
In order to satisfy α_N ≤ ε, it suffices to choose δ = (b/Nε)^{1/2}. So if the bounded variance condition is satisfied, δ_{εN}(Θ_q) ≤ (b/Nε)^{1/2} for any Θ_q, and therefore we have the limit lim_{N→∞} sup_{Θ_q∈U_N} δ_{εN}(Θ_q) = 0. Applying this limit to equation A.6, we find that

    1 \ge \lim_{N\to\infty} \inf_{\Theta_q \in U_N} \left[ \beta_N^{\epsilon}(\Theta_q) \right]^{1/N} \ge 1 \times \lim_{N\to\infty} \inf_{\Theta_q \in U_N} e^{-\delta_{\epsilon N}(\Theta_q)} = 1 .    (A.8)
In summary, we find that lim_{N→∞} inf_{Θ_q∈U_N} [β_N^ε(Θ_q)]^{1/N} = 1. This is to be contrasted with the behavior of β_N^ε(Θ_q) for any fixed Θ_q ≠ Θ_p, for which lim_{N→∞} [β_N^ε(Θ_q)]^{1/N} = exp[−D(Θ_p‖Θ_q)] < 1. We have essentially shown that the sets U_N contain distributions that are not very distinguishable from Θ_p. The smallest one-sided error probability β_N^ε for distinguishing between Θ_p and Θ_q ∈ U_N remains essentially constant, leading to the asymptotics in equation A.8.

Define κ ≡ −ln β* + ln(1 − ε), so that we can summarize the region U_N of high probability of error β* at fixed ε as κ/N ≥ D(Θ_p‖Θ_q). In this region, the distributions are indistinguishable from Θ_p, with error probabilities α_N ≤ ε and (β_N^ε)^{1/N} ≥ (β*)^{1/N} exp(−δ_{εN}).
Acknowledgments

I thank Kenji Yamanishi for several fruitful conversations. I have also had useful discussions and correspondence with Steve Omohundro, Erik Ordentlich, Don Kimber, Phil Chou, and Erhan Cinlar. Finally, I am grateful to Curt Callan for supporting and encouraging this project and to Phil Anderson for helping with travel funds to the 1995 Workshop on Maximum Entropy and Bayesian Methods. This work was supported in part by DOE grant DE-FG02-91ER40671.

References

Amari, S. I. 1985. Differential Geometrical Methods in Statistics. Springer-Verlag, Berlin.
Amari, S. I., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., and Rao, C. R. 1987. Differential Geometry in Statistical Inference, vol. 10, Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, CA.
Atick, J. J. 1991. Could information theory provide an ecological theory of sensory processing? In Princeton Lectures in Biophysics, W. Bialek, ed., pp. 223–291. World Scientific, Singapore.
Balasubramanian, V. 1996. A geometric formulation of Occam's razor for inference of parametric distributions. Available as preprint number adap-org/9601001 from http://xyz.lanl.gov/ and as Princeton University Physics Preprint PUPT-1588.
Barron, A. R. 1985. Logically smooth density estimation. Ph.D. thesis, Stanford University.
Barron, A. R., and Cover, T. M. 1991. Minimum complexity density estimation. IEEE Transactions on Information Theory 37(4), 1034–1054.
Bialek, W. 1991. Optimal signal processing in the nervous system. In Princeton Lectures in Biophysics, W. Bialek, ed., pp. 321–401. World Scientific, Singapore.
Clarke, B. S., and Barron, A. R. 1990. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory 36(3), 453–471.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley, New York.
Itzykson, C., and Drouffe, J. M. 1989. Statistical Field Theory. Cambridge University Press, Cambridge.
Jeffreys, H. 1961. Theory of Probability. 3d ed. Oxford University Press, New York.
Lee, P. M. 1989. Bayesian Statistics: An Introduction. Oxford University Press, New York.
Ma, S. K. 1985. Statistical Mechanics. World Scientific, Singapore.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4(3), 415–447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backpropagation networks. Neural Computation 4(3), 448–472.
Moody, J. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. Proceedings of NIPS 4, 847–854.
Murata, N., Yoshizawa, S., and Amari, S. 1994. Network information criterion— determining the number of hidden units for artificial neural network models. IEEE Trans. Neural Networks 5, 865–872. Potters, M., and Bialek, W. 1994. Statistical mechanics and visual signal processing. Journal de Physique I 4(11), 1755–1775. Rissanen, J. 1984. Universal coding, information, prediction and estimation. IEEE Transactions on Information Theory 30(4), 629–636. Rissanen, J. 1986. Stochastic complexity and modelling. Annals of Statistics 14(3), 1080–1100. de Ruyter van Steveninck, R. R., Bialek, W., Potters, M., Carlson, R. H., and Lewen, G. D. 1995. Adaptive movement computation by the blowfly visual system. Princeton Lectures in Biophysics. Wallace, C. S., and Freeman, P. R. 1987. Estimation and inference by compact coding. Journal of the Royal Statistical Society 49(3), 240–265. Yamanishi, K. 1995. A decision-theoretic extension of stochastic complexity and its applications to learning. Submitted for publication.
Received January 16, 1996; accepted April 24, 1996.
Communicated by Steven Nowlan
Bias/Variance Analyses of Mixtures-of-Experts Architectures

Robert A. Jacobs
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627 USA
This article investigates the bias and variance of mixtures-of-experts (ME) architectures. The variance of an ME architecture can be expressed as the sum of two terms: the first term is related to the variances of the expert networks that comprise the architecture and the second term is related to the expert networks' covariances. One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. A second goal is to clarify the relationships between this class of systems and other systems that have recently been proposed. It is shown that in contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

1 Introduction

Many researchers recently have studied statistical models that estimate the value of a random variable by combining the estimates of other models, henceforth referred to as experts. Theoretical and experimental results have established that when the experts are unbiased estimators, combination procedures are most effective when the experts' estimates are negatively correlated; they are moderately effective when the experts are uncorrelated and only mildly effective when the experts are positively correlated.

As an example of an analysis of combinations that use unbiased experts, Clemen and Winkler (1985) quantified the utility of positively correlated, uncorrelated, and negatively correlated experts when the experts' estimates are combined via Bayes' rule. These investigators represented the statistical dependence among the experts by the dependence among their estimation errors and then considered the number of independent experts that carry the same amount of information as a given number of dependent experts. Given certain assumptions, Clemen and Winkler showed that the number of independent experts that are worth the same as an infinite number of positively correlated experts is equal to the inverse of the correlation among the experts. This limit is surprisingly low. If, for instance, the correlation among the experts is 0.5, then an infinite number of such experts carry as much information as only two independent experts. After the first expert, which is worth one independent expert, all other experts combined are worth only one additional independent expert.
The opposite situation is found when negative dependence among the experts is considered. That is, whereas positive dependence among the experts diminishes the effectiveness of a combination procedure, negative dependence increases it.

As a consequence of these types of theoretical results, as well as of empirical studies that show that these general findings also hold in practice (Perrone 1993; Hashem 1993), several researchers have investigated architectures and training procedures for obtaining unbiased experts whose estimation errors are uncorrelated (Meir 1994; Raviv and Intrator, in press; Tresp and Taniguchi 1995). In contrast, this article studies a class of architectures, known as mixtures-of-experts (ME) architectures, that adopt a very different strategy. The analyses are conducted by estimating the bias and variance of these models under a variety of conditions. Based on a result due to Meir (1994), it is shown that the architecture's variance can be expressed as the sum of two terms: the first related to the variances of the experts that comprise the architecture and the second related to the experts' covariances.

One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. Because these systems consist of several interacting components, their performance properties can be difficult to understand in a rigorous way. The analyses presented here are useful in this regard. A second goal is to clarify the relationships between this class of architectures and other architectures that have recently been proposed. Instead of producing unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

The article is organized as follows. Section 2 briefly overviews the ME architecture and its associated learning procedure. Section 3 presents the equations for computing the bias and variance of these architectures. Section 4 presents the results of estimating the biases and variances of ME architectures operating with different numbers of components, operating with different parameter settings, and operating in different noise environments.

2 Mixtures-of-Experts Architectures

The architectures studied in this article are members of the ME family of architectures. This family is of interest on both theoretical and empirical grounds. From a theoretical viewpoint, the architectures combine aspects of finite mixture models and generalized linear models, two well-studied statistical frameworks (Jordan and Jacobs 1994; McCullagh and Nelder 1989; McLachlan and Basford 1988). From an empirical viewpoint, they have been shown to be capable of comparatively fast learning and good generalization on a wide variety of regression and classification tasks (Jacobs et al. 1991; Jordan and Jacobs 1994; Nowlan and Hinton 1991; Waterhouse and Robinson 1994).
ME architectures are multinetwork, or modular, architectures that combine aspects of competitive and associative learning. Different architectures within this framework may be formed by placing the networks in different structural arrangements. A probabilistic interpretation exists such that for each arrangement of networks, there is a corresponding likelihood function that characterizes an architecture's performance. Learning occurs by maximizing the likelihood function. ME architectures attempt to solve problems using a "divide-and-conquer" strategy; that is, complex problems are decomposed into a set of simpler subproblems. It is assumed that the data can be adequately summarized by a collection of functions, each defined over a local region of the input space. ME architectures adaptively partition the input space into possibly overlapping regions and allocate different networks to summarize the data located in different regions.

This section briefly overviews the ME architectures used in this article. More extensive discussions of ME architectures can be found in Jacobs et al. (1991), Jacobs and Jordan (1991, 1993), Jordan and Jacobs (1994), Jordan and Xu (1993), Nowlan and Hinton (1991), and Peng et al. (1996).

For the purposes of this article, it is assumed that the data are generated by a number of different probabilistic rules. Let D = {(x^(t), y^(t))} denote the collection of training data, where x is an input vector, y is a target output, and t is a time index. At each time step, a rule is selected from a conditional multinomial distribution with probability g_i^(t) = p(i|x^(t), V), where i indexes a rule and V = [v_1, ..., v_I] is the matrix of parameters underlying the distribution. The selected rule generates an output y with probability p(y^(t)|x^(t), U_i, Φ_i), where U_i is a parameter matrix and Φ_i represents other (possibly nuisance) parameters.

In the case of regression, each rule is characterized by a statistical model of the form y = f_i(x) + ε_i, where f_i(x) is a fixed linear function of the input vector, and ε_i is a random variable. If it is assumed that ε_i is gaussian with a mean of zero and a variance of σ_i², and if it is assumed that the data are independent and identically distributed, then the likelihood of generating the data is proportional to the finite mixture density

    L(\Theta|D) \propto \prod_t \sum_i p(i|x^{(t)}, V) \, p(y^{(t)}|x^{(t)}, U_i, \Phi_i)    (2.1)
               \propto \prod_t \sum_i g_i^{(t)} \frac{1}{\sigma_i} e^{-\frac{1}{2\sigma_i^2} \left( y^{(t)} - \mu_i^{(t)} \right)^2} ,    (2.2)
where µ_i^(t) = f_i(x^(t)), and Θ = [v_1, ..., v_I, U_1, ..., U_I, Φ_1, ..., Φ_I]^T is the matrix of all parameters.

The ME architectures studied here consist of two types of networks: a gating network and a number of expert networks. The gating network models the input-dependent multinomial distribution used to select a rule.
The expert networks model the input-dependent statistical models associated with the different rules. At each time step, all networks receive the vector x as input. The output of expert network i, denoted µ_i, is a linear function of its input. The outputs of the gating network are computed using the "softmax" function (Bridle 1989); specifically, the activation of the ith output unit of the gating network, denoted g_i, is

    g_i = \frac{e^{s_i/\tau}}{\sum_{j=1}^{I} e^{s_j/\tau}} ,    (2.3)
where τ is a temperature parameter, s_i denotes the weighted sum of unit i's inputs, and I denotes the number of expert networks. The output of the architecture as a whole, given by

    \mu = \sum_{i=1}^{I} g_i \mu_i ,    (2.4)
is a convex combination of the experts' outputs. The parameters of the gating and expert networks are adapted so as to maximize the likelihood function given in equation 2.2. This maximization is performed using the expectation-maximization (EM) algorithm (see Jordan and Jacobs 1994 for the EM equations).

3 Bias/Variance Measures

The expected squared error of an estimator on a particular data item may be expressed as the sum of two terms,

    E[(y - \mu)^2 \mid x, D] = E[(y - E[y|x])^2 \mid x, D] + (\mu - E[y|x])^2 ,    (3.1)
where x is the input vector, y is the target output, µ is the estimator's approximation, D is the set of training data used to train the estimator, and E is the expectation operator taken with respect to the probability distribution p(y|x). Because the first term does not depend on the data, only the second term is considered below. The expected value of the second term with respect to the data can also be written as the sum of two terms,

    E_D[(\mu - E[y|x])^2] = (E_D[\mu] - E[y|x])^2 + E_D[(\mu - E_D[\mu])^2] ,    (3.2)
where E_D is the expectation operator taken with respect to the data. The first term is the square of the bias of an estimator, and the second term is the estimator's variance.
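Equation 3.2's decomposition can be estimated empirically by retraining an estimator on many independently drawn training sets, in the style of Geman et al. (1992). The sketch below is a minimal illustration of that recipe for a deliberately simple estimator; the target function, noise level, and estimator are my choices, not those used in this article.

```python
# Sketch: estimating the squared bias and variance of equation 3.2 for
# a simple estimator (a degree-1 polynomial fit to noisy sine data).
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
true = np.sin(2 * np.pi * x_test)

n_sets, n_train, sigma = 100, 50, 0.3
preds = np.empty((n_sets, x_test.size))
for n in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, deg=1)          # the estimator mu(x; D)
    preds[n] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)                # E_D[mu]
bias_sq = np.mean((mean_pred - true) ** 2)    # first term of eq. 3.2
variance = np.mean((preds - mean_pred) ** 2)  # second term of eq. 3.2
print(bias_sq, variance)
```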
ME architectures produce an estimate of a target output by linearly combining the estimates of several experts. Consequently, the bias and variance of an ME architecture can be written in terms of the gating network outputs (the linear coefficients) and the expert network outputs. This leads to an interesting decomposition in the case of the variance of an ME architecture. Based on a result due to Meir (1994), the variance may be expressed as the sum of two terms,

    E_D[(\mu - E_D[\mu])^2] = E_D\left[ \sum_i (g_i\mu_i - E_D[g_i\mu_i])^2 \right] + E_D\left[ \sum_i \sum_{j \ne i} (g_i\mu_i - E_D[g_i\mu_i])(g_j\mu_j - E_D[g_j\mu_j]) \right] ,    (3.3)
where the first term is the variance of the weighted outputs of the individual experts, and the second term is the covariance of the experts' weighted outputs. Combining equations 3.2 and 3.3, the expected squared difference between the estimate of an ME architecture and the expected value of a regression function is the sum of the architecture's squared bias and the variance and covariance of the experts' weighted outputs. It is possible to gain insight into the performance characteristics of ME architectures by analyzing these systems in terms of these quantities.

4 Simulation Results

This section reports the results of estimating the bias of ME architectures and the variance and covariance of experts' weighted outputs on a regression task. The regression function is

    f(x) = \frac{1}{13}\left[ 10 \sin(\pi x_1 x_2) + 20 \left( x_3 - \frac{1}{2} \right)^2 + 10 x_4 + 5 x_5 - 1 \right] ,    (4.1)
where x = [x1 , . . . , x5 ] is an input vector whose components lie between zero and one. The value of f (x) lies in the interval [−1, 1]. Target outputs are created by adding noise sampled from a gaussian distribution with a mean of zero and a variance of σ 2 to the function f . This regression task is a linearly scaled version of a task that has been used by other investigators to evaluate statistical estimators (e.g., Friedman 1991; Friedman et al. 1983). Twenty-five training sets were created. Each set consisted of 500 inputoutput patterns in which the components of the input vectors were independently sampled from a uniform distribution over the interval (0, 1). A test set of 1024 input-output patterns was also created. For this set, the input vectors were uniformly spaced in the five-dimensional input space, and the target
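Equation 4.1 and the data-generation procedure described in the next paragraph can be transcribed directly; the sketch below is mine, with illustrative parameter names.

```python
# Sketch: generating training data for the scaled Friedman task of
# equation 4.1 (500 patterns, uniform inputs, gaussian target noise).
import numpy as np

def friedman_scaled(x):
    """Equation 4.1; x has shape (n, 5) with entries in (0, 1)."""
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4] - 1) / 13.0

rng = np.random.default_rng(0)

def make_training_set(n=500, noise_var=0.1):
    x = rng.uniform(0, 1, size=(n, 5))
    y = friedman_scaled(x) + rng.normal(0, np.sqrt(noise_var), size=n)
    return x, y

x_train, y_train = make_training_set()
```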
outputs were not corrupted by noise. Twenty-five simulations of each architecture were conducted. In each simulation, the architecture was trained on a different training set. Simulations lasted for 60 epochs, and architectures were evaluated on the test set at every fifth epoch. The variance parameter associated with each expert network was set to the variance of the noise added to the training data. An architecture was initialized with the same set of small, random weights at the start of each simulation. Consequently, different simulations of an architecture yielded different performances solely due to the use of different training sets. The equations for estimating the quantities of interest are closely related to the equations for estimating the bias and variance of neural networks presented by Geman et al. (1992). The average output of an architecture on pattern t from the test set, denoted µ(t) , is given by µ(t) =
N 1 X µ(t,n) , N n=1
(4.2)
where µ^(t,n) is the output on the nth simulation and N is the number of simulations. Similarly, the average weighted output of expert network i on pattern t, denoted \overline{g_i^{(t)} \mu_i^{(t)}}, is given by

    \overline{g_i^{(t)} \mu_i^{(t)}} = \frac{1}{N} \sum_{n=1}^{N} g_i^{(t,n)} \mu_i^{(t,n)} ,    (4.3)
where g_i^(t,n) is the activation of the ith unit of the gating network on simulation n, and µ_i^(t,n) is the output of expert network i on simulation n. Define the integrated bias, meaning the squared bias averaged over the set of input-output patterns in the test set, to be

    \text{Integrated bias} \equiv \frac{1}{T} \sum_{t=1}^{T} \left| \bar{\mu}^{(t)} - y^{(t)} \right|^2 ,    (4.4)
where y^(t) is the target output on pattern t and T is the total number of patterns. The integrated variance and integrated covariance of the experts' weighted outputs are defined in analogous ways:

    \text{Integrated variance} \equiv \frac{1}{T} \sum_{t=1}^{T} \sum_i \frac{1}{N} \sum_{n=1}^{N} \left( g_i^{(t,n)} \mu_i^{(t,n)} - \overline{g_i^{(t)} \mu_i^{(t)}} \right)^2    (4.5)

    \text{Integrated covariance} \equiv \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{n=1}^{N} \sum_i \sum_{j \ne i} \left( g_i^{(t,n)} \mu_i^{(t,n)} - \overline{g_i^{(t)} \mu_i^{(t)}} \right) \left( g_j^{(t,n)} \mu_j^{(t,n)} - \overline{g_j^{(t)} \mu_j^{(t)}} \right) .    (4.6)
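Assuming arrays of saved simulation outputs, equations 4.2 through 4.6 reduce to a few lines of array arithmetic. The shapes and names below are mine, chosen for illustration:

```python
# Sketch: integrated bias, variance, and covariance (eqs. 4.2-4.6).
# Shapes assumed: mu[n, t], g[n, t, i], mu_i[n, t, i] over N simulations,
# T test patterns, and I experts; y[t] holds the noise-free targets.
import numpy as np

def integrated_metrics(mu, g, mu_i, y):
    weighted = g * mu_i                          # g_i mu_i, shape (N, T, I)
    mean_out = mu.mean(axis=0)                   # average output, eq. 4.2
    mean_weighted = weighted.mean(axis=0)        # eq. 4.3, shape (T, I)
    bias = np.mean((mean_out - y) ** 2)          # eq. 4.4
    dev = weighted - mean_weighted               # deviations, (N, T, I)
    variance = np.mean(np.sum(dev ** 2, axis=2)) # eq. 4.5
    total = np.sum(dev, axis=2) ** 2             # (sum_i dev_i)^2
    covariance = np.mean(total - np.sum(dev ** 2, axis=2))  # eq. 4.6
    return bias, variance, covariance

# Tiny synthetic check: N = 3 simulations, T = 4 patterns, I = 2 experts.
rng = np.random.default_rng(0)
mu_i = rng.normal(size=(3, 4, 2)); g = np.full((3, 4, 2), 0.5)
mu = (g * mu_i).sum(axis=2); y = np.zeros(4)
print(integrated_metrics(mu, g, mu_i, y))
```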
Figure 1: Integrated bias, variance, covariance, and mean-squared error of architectures with 4, 8, or 12 expert networks.
It is also useful to define the integrated mean-squared error (MSE):

    \text{Integrated MSE} \equiv \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{n=1}^{N} \left| \mu^{(t,n)} - y^{(t)} \right|^2 .    (4.7)
Equivalently, the integrated MSE can be expressed as the sum of the integrated bias, variance, and covariance.

The first experiment that we conducted evaluated ME architectures with different numbers of expert networks. Systems with 4, 8, and 12 experts were used. The temperature parameter in the softmax activation function used by the gating network (see equation 2.3) was set to 1, and the variance of the noise added to the target function was set to 0.1. The results are presented in Figure 1. The training time is given by the horizontal axis of each graph in this figure. The integrated bias is given by the vertical axis of the upper-left graph; the integrated variance is given by the vertical axis of the upper-right graph; the lower-left graph gives the integrated covariance; and the lower-right graph gives the integrated MSE. In all four graphs, the solid line gives the results for the architecture with 12 experts; the dense dashed line and the sparse dashed line give the results for the architectures with 8 and 4 experts, respectively (a legend is shown in the upper-left graph).

As expected, the integrated bias declined with training time for all architectures. The 12-expert and 8-expert systems have more computational resources than the 4-expert system and thus achieved smaller bias.
Figure 2: Correlations among the expert networks’ outputs for architectures with 4, 8, and 12 experts at epochs 20 and 60.
The integrated variance for all architectures increased with training time. This is particularly true for the system with 12 experts. In contrast, the covariance for all architectures decreased as training progressed. Because the variance increased significantly faster than the covariance decreased, the systems eventually overfit the training data as evidenced by the integrated MSE. Overfitting was most severe for the 12-expert system.

For our purposes, it is important to note that the integrated covariance became negative during the course of training. Although this covariance is based on the expert networks' weighted outputs (as weighted by the gating network), its negative value suggests that it is likely that the experts' unweighted outputs also became negatively correlated. In order to evaluate this hypothesis, we measured the correlations among these outputs (averaged over the 25 simulations of each architecture) at epoch 20 (prior to overfitting) and at epoch 60 (overfitting has occurred). The results are shown in Figure 2. The three rows of bar graphs correspond to the systems with 4, 8, and 12 expert networks; the two columns correspond to epochs 20 and 60. The vertical axis of each graph gives the value of a correlation. A correlation is represented by a vertical bar with one end point at zero and the other end point at the value of the represented correlation.
Figure 3: Squared biases of the individual expert networks that comprise architectures with 4, 8, and 12 experts at epochs 20 and 60.
The results suggest that the ME training procedure leads to negatively correlated experts. In addition, the correlations become more negative as training proceeds. The negative correlations stem from the fact that each ME system adaptively partitions the input space into regions such that the target function has different properties in each region. Different experts learn the data located in different regions. Because the correlations for the 4-expert system, which has relatively few computational resources, were all negative, it may be said that the ME architecture was efficient in the sense that it adapted the experts so that different experts provided informationally different "basis" functions. Although some of the correlations of the 8- and 12-expert systems were positive in value, the majority of the correlations were negative.

The ME architecture tends to produce negatively correlated experts and thus should be preferable to systems that produce uncorrelated and unbiased experts only if its experts are also unbiased. To evaluate this possibility, we estimated the squared bias of the individual experts. The results are shown in Figure 3. The bar graphs in this figure correspond to the systems with 4, 8, and 12 expert networks evaluated at epochs 20 and 60. The vertical axis of each graph gives the estimated bias of an individual expert, and different bars correspond to different experts. Despite the fact that the bias of an ME architecture was nearly zero, the squared biases of the individual experts comprising the architecture were positive. The biases of experts in larger architectures were generally greater than those of experts in smaller architectures. Furthermore, the expert biases increased with additional training.

The experiments reported here help clarify the relationships between the performance characteristics of ME architectures and those of some systems studied by other researchers. As noted above, several investigators are seeking training methods for achieving unbiased experts whose estimation errors are uncorrelated (Meir 1994; Raviv and Intrator, in press; Tresp and Taniguchi 1995). As just one example, Raviv and Intrator (in press) use a noisy sampling technique (bootstrapping) in order to create sets of independent training samples. Uncorrelated experts are produced by training different experts on different training sets. In contrast, ME architectures and their associated training procedures adopt a very different strategy: they tend to produce experts that are biased and negatively correlated.

The next experiment studied the effects of varying the temperature parameter in the softmax function used by the gating network. Jordan and Jacobs (1994) claimed that an advantage of ME architectures over other tree-structured statistical estimators, such as CART (Breiman et al. 1984) or C4.5 (Quinlan 1993), is that they use soft splits of the input space instead of hard splits, meaning that regions of the input space defined by the architecture can overlap and that data items can lie simultaneously in multiple regions. Estimators that use hard splits are likely to have a larger variance than estimators that use soft splits. In order to verify and quantify this claim, we compared the performance of an ME architecture when the temperature was small (τ = 0.025) to that when the temperature was large (τ = 1.0). The splits of an architecture with a near-zero temperature are more like hard splits, especially during the early stages of training when the gating network's weights are small. The architecture that was used had 8 expert networks. The variance of the noise added to the target function was 0.1.

The results are presented in Figure 4 (this figure has the same format as Fig. 1). The solid line gives the performance of the architecture with a large temperature; the dashed line is for the case of a small temperature. The integrated bias of the large-temperature architecture decreased more slowly than that of the small-temperature architecture during the early stages of training but eventually reached a lower value. Its integrated variance was significantly lower, and its integrated covariance was only moderately higher. Thus, these outcomes are consistent with the claim of Jordan and Jacobs (1994). The MSE of the large-temperature system was smaller. Overall, the results suggest that the system with a low temperature tended to commit early in the training process to a particular partition of the input space.
Figure 4: The integrated bias, variance, covariance, and mean-squared error for architectures when the temperature parameter τ was set to 1.0 or 0.025.
Because the architecture at this stage of training is highly sensitive to the idiosyncrasies of both its training data and its initial weight settings, the selected partitions frequently lead to poor generalization performance.

We have also analyzed architectures when the variance of the noise added to the target function was varied. Large noise (σ² = 0.2), moderate noise (σ² = 0.1), and small noise (σ² = 0.05) conditions were studied. The ME architecture had 8 expert networks. The temperature τ was set to 1. The results are shown in Figure 5. The solid line is for the large noise condition; the dense and sparse dashed lines correspond to the moderate noise and small noise conditions, respectively. As expected, the integrated bias of the architecture decreased more slowly when the noise was relatively large. The architecture's integrated variance grew slowly during early stages of training and more rapidly during later stages under this condition. Similarly, its covariance decreased slowly initially but eventually decreased rapidly in the large noise case. The integrated MSE was smallest for the low noise case. In sum, performance was best when the signal-to-noise ratio of the training data was high and degraded gracefully with decreasing values of this ratio.

Figure 5: The integrated bias, variance, covariance, and mean-squared error for architectures when the variance of the noise added to the target function was 0.2, 0.1, or 0.05.

5 Conclusions

This article has investigated the bias and variance of several ME architectures. It was shown that the variance of an ME architecture can be expressed as the sum of two terms: the first is related to the variances of the experts that comprise the architecture; the second is related to the experts' covariances.
One goal of this article was to study and quantify a number of properties of ME architectures. A second goal was to clarify the relationships between this class of systems and other systems that have recently been proposed. In contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

The fact that ME architectures produce biased experts may be seen as a disadvantage by some readers. My view is that this property is of no consequence so long as the overall ME architecture is unbiased at the end of training. This appears to be the case, as is evidenced in Figures 1, 4, and 5. Nonetheless, it may be tempting to seek ways to modify the architecture or training procedure so as to produce individual experts with less bias. The most obvious possibility is to add hidden units to the expert networks so that these networks could perform nonlinear regressions. Although this practice would reduce the bias of the individual experts, I do not necessarily recommend it. One drawback of adding hidden units to the experts is that the EM algorithm, which is extremely efficient for optimizing likelihood functions, could no longer be easily used during training. A second drawback is that adding hidden units would increase the variances of the individual experts. The statistical community has generally regarded nonzero variance as a more serious problem than nonzero bias, as evidenced
by the community's focus on the overfitting problem. It is often found that the addition of a small amount of bias to an estimator can result in a larger decrease in the estimator's variance.

The analyses presented here help in understanding a number of regularization procedures that have recently been applied to ME architectures in order to ameliorate the problem of overfitting. One way that overfitting can be lessened is through the use of methods that decrease the variances of the experts. Waterhouse et al. (1996) pursued this approach in a Bayesian framework by assigning prior distributions to the weights of each expert. The parameter values for each prior distribution were estimated from the data. This strategy aims to decrease the experts' variances without overly increasing the architecture's bias. Simulation results on two artificial data sets and on a sunspot prediction task showed that the method can be effective in eliminating overfitting. The experts' variances can also be decreased by methods that reduce the number of free parameters in an ME architecture. Jacobs et al. (1996) used Bayesian sampling techniques in order to detect and eliminate unnecessary expert networks. Simulation results on a speech recognition task and a breast cancer classification task showed that the method led to improved generalization performances. Waterhouse and Robinson (1996) proposed an algorithm that both adds and deletes networks from an ME architecture during the course of training. A different approach to ameliorating overfitting is to increase the degree to which experts are negatively correlated. Although we are not aware of any studies that have pursued this approach, simulation results of Jacobs and Kosslyn (1994) suggest that it might be achieved through the use of expert networks that each have a different topology or each receive a different set of input variables.

Acknowledgments

I thank M. Jordan for commenting on an earlier version of this article. This work was conducted while I was a visiting scholar at Stanford University. I am grateful to D. Rumelhart and the members of his research laboratory for many interesting discussions. This work was supported by NIH research grant R29-MH54770.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth International Group, Belmont, CA.
Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures, and Applications, F. Fogelman-Soulie and J. Hérault, eds. Springer-Verlag, New York.
Clemen, R. T., and Winkler, R. L. 1985. Limits for the precision and value of information from dependent sources. Operations Research 33, 427–442.
Friedman, J. H. 1991. Multivariate adaptive regression splines. Annals of Statistics 19, 1–141.
Friedman, J. H., Grosse, E., and Stuetzle, W. 1983. Multidimensional additive spline approximation. SIAM Journal of Scientific Computing 4, 291–301.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58.
Hashem, S. 1993. Optimal linear combinations of neural networks. Tech. Rep. SMS 94-4, School of Industrial Engineering, Purdue University.
Jacobs, R. A., and Jordan, M. I. 1991. A competitive modular connectionist architecture. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Jacobs, R. A., and Jordan, M. I. 1993. Learning piecewise control strategies in a modular neural network architecture. IEEE Transactions on Systems, Man, and Cybernetics 23, 337–345.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Computation 3, 79–87.
Jacobs, R. A., and Kosslyn, S. M. 1994. Encoding shape and spatial relations: The role of receptive field size in coordinating complementary representations. Cognitive Science 18, 361–386.
Jacobs, R. A., Peng, F., and Tanner, M. A. 1996. A Bayesian approach to model selection in hierarchical mixtures-of-experts architectures. Neural Networks. In press.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures-of-experts and the EM algorithm. Neural Computation 6, 181–214.
Jordan, M. I., and Xu, L. 1993. Convergence results for the EM approach to mixtures-of-experts architectures. Tech. Rep. 9303, Department of Brain and Cognitive Sciences, MIT.
McCullagh, P., and Nelder, J. A. 1989. Generalized Linear Models. Chapman and Hall, London.
McLachlan, G. J., and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Meir, R. 1994. Bias, variance, and the combination of estimators: The case of linear least squares. Tech. Rep. 922, Department of Electrical Engineering, Technion, Haifa, Israel.
Nowlan, S. J., and Hinton, G. E. 1991. Evaluation of adaptive mixtures of competing experts. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Peng, F., Jacobs, R. A., and Tanner, M. A. 1996. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, in press.
Perrone, M. P. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Brown University.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Raviv, Y., and Intrator, N. In press. Bootstrapping with noise: An effective regularization technique.
Tresp, V., and Taniguchi, M. 1995. Combining estimators using non-constant weighting functions. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds. MIT Press, Cambridge, MA.
Waterhouse, S. R., MacKay, D., and Robinson, T. 1996. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. MIT Press, Cambridge, MA.
Waterhouse, S. R., and Robinson, A. J. 1994. Classification using hierarchical mixtures of experts. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, J. Vloutzos, J.-N. Hwang, and E. Wilson, eds. IEEE Press, New York.
Waterhouse, S. R., and Robinson, A. J. 1996. Constructive algorithms for hierarchical mixtures of experts. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. MIT Press, Cambridge, MA.
Received September 5, 1995; accepted April 24, 1996.
Communicated by Ramamohan Paturi
The Behavior of Forgetting Learning in Bidirectional Associative Memory

Chi Sing Leung, Lai Wan Chan
Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Forgetting learning is an incremental learning rule in associative memories. With it, recent learning items can be encoded, and old learning items are gradually forgotten. In this article, we analyze the storage behavior of bidirectional associative memory (BAM) under forgetting learning; that is, we ask, "Can the k most recent learning items be stored as fixed points?" Also, we discuss how to choose the forgetting constant in the forgetting learning such that the BAM can correctly store as many of the most recent learning items as possible. Simulation is provided to verify the theoretical analysis.

1 Introduction

Associative memory is a major class of neural networks with a wide range of applications, such as content-addressable memory and pattern recognition (Kohonen 1972; Palm 1980). An important feature of associative memory is its associative nature, that is, the ability to recall a stored item based on partial or noisy information. Associative memory encodes and recalls library patterns or pattern pairs. If it encodes single patterns, it is called autoassociative memory. If it encodes pattern pairs, it is called heteroassociative memory. One form of heteroassociative memory is the bivalent additive bidirectional associative memory (BAM) (Kosko 1988). The BAM is used to store bipolar library pairs (X_h, Y_h), where X_h = (x_{1h}, ..., x_{nh})^T, Y_h = (y_{1h}, ..., y_{ph})^T, h = 1, ..., m, and m is the number of library pairs. The components x_{ih} and y_{jh} are bipolar: either +1 or −1. There are two layers in the BAM: layer F_X and layer F_Y, with n and p neurons, respectively. The connection matrix W, proposed by Kosko, is
\[
W = \sum_{h=1}^{m} Y_h X_h^T . \tag{1.1}
\]
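In code, the Kosko connection matrix is just a sum of outer products. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def kosko_matrix(X, Y):
    """Connection matrix W = sum_h Y_h X_h^T (cf. equation 1.1).
    X: (m, n) array of library inputs; Y: (m, p) array of library outputs."""
    return Y.T @ X   # equivalent to summing the m outer products
```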
The retrieval process is an iterative feedback process that starts with a stimulus vector $X^{(0)}$ in $F_X$:
\[
Y^{(v+1)} = \operatorname{sgn}\left[ W X^{(v)} \right], \qquad
X^{(v+1)} = \operatorname{sgn}\left[ W^T Y^{(v+1)} \right], \tag{1.2}
\]
where sgn is applied componentwise and defined as
\[
\operatorname{sgn}(x) = \begin{cases} +1 & x > 0 \\ -1 & x < 0 \\ \text{state unchanged} & x = 0 . \end{cases}
\]
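As an illustration of this retrieval dynamics, the following minimal numpy sketch (names are illustrative) iterates equation 1.2 until a fixed point is reached:

```python
import numpy as np

def sgn(x, prev):
    """Componentwise sign; components with x == 0 keep their previous state."""
    return np.where(x > 0, 1, np.where(x < 0, -1, prev))

def bam_recall(W, x0, max_iters=100):
    """Iterate Y = sgn(W X), X = sgn(W^T Y) from stimulus x0 (cf. equation 1.2)."""
    x = x0.copy()
    y = np.ones(W.shape[0], dtype=int)   # arbitrary initial state for F_Y
    for _ in range(max_iters):
        y_new = sgn(W @ x, y)
        x_new = sgn(W.T @ y_new, x)
        if np.array_equal(x_new, x) and np.array_equal(y_new, y):
            break                        # (x, y) is a fixed point
        x, y = x_new, y_new
    return x, y
```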
Kosko proved that for any real connection matrix, a fixed point $(X_f, Y_f)$ can be obtained from this iterative process. Ideally, such a fixed point is one of the library pairs. A fixed point has the following properties:
\[
Y_f = \operatorname{sgn}(W X_f) \quad \text{and} \quad X_f = \operatorname{sgn}(W^T Y_f). \tag{1.3}
\]
Hence, a library pair can be retrieved only if it is a fixed point. With the iterative process, the BAM can achieve both heteroassociative and autoassociative data recollections. The final state in F_X represents the autoassociative recall; the final state in F_Y represents the heteroassociative recall. The storage behavior of the BAM under Kosko's encoding method and its variants has been studied by various researchers (Haines and Nielsen 1988; Leung et al. 1995; Leung 1993; Wang et al. 1990).

An advantage of Kosko's encoding method is the ability of incremental learning: encoding new library pairs into the model based on only the current connection matrix. However, with Kosko's encoding method, the BAM can only correctly store up to min(n, p)/(2 log min(n, p)) library pairs. When the total number of library pairs exceeds that value, all library pairs, including the old and new items, may not be stored as fixed points. To avoid this, we may require that the BAM have the ability to forget; that is, the model should be able to create space for the new library pairs. In modeling the forgetting behavior of the BAM, we would like to meet three physiological requirements:

1. Locality. Each weight should be updated based on the values of the corresponding connected neurons.
2. Being incremental. The update rule is based on only the current connection matrix and is independent of which library pairs have been stored.
3. Boundedness. The magnitude of the weights should be bounded.

To achieve these requirements, we introduce a decay factor α_f ∈ (0, 1), called the forgetting constant, into Kosko's original encoding method:
\[
W^{(t)} = \alpha_f W^{(t-1)} + Y_t X_t^T , \tag{1.4}
\]
where W^{(0)} is a zero matrix and (X_t, Y_t) is the new library pair that we want to encode into the BAM. Equation 1.4 represents a simple form of forgetting learning. It is the discrete-time version of the original continuous-time forgetting rule in Kosko (1992). With the decay factor, the forgetting learning can encode the recent library pairs as fixed points with a high chance and can delete some old stored library pairs from the model. Also, with our suggestion on α_f (see Section 3), the magnitude of the weights is upper bounded by O(n/log n) in the worst case and by O(√n) in the probabilistic sense (see Section 4). However, similar to some conventional backpropagation rules, real-valued computations are involved in equation 1.4. The hardware implementation of equation 1.4 is more difficult than that of Kosko's encoding scheme. We will use simulation to investigate the behavior of equation 1.4 under the fixed-point computation machine described in Section 3.

The storage behavior of a Hopfield network under some similar forgetting rules was studied numerically or theoretically by Nadal et al. (1986), Hemmen et al. (1988), and Parisi (1986). They showed that the previous λn library patterns can be stored in a Hopfield network when a small number of errors are allowed in the recalled patterns, where n is the number of neurons in the Hopfield network and λ is a positive constant less than one. (For a brief review, see Hemmen and Kuhn 1991.) The learning within bounds (Hopfield 1982) was proposed for the Hopfield network and studied numerically by Parisi (1986). Its updating rule is
\[
W^{(t)} = \phi\left( W^{(t-1)} + X_t X_t^T \right), \tag{1.5}
\]
where φ(x) = x for |x| < A√n and φ(x) = sgn(x) A√n otherwise. Since we are now discussing the Hopfield model, the second term on the right-hand side of equation 1.5 is the outer product of the library pattern with itself. Let k₀ be the number of most recently stored patterns that are satisfactorily kept in the memory. Parisi (1986) numerically showed that k₀ can be proportional to n when a small number of errors are allowed in the retrieved items. Using the nonrigorous replica method, Hemmen et al. (1988) showed that with a suitable constant A (≈ 0.35) the most recent 0.04n library patterns can be kept in the memory when a small number of errors are allowed in the retrieved items. Since learning within bounds involves only integer-valued weights, its hardware implementation is easier than that of equation 1.4. However, the replica method for averaging over the disorder in the system due to the different possible realizations of nominal patterns is strictly valid only for a temperature T₀ > 0 (Forrest and Wallace 1991). In fact, the "T₀ = 0" in Hemmen et al. (1988) violates this criterion. It should also be noted that for the BAM, learning within bounds can be used only when p = n. To our knowledge, for the BAM the rule for selecting the constant A in the learning within bounds has not been reported for p ≠ n.
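To make the two update rules concrete, here is a minimal numpy sketch (all names are illustrative) of the forgetting rule of equation 1.4 and, for contrast, the learning within bounds of equation 1.5:

```python
import numpy as np

def forgetting_update(W, x, y, alpha_f):
    """Forgetting learning for a BAM (cf. equation 1.4):
    decay the old matrix, then add the outer product of the new pair."""
    return alpha_f * W + np.outer(y, x)

def bounded_update(W, x, A):
    """Learning within bounds for a Hopfield network (cf. equation 1.5):
    add the autocorrelation term, then clip weights at +/- A*sqrt(n)."""
    n = len(x)
    bound = A * np.sqrt(n)
    return np.clip(W + np.outer(x, x), -bound, bound)
```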
In Nadal et al. (1986), the marginalist rule was proposed for the Hopfield network. Its updating rule is
\[
W^{(t)} = W^{(t-1)} + \delta^{\,t-1} X_t X_t^T , \tag{1.6}
\]
where δ is a constant larger than one. The concept of the marginalist rule is that the acquisition intensity for learning the last pattern should be greater than the background intensity. Hence, the last pattern can be stored with a high chance. Nadal et al. (1986) numerically showed that k₀ can also be proportional to n for the marginalist rule when a small number of errors are allowed in the retrieved patterns. However, the rule for choosing δ (such that the number of recent library items being nearly correctly stored is as large as possible) was not given in Nadal et al. (1986). Although equations 1.4 and 1.6 are similar after some mathematical manipulation, their effects on the magnitude of the weights are different. With equation 1.6, the magnitude increases exponentially. With equation 1.4, in contrast, the magnitude is upper bounded by O(n/log n) in the worst case and by O(√n) in the probabilistic sense. Note that equation 1.4 can also be written as the sum of the old matrix and an updating matrix:
\[
W^{(t)} = W^{(t-1)} + \Delta W^{(t)} , \tag{1.7}
\]
where $\Delta W^{(t)} = (\alpha_f - 1) W^{(t-1)} + Y_t X_t^T$.

Instead of using the classical nonrigorous replica method and allowing error in the retrieved pattern, our main goal is to study the storage behavior of the BAM under the simple forgetting rule (cf. equation 1.4) when error is not allowed in the retrieved pair. Also, we discuss the role of choosing the forgetting constant α_f. Under some assumptions, we will prove the following theorem in the next section.

Theorem 1. Under the forgetting learning (cf. equation 1.4), if a BAM is trained with t library pairs and
\[
k < \frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}} , \tag{1.8}
\]
then the probability of the (t − k)th library pair (X_{t−k}, Y_{t−k}) being a fixed point tends to 1 as n → ∞, p → ∞.
By theorem 1, given α_f, we can determine the number of the most recent library pairs that the BAM can correctly store. Hence, the best value of α_f can be determined such that the BAM can correctly store as many of the most recent library pairs as possible.

In the next section, we analyze the storage behavior of the BAM under forgetting learning. We first prove theorem 1; from it, we can determine whether the kth most recent library pair, (X_{t−k}, Y_{t−k}), can be stored as a fixed point. We then discuss how to choose the forgetting constant such that
\[
\frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}}
\]
is as large as possible (see corollary 1). In Section 3, simulations are given to verify the analysis. Section 4 addresses the magnitude of the weights in deterministic and probabilistic senses. A brief conclusion comprises Section 5.

2 Properties of Forgetting Learning

The following assumptions and notations are used:

• The dimensions n and p are large. Also, p = rn, where r is a positive constant.
• Each component of the library pairs (X_h, Y_h) is a ±1 equiprobable independent random variable.
• EU_{j,t−k} is the event that the jth component of sgn(W^{(t)} X_{t−k}) is equal to the jth component of Y_{t−k}; $\overline{EU}_{j,t-k}$ is the complement of EU_{j,t−k}.
• EV_{i,t−k} is the event that the ith component of sgn(W^{(t)T} Y_{t−k}) is equal to the ith component of X_{t−k}; $\overline{EV}_{i,t-k}$ is the complement of EV_{i,t−k}.

Lemma 1 (Chebyshev's inequality). For any random variable χ and u ≥ 0,
\[
\operatorname{Prob}(\chi \ge u) \le \inf_{\tau \ge 0} e^{-\tau u}\, E\!\left(e^{\tau \chi}\right).
\]

Lemma 2. Let $v_1, v_2, \ldots, v_N$ be independent ±1 equiprobable random variables and $S_N = \sum_{i=1}^{N} v_i$. Then
\[
E\!\left[\exp\!\left\{\tau \frac{S_N}{\sqrt{N}}\right\}\right] \le E\!\left[\exp\!\left\{\tau \hat{S}\right\}\right] = \exp\!\left\{\frac{\tau^2}{2}\right\}
\]
for τ > 0, where Ŝ is a standard normal variable. In general,
\[
E\!\left[\exp\!\left\{\tau \beta \frac{S_N}{\sqrt{N}}\right\}\right] \le \exp\!\left\{\frac{\beta^2 \tau^2}{2}\right\},
\]
where β is a real number.

Lemma 3. The probability $\operatorname{Prob}(\overline{EU}_{j,t-k})$ is less than
\[
\exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, n}{2}\right\}
\]
for j = 1, ..., p.

Proof of Lemma 3. Without loss of generality, we consider that the library pair (X_{t−k}, Y_{t−k}) has all positive components; that is, X_{t−k} = (1, ..., 1)^T and Y_{t−k} = (1, ..., 1)^T. Then,
\[
\operatorname{Prob}(\overline{EU}_{j,t-k}) = \operatorname{Prob}\!\left( \sum_{h=1,\, h \ne t-k}^{t} \alpha_f^{t-h}\, \frac{S_{j,h}}{\sqrt{n}} > \alpha_f^{k} \sqrt{n} \right), \tag{2.1}
\]
where $S_{j,h} = y_{jh} \sum_{i=1}^{n} x_{ih}$. From lemma 2,
\[
E\!\left[\exp\!\left\{\tau \alpha_f^{t-h}\, \frac{S_{j,h}}{\sqrt{n}}\right\}\right] \le \exp\!\left\{\frac{\alpha_f^{2t-2h} \tau^2}{2}\right\}.
\]
By Chebyshev's inequality (cf. lemma 1),
\[
\operatorname{Prob}(\overline{EU}_{j,t-k})
\le \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\sum_{h=1,\, h \ne t-k}^{t} \alpha_f^{2t-2h}\, \tau^2}{2} \right\}
\le \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\sum_{h'=0}^{\infty} \alpha_f^{2h'}\, \tau^2}{2} \right\}
= \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\tau^2}{2(1-\alpha_f^2)} \right\},
\]
where $u = \alpha_f^k \sqrt{n}$. Clearly, the right-hand side is minimized at $\tau = u(1-\alpha_f^2)$. With this value of τ,
\[
\operatorname{Prob}(\overline{EU}_{j,t-k}) \le \exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, n}{2}\right\}
\]
for j = 1, ..., p.

Lemma 4. The probability $\operatorname{Prob}(\overline{EV}_{i,t-k})$ is less than
\[
\exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, p}{2}\right\}
\]
for i = 1, ..., n.

Proof of Lemma 4. The proof is similar to that of lemma 3.

With lemmas 3 and 4, theorem 1 can be proved in the following way. We denote the probability that (X_{t−k}, Y_{t−k}) is a fixed point as P*. Then,
\[
P^* = \operatorname{Prob}\left( EU_{1,t-k} \cap \cdots \cap EU_{p,t-k} \cap EV_{1,t-k} \cap \cdots \cap EV_{n,t-k} \right)
= 1 - \operatorname{Prob}\left( \overline{EU}_{1,t-k} \cup \cdots \cup \overline{EU}_{p,t-k} \cup \overline{EV}_{1,t-k} \cup \cdots \cup \overline{EV}_{n,t-k} \right)
\ge 1 - p \operatorname{Prob}\left( \overline{EU}_{1,t-k} \right) - n \operatorname{Prob}\left( \overline{EV}_{1,t-k} \right). \tag{2.2}
\]
One should be aware that
\[
P^* \ne \left( 1 - \operatorname{Prob}(\overline{EU}_{1,t-k}) \right)^p \left( 1 - \operatorname{Prob}(\overline{EV}_{1,t-k}) \right)^n \tag{2.3}
\]
because the events EU_{j,t−k} and EV_{i,t−k} are not mutually independent. With lemmas 3 and 4, it is easy to see that under the following condition the right-hand side of equation 2.2 tends to one as n tends to ∞ (and p → ∞, since p = rn with r a constant):
\[
k < \min\left\{ \frac{\log \frac{(1-\alpha_f^2)\, n}{2 \log n}}{2 \log \frac{1}{\alpha_f}},\; \frac{\log \frac{(1-\alpha_f^2)\, p}{2 \log p}}{2 \log \frac{1}{\alpha_f}} \right\}
= \frac{\log \frac{(1-\alpha_f^2)\, \min(n,p)}{2 \log \min(n,p)}}{2 \log \frac{1}{\alpha_f}} .
\]
Thus, theorem 1 is obtained. From theorem 1, the kth most recent library pair, (X_{t−k}, Y_{t−k}), can be stored as a fixed point with high probability if
\[
k < \frac{\log \frac{(1-\alpha_f^2)\, \min(n,p)}{2 \log \min(n,p)}}{2 \log \frac{1}{\alpha_f}} .
\]
Figure 1: Typical plot of f(α_f) for α_f ∈ (0, 1) and min(n, p) = 32.
Given p and n, let
\[
f(\alpha_f) = \frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}} . \tag{2.4}
\]
The most interesting point is how to choose the value of α_f ∈ (0, 1) such that f(α_f) is maximal. We denote the value of α_f at which f(α_f) is maximal as α_{f,max}. Clearly, f(α_f) is a continuous function of α_f ∈ (0, 1). Also,
\[
\frac{d f(\alpha_f)}{d \alpha_f}\bigg|_{\alpha_f = 0^+} > 0
\quad\text{and}\quad
\frac{d f(\alpha_f)}{d \alpha_f}\bigg|_{\alpha_f = 1^-} < 0 .
\]
Hence, f(α_f) has at least one local maximum for 0 < α_f < 1. A typical plot of f(α_f) is shown in Figure 1. Using a simple numerical method, we obtain Table 1, which summarizes the values of α_{f,max} at different values of min(n, p).
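As a sketch of such a numerical method (a plain grid search; names are illustrative, and natural logarithms are assumed throughout), the entries of Table 1 below can be reproduced as follows:

```python
import numpy as np

def f(alpha, m):
    # f(alpha_f) from equation 2.4, with m = min(n, p)
    return np.log((1 - alpha**2) * m / (2 * np.log(m))) / (2 * np.log(1 / alpha))

def alpha_f_max(m, num=200_000):
    grid = np.linspace(1e-4, 1 - 1e-7, num)
    vals = f(grid, m)
    i = np.argmax(vals)
    return grid[i], vals[i]

for m in [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
    a, fmax = alpha_f_max(m)
    print(f"min(n,p) = {m:5d}   alpha_f,max = {a:.4f}   f(alpha_f,max) = {fmax:.4f}")
```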
Table 1: Summary of α_{f,max} and f(α_{f,max}) Values, Found from Numerical Method.

min(n, p)   α_{f,max}   f(α_{f,max})
16          0.613       0.6012
32          0.742       1.223
64          0.837       2.3453
128         0.902       4.3611
256         0.943       7.9967
512         0.9675      14.599
1024        0.982       26.673
2048        0.9900      48.906
4096        0.9945      90.079
8192        0.9970      166.721
Table 2: Summary of α_{f,max} and f(α_{f,max}) Values, Found from Equation 2.5.

min(n, p)   α_{f,max}   f(α_{f,max})
16          0.2407      0.35
32          0.6412      1.13
64          0.8042      2.29
128         0.8910      4.33
256         0.9393      7.98
512         0.9663      14.59
1024        0.9814      26.67
2048        0.9898      48.91
4096        0.9945      90.08
8192        0.9970      166.72
From Table 1, as min(n, p) increases, α_{f,max} increases. Also, for large min(n, p), α_{f,max} is nearly equal to one. Based on this phenomenon, we can obtain an approximation of α_{f,max}:
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,} . \tag{2.5}
\]
Equation 2.5 gives us a guideline for choosing the value of α_f. Table 2 shows α_{f,max} at different values of min(n, p) based on equation 2.5. From the two tables, equation 2.5 is a good approximation of α_{f,max} when min(n, p) is large. Substituting equation 2.5 into theorem 1, we can obtain the following corollary.
Corollary 1. Under the forgetting learning with
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,},
\]
if a BAM is trained with t library pairs and
\[
k < \frac{\min(n,p)}{2e\,\log\min(n,p)} , \tag{2.6}
\]
then the probability of the (t − k)th library pair (X_{t−k}, Y_{t−k}) being a fixed point tends to one, as n → ∞, p → ∞.

From corollary 1, the storage ability of the forgetting learning is similar to that of Kosko's encoding method. With Kosko's encoding method, the BAM can encode only up to min(n, p)/(2 log min(n, p)) library pairs; further encoding of new library pairs will damage the entire system. The forgetting learning, on the other hand, can encode any number of library pairs and always keeps the most recent min(n, p)/(2e log min(n, p)) library pairs in the model. Hence, the BAM with the forgetting learning is similar to short-term memory, which always extracts the most recent information from the environment.

3 Simulation

3.1 Optimal Forgetting Constant. We have carried out a simulation to verify equation 2.5 and corollary 1. The dimensions are 512. We randomly generate the library pairs and use the forgetting rule to encode them with different forgetting constants. Figure 2 shows the percentage of runs in which the kth most recent library pair is successfully stored. Note that k = 0 means the current library pair. Suppose that we use 90% as the threshold. From Figure 2, if the value of α_f is too large (such as 0.990), all library pairs (both old and new items) may not be stored as fixed points. When we use a very small value of α_f (such as 0.940), the number of the most recent items being stored as fixed points becomes small. In general, the case of α_f = 0.9663 (based on Table 2) is better than the other cases. Also, there is a sharp drop for k > 14 when α_f = 0.9663. This result is consistent with our theoretical work presented in Tables 1 and 2.

Figure 2: The percentage of the kth most recent library pair being successfully stored. Here, error is not allowed in the retrieved items and n = p = 512. The forgetting rule is equation 1.4, and the forgetting constant is varied.

3.2 Comparison with Learning Within Bounds. Additionally, we have carried out another simulation to compare the performance of equation 1.4 with that of the learning within bounds (Hopfield 1982). Note that for the BAM, learning within bounds can be used only when p = n. To our knowledge, for the BAM the rule for selecting the constant A in the learning within bounds has not been reported for p ≠ n. The dimensions are 512. For equation 1.4, the forgetting constant is based on equation 2.5. For learning within bounds, the constant A is 0.35, which is the optimal value in Hemmen et al. (1988).

Figure 3 shows the percentage of the kth most recent library pair being successfully stored when error is not allowed in the retrieved items. For equation 1.4, the BAM can store the recent 14 library pairs. However, for learning within bounds, the BAM can store only the recent 5 library pairs. Hence, the performance of equation 1.4 is much better than that of learning within bounds when error is not allowed in the retrieved items. Figure 4 shows the percentage of the kth most recent library pair being successfully stored when 5% errors are allowed in the retrieved items. For equation 1.4, the BAM can store the recent 24 library pairs. For learning within bounds, the BAM can store the recent 27 library pairs. Hence, the performance of learning within bounds is only a little better than that of equation 1.4 when a small number of errors are allowed in the retrieved items.
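A minimal sketch of this storage experiment (reusing the forgetting update from the earlier sketch; names and the success criterion are illustrative) might look as follows:

```python
import numpy as np

def storage_profile(n=512, p=512, t=100, alpha_f=0.9663, max_k=30, runs=20):
    """Fraction of runs in which the kth most recent pair is a fixed point
    after t pairs have been encoded with the forgetting rule (equation 1.4)."""
    hits = np.zeros(max_k + 1)
    for _ in range(runs):
        X = np.random.choice([-1, 1], size=(t, n))
        Y = np.random.choice([-1, 1], size=(t, p))
        W = np.zeros((p, n))
        for h in range(t):
            W = alpha_f * W + np.outer(Y[h], X[h])   # forgetting update
        for k in range(max_k + 1):
            x, y = X[t - 1 - k], Y[t - 1 - k]
            # fixed-point test (cf. equation 1.3); note np.sign(0) == 0,
            # which counts as a failure here rather than "state unchanged"
            ok = np.array_equal(np.sign(W @ x), y) and \
                 np.array_equal(np.sign(W.T @ y), x)
            hits[k] += ok
    return hits / runs
```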
Figure 3: The percentage of the kth most recent library pair being successfully stored. Here, error is not allowed in the retrieved items and n = p = 512. Two forgetting rules are used: equation 1.4 and learning within bounds. The forgetting constant α_f in equation 1.4 is based on equation 2.5. The constant A in the learning within bounds is 0.35, which is the optimal value suggested in Hemmen et al. (1988). In this simulation, the performance of equation 1.4 is much better than that of the learning within bounds when error is not allowed in the retrieved items.
Figure 4: The percentage of the kth most recent library pair being successfully stored. Here, 5% errors in the retrieved items are allowed and n = p = 512. Two forgetting rules are used: equation 1.4 and learning within bounds. The forgetting constant α_f in equation 1.4 is based on equation 2.5. The constant A in the learning within bounds is 0.35, which is the optimal value suggested in Hemmen et al. (1988). In this simulation, the performance of the learning within bounds is only a little better than that of equation 1.4 when a small number of errors are allowed in the retrieved items.
Figure 5: The percentage of the kth most recent library pair being successfully stored when equation 1.4 is used. The precision is ±19.999.
3.2.1 Behavior Under a Fixed-Point-Number Machine. The real-valued computation in equation 1.4 creates some difficulties in the implementation. Here, we investigate the behavior of equation 1.4 when the precision of the weights, as well as the precision of the computation, is limited. Three precision schemes are considered. The dimensions are 512. We randomly generate the library pairs and use the forgetting rule under limited precision to encode them.

In the first scheme, the precision is ±19.999 (a 4½-digit machine). Due to the limited precision, α_f is set to 0.966. This means that all computations (the intermediate and final results) in equation 1.4 are rounded to a fixed-point number with three decimal places between ±19.999. The simulation result is shown in Figure 5. From Figure 5, the performance under this precision is quite similar to that under floating-point precision (see Fig. 2 also). The second precision is ±19.99 (a 3½-digit machine), and α_f is set to 0.96. The simulation result is shown in Figure 6. From Figure 6, the BAM can still store 14 recent library pairs when error is not allowed in the retrieved items. When 5% errors are allowed, it can store only about 22 recent library pairs. The third precision is ±9.9 (a 2-digit machine), and α_f is set to 0.9. The simulation result is shown in Figure 7. From Figure 7, the performance of the BAM is greatly reduced, and it can store only about 10 recent library pairs. The main cause of this reduction is the incorrect α_f value: from theorem 1, when α_f is set to 0.9, the BAM can store about 10 recent library pairs.
Figure 6: The percentage of the kth most recent library pair being successfully stored. Here, equation 1.4 is used. The precision is ±19.99.
Figure 7: The percentage of the kth most recent library pair being successfully stored. Here, equation 1.4 is used. The precision is ±9.9.
To sum up, when sufficient precision is used (for n = p = 512, about 3 digits), corollary 1 still holds. When the precision limits the value of α_f, we can use theorem 1 to predict the performance.
4 Magnitude of the Weights

4.1 The Worst Case Bound. Let w_{t,ji} be the weight between the jth neuron in F_Y and the ith neuron in F_X. After t library pairs have been encoded using equation 1.4,
\[
w_{t,ji} = \sum_{h=1}^{t} \alpha_f^{t-h}\, x_{h,i}\, y_{h,j} . \tag{4.1}
\]
In the worst case, x_{h,i} = y_{h,j} for all h (or x_{h,i} = −y_{h,j} for all h),
\[
|w_{t,ji}| = \sum_{h=1}^{t} \alpha_f^{t-h} \le \frac{1}{1-\alpha_f} . \tag{4.2}
\]
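As a quick numerical sanity check of this geometric-series bound (an illustrative sketch; the parameter values are arbitrary):

```python
alpha_f = 0.9663
t = 2000
# worst case: x_{h,i} = y_{h,j} for all h, so every term is +alpha_f**(t-h)
w_worst = sum(alpha_f ** (t - h) for h in range(1, t + 1))
print(w_worst, "<=", 1 / (1 - alpha_f))   # bound of equation 4.2
```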
Clearly, when α_f is a constant, the magnitude of w_{t,ji} is upper bounded by O(1). If α_f is selected based on equation 2.5, then
\[
|w_{t,ji}| \le \frac{1}{1 - \sqrt{1 - \exp\left\{-\log\frac{\min(n,p)}{2\log\min(n,p)} + 1\right\}}} \tag{4.3}
\]
\[
= \frac{1}{\frac{1}{2}\,\frac{2e\log\min(n,p)}{\min(n,p)} + \frac{1}{8}\left(\frac{2e\log\min(n,p)}{\min(n,p)}\right)^2 + \text{higher-order terms}} . \tag{4.4}
\]
Hence, for large n and p (note that p = rn),
\[
|w_{t,ji}| \le O\!\left(\frac{n}{\log n}\right). \tag{4.5}
\]
4.2 The Probabilistic Bound. If x_{h,i} and y_{h,j} are ±1 equiprobable independent random variables, w_{t,ji} (based on equation 1.4) can be expressed as
\[
w_{t,ji} = \sum_{h=1}^{t} \alpha_f^{t-h}\, z_h , \tag{4.6}
\]
where the z_h's are ±1 equiprobable independent random variables. Clearly, the distribution of w_{t,ji} is symmetric. Let η be a standard normal random variable with variance one. It is not difficult to show that E(exp{τ z_h}) < E(exp{τ η}). With lemma 1,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\}
\le 2 \inf_{\tau \ge 0} \exp\left\{-\tau A\sqrt{\min(n,p)}\right\} \exp\left\{ \frac{\sum_{h=1}^{t} \alpha_f^{2t-2h}\, \tau^2}{2} \right\}
\le 2 \inf_{\tau \ge 0} \exp\left\{-\tau A\sqrt{\min(n,p)}\right\} \exp\left\{ \frac{\tau^2}{2(1-\alpha_f^2)} \right\},
\]
where A is a positive constant. The right-hand side is minimized at
\[
\tau = A\sqrt{\min(n,p)}\,(1-\alpha_f^2) .
\]
With this value of τ,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\} \le 2 \exp\left\{ -\frac{(1-\alpha_f^2)\, A^2 \min(n,p)}{2} \right\}. \tag{4.7}
\]
When α_f is a constant, the probability Prob{|w_{t,ji}| > A√min(n, p)} tends to zero exponentially as n → ∞. If α_f is selected based on equation 2.5,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\} \le 2 \exp\left\{ -e A^2 \log\min(n,p) \right\}. \tag{4.8}
\]
It also tends to zero as n → ∞. To sum up, the magnitude of the weights is upper bounded by O(√min(n, p)) = O(√n) in the probabilistic sense when equation 1.4 is used.

5 Conclusion

We have examined the statistical storage behavior of the BAM under the forgetting learning. Also, we have derived the formula for choosing the forgetting constant:
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,} .
\]
With this value, the forgetting learning rule can encode any number of library pairs and always keeps the most recent min(n, p)/(2e log min(n, p)) library pairs in the BAM. Computer simulations have been carried out to verify our theoretical work. The magnitude of the weights was also discussed.
The results presented here can be extended to a Hopfield network. By adopting the approach we set out, we can easily obtain the corresponding result for a Hopfield network by replacing min(n, p) with n in all of the equations in this article.

Acknowledgments

We are very thankful to the reviewers for their valuable suggestions for improving this article.

References

Forrest, B. M., and Wallace, D. J. 1991. Storage capacity and learning in Ising-spin neural networks. In Models of Neural Networks, E. Domany, J. L. van Hemmen, and K. Schulten, eds. Springer-Verlag, Berlin.
Haines, K., and Nielsen, R. H. 1988. A BAM with increased information storage capacity. Proc. of the 1988 IEEE Int. Conf. on Neural Networks, 181–190.
Hemmen, J. L., Keller, G., and Kuhn, R. 1988. Forgetful memories. Europhys. Lett. 5, 663–668.
Hemmen, J. L., and Kuhn, R. 1991. Collective phenomena in neural networks. In Models of Neural Networks. Springer-Verlag, Berlin.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79, 2554–2558.
Kohonen, T. 1972. Correlation matrix memories. IEEE Trans. Comput. 21, 353–359.
Kosko, B. 1988. Bidirectional associative memories. IEEE Trans. on Syst., Man, and Cybern. 18, 49–60.
Kosko, B. 1992. Neural Networks and Fuzzy Systems. Prentice Hall, Englewood Cliffs, NJ.
Leung, C. S. 1993. Encoding method for bidirectional associative memory using projection on convex sets. IEEE Trans. on Neural Networks 4, 879–881.
Leung, C. S., Chan, L. W., and Lai, E. 1995. Stability, capacity and statistical dynamics of second order bidirectional associative memory. IEEE Trans. Syst., Man, and Cybern. 25, 1414–1424.
Nadal, J. P., Toulouse, G., Changeux, J. P., and Dehaene, S. 1986. Networks of formal neurons and memory palimpsests. Europhys. Lett. 1, 535–542.
Palm, G. 1980. On associative memory. Biolog. Cybern. 36, 19–31.
Parisi, G. 1986. A memory which forgets. J. Phys. A: Math. Gen. 19, L617–L620.
Wang, Y. F., Cruz, J. B., and Mulligan, J. H. 1990. Two coding strategies for bidirectional associative memory. IEEE Trans. on Neural Networks 1, 81–92.
Received May 30, 1995; accepted April 25, 1996.
Communicated by W. Thomas Miller
Adaptive Encoding Strongly Improves Function Approximation with CMAC

Martin Eldracher, Alexander Staller, René Pompl
The Boston Consulting Group, Eine Gesellschaft Buergerlichen Rechts, Sendlinger Strasse 7, 80331 Muenchen, Germany
The Cerebellar Model Arithmetic Computer (CMAC) (Albus 1981) is well known as a good function approximator with local generalization abilities. Depending on the smoothness of the function to be approximated, the resolution, as the smallest distinguishable part of the input domain, plays a crucial role. If the binary quantizing functions in CMAC are dropped in favor of more general, continuous-valued functions, much better results in function approximation for smooth functions are obtained in shorter training time with less memory consumption. For functions with discontinuities, we obtain a further improvement by adapting the continuous encoding proposed in Eldracher and Geiger (1994) for difficult-to-approximate areas. Based on the already far better function approximation capability on continuous functions with a fixed topologically distributed encoding scheme in CMAC (Eldracher et al. 1994), we present the better results in learning a two-valued function with discontinuity using this adaptive topologically distributed encoding scheme in CMAC.

1 Introduction

Hard-limiting coarse coding elements in CMAC (Albus 1981) cause artificial discontinuities in the stored function and its partial derivatives. To avoid this, An et al. (1991) propose so-called superspheres instead of binary units but do not present any results. Triangle-shaped activation functions (Nie and Linkens 1993) leave no discontinuities except undefined partial derivatives. For moderate memory consumption and quick initial learning with large receptive fields, while keeping a reasonable resolution (i.e., a reasonable size of memory), Mischo (1992) scales memory access and learning with a function of the distance of the input vector to the receptive field of each weight. Using a gaussian function, this in principle results in gaussian-shaped receptive fields (other functions are possible). Eldracher et al. (1994) proposed a direct implementation of gaussian-shaped receptive fields, using a one-dimensional topologically distributed encoding (TDE) in CMAC
(Eldracher and Geiger 1994). Compared to binary CMAC, this yields much better results on smooth functions (Eldracher et al. 1994). However, there remain problems with discontinuities: the larger the derivatives of the function to approximate, the larger the difficulties (given a fixed number of receptive fields). These difficulties could be remedied by introducing more encoding units, but computation time and memory consumption would increase, too. Therefore we keep the same number of encoding units but adapt their positions in order to concentrate units in difficult areas, that is, areas with still higher-than-average output errors.

2 General Model of CMAC

The following description of CMAC is a generalization of Albus (1981) and Tolle and Ersü (1992). Details can be found in Appendix A. A quantizing interval in Albus (1981) can be seen as a binary-valued activation function act_{ij} of a mossy fiber j in input dimension i. A mossy fiber is positioned at µ_{ij}, the middle of its quantizing interval. The distance Δµ between the positions µ_{ij} and µ_{i,j+1} of two adjacent mossy fibers defines CMAC's resolution ε. In binary CMAC, ρ mossy fibers are always activated for each input sample.

Each dimension of the input should be encoded in a topologically distributed manner. Therefore each binary activation function is replaced by a normalized, gaussian-shaped activation function with mean µ_{ij}. The variance σ_{ij} defines the overlap between neighboring mossy fibers. For a local generalization, the activation function act_{ij} is defined to be zero if it is smaller than a constant a_min. In order to get the highest possible encoding accuracy, one must choose σ = 2^k ε/√2 with k = 2, 3, ... (Kinder and Brauer 1993). This is due to the fact that the encoding of a number by a specific active mossy fiber is better the steeper the flank of the gaussian (i.e., the larger the difference in the mossy fiber's activation for a certain small difference in the input). With the cited σ, the gaussians of neighboring mossy fibers intersect so that some mossy fiber is always maximally steep (i.e., close to its maximum gradient). However, preliminary experiments showed that it is not necessary to choose this theoretically optimal σ, since the coding accuracy is by far sufficient, and basically σ only determines the overlap of the receptive fields of neighboring mossy fibers (i.e., the number of simultaneously active granule cells, or the generalization). Now ρ serves only to determine the number and positions of mossy fibers, given the encoding interval [0, s_max[ and resolution ε.

Taking into account the modified mossy fibers, a granule cell v ∈ A is defined as in the original model. To generate a granule cell v, mossy fibers with indices congruent modulo ρ are combined such that each v is connected to exactly one mossy fiber from each input dimension. v is named uniquely by the tuple of the numbers j_i of the mossy fibers connected to v.
The position µ_v of a granule cell v = (j_1, ..., j_n) ∈ A is defined as the n-tuple of the positions of its mossy fibers: µ_v := (µ_{1j_1}, ..., µ_{nj_n}). The activation function a_v of a granule cell v is the product of its mossy fibers' activation functions, an n-dimensional gaussian. Each granule cell v ∈ A is provided with a weight w_v ∈ R. Using only trained weights, the output p is computed as the mean of the products of the granule cells' activities and their weights. For learning we generalize a method from Tolle and Ersü (1992): given the target output p*, untrained weights are first set to p*; then the remaining error is evenly distributed over all weights, which are trained using the delta rule with learning rate η (0 < η ≤ 1).
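The forward pass just described is compact enough to sketch directly. The following minimal numpy sketch (names, shapes, and the exact normalization are illustrative assumptions, not the authors' implementation) computes granule cell activations and the output for a CMAC with gaussian receptive fields:

```python
import numpy as np

def granule_activations(s, mu, sigma, a_min):
    """Activations of all granule cells for input s.

    s     : input vector, shape (n,)
    mu    : granule cell positions, shape (num_cells, n)
    sigma : receptive field widths, shape (num_cells, n)
    a_min : activation cutoff for local generalization
    """
    # product of per-dimension gaussians = n-dimensional gaussian
    act = np.prod(np.exp(-((s - mu) ** 2) / (2 * sigma ** 2)), axis=1)
    act[act < a_min] = 0.0              # enforce local generalization
    return act

def cmac_output(s, mu, sigma, a_min, w):
    """Output as the mean of weighted activities of the active granule cells."""
    act = granule_activations(s, mu, sigma, a_min)
    active = act > 0
    if not active.any():
        return 0.0
    return np.mean(act[active] * w[active])
```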
3 Adaptive Topologically Distributed Encoding

The aim of adaptive topologically distributed encoding (ATDE) is to use fewer encoding units (i.e., mossy fibers for ATDE in CMAC) in order to use less memory and get quicker architectures with better classification ratios (Eldracher and Geiger 1994). This is achieved by moving mossy fibers into those parts of their input dimension where they are really needed, instead of distributing them uniformly. As the positions µ_{ij} of the mossy fibers change during the adaptation process, σ and ε = Δµ are no longer equal for all encoding units. As the left and right neighbors of a specific mossy fiber may have different distances Δµ, two different σ are used for each mossy fiber. After moving a mossy fiber, all σ_{ij} are adapted in order to keep the highest possible encoding accuracy (see Section 2).

Whenever an input pattern is presented to the network, the center of the most activated encoding unit bmu (best matching unit) in each input dimension is moved a bit closer to this data point. The topological neighbors are moved in the direction of the input sample, too. Afterward, the widths σ_{ij} have to be adapted in all input dimensions. The adaptation of the positions µ_{ij} of the mossy fibers is based on the distance of the training pattern to µ_{ij} and on the output error. This implies that encoding units concentrate around patterns that caused high output errors during training. This yields smaller distances between the mossy fibers in these areas and therefore steeper flanks of the corresponding gaussians, as the σ depend on Δµ. As a result, a higher accuracy can be achieved in these areas, as well as the distinction of input patterns with totally different output values (e.g., at discontinuities).

The displacement of the bmu is scaled with a fixed learning rate η_bmu. For the neighbors, we scale η_bmu with a function h(i), depending on the topological order over the mossy fibers. A mossy fiber has the number i if it is the ith mossy fiber on this input dimension with respect to the center's µ. If the best matching unit is numbered bmu, we define the (discrete)
neighborhood function h(i) as
\[
h(i) = \frac{1}{|i - bmu| + 1} .
\]
Note that 0 < h(i) ≤ 1 (in particular, h(bmu) = 1), and h(i) is symmetrically decreasing around bmu. h(i) does not depend on the actual distance between the center of a mossy fiber and the training input. In order to achieve convergence in the mossy fiber placement, all learning rates are diminished exponentially in time with cooling factor δ (0 < δ < 1). Starting with δ⁰ = 1 at epoch 1 (one training epoch equals a time step), the mossy fibers' movement decreases exponentially every epoch by multiplication with δ^(epoch−1). Furthermore, we have to scale the mossy fibers' movement in order to prevent single mossy fibers from overtaking each other, which would destroy the topological encoding. We do this by dividing the size of the movement by the maximum output error of CMAC, the difference of the smallest and largest output value, here s_max = 1 (see Section 2 and Appendix A). Altogether, for input s the positions µ of all mossy fibers falling in the N-neighborhood of the best matching unit change according to
\[
\mu^{(\text{new})} := \mu + (s - \mu)\cdot |p - p^*| \cdot h(i) \cdot \delta^{(\text{epoch}-1)} / s_{\max} . \tag{3.1}
\]
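A minimal sketch of this position update for one input dimension (assuming 1-based epochs and the h(i) defined above; names are illustrative):

```python
import numpy as np

def atde_update(mu, s, out_err, bmu, epoch, N=40, delta=0.5, s_max=1.0):
    """Move the best matching unit and its N topological neighbors toward
    input s, scaled by output error and cooling (cf. equation 3.1).

    mu      : sorted mossy fiber positions in one input dimension, shape (M,)
    s       : scalar input component
    out_err : |p - p*| for the current pattern
    bmu     : index of the most activated mossy fiber
    """
    mu = mu.copy()
    cooling = delta ** (epoch - 1)
    lo, hi = max(0, bmu - N), min(len(mu) - 1, bmu + N)
    for i in range(lo, hi + 1):
        h = 1.0 / (abs(i - bmu) + 1)              # neighborhood function h(i)
        mu[i] += (s - mu[i]) * out_err * h * cooling / s_max
    return mu
```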
The learning of the weights remains as in standard CMAC, based on the activations before the positions of the mossy fibers are changed (because the output errors are computed for this mossy fiber distribution). This implies that there will be some artificial approximation errors after changing the mossy fibers' positions. With small learning constants for the mossy fibers' movement, this should not cause any negative effect. In particular, as the changes in the positions of the mossy fibers get smaller and smaller under the cooling factor (see, e.g., Ritter et al. 1991), after some time the mossy fiber distribution remains constant. From this point on, there is no more interplay between weight change and mossy fiber distribution change. Therefore the choice of the cooling factor is not critical: a cooling that is too rapid will result in a mossy fiber distribution close to the initial distribution; a cooling that is too slow might result in an elongated learning process but will never prevent learning.

4 Results

Since we know that there remain some difficulties with TDE in CMAC in approximating discontinuous functions (Eldracher, Staller, and Pompl 1994), we use a discontinuous function EDGE (Tolle and Ersü 1992) to show the capabilities of ATDE in CMAC:
\[
\text{EDGE} : [0,1]^2 \to [0,1], \qquad
\text{EDGE}(s_1, s_2) = \begin{cases} 0.75\, s_1 & \text{if } s_1 < 0.6 \wedge s_2 < 0.5 \\ 1 & \text{else.} \end{cases}
\]
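For reference, the target function is trivial to state in code (a direct transcription of the definition above):

```python
def edge(s1, s2):
    """The EDGE test function on [0, 1]^2 (Tolle and Ersü 1992)."""
    return 0.75 * s1 if (s1 < 0.6 and s2 < 0.5) else 1.0
```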
While EDGE is very easy to approximate with a binary CMAC since it is nearly everywhere constant or linear, EDGE is an example of the worst case
Adaptive Encoding Strongly Improves Function Approximation
407
Table 1: Approximation of the Function EDGE with Binary Activation Functions. %ε
%
2−2 2−3 2 − 4* 2−5
16 16 16 16
ε
2−6 2−7 2−8 2−9
η
0.1 0.5 1.0 1.0
1089 Test Patterns After First Epoch After Tenth Epoch Emean Esq Emax Emean Esq Emax 10.9% 14.2% 43.6% 5.8% 9.9% 51.6% 2.6% 8.6% 68.4% 6.9% 24.2% 100.0%
5.6% 8.9% 43.2% 4.2% 7.6% 48.9% 2.5% 8.0% 71.8% 6.9% 24.2% 100.0%
*The parameters in this row were used in Figure 1.
concerning approximation with TDE. To produce a constant output value using nonlinear, gaussian-shaped mossy fiber activation functions, different weights have to be chosen for each active granule cell depending on the relative position of a granule cell to the training pattern. This is different for a constant activation function as in original CMAC (Albus 1981), where the relative position does not have any influence, as well as for linear triangle-shaped activation functions (Nie and Linkens 1993), where in each dimension the loss of activation in one mossy fiber is exactly compensated by an increase in the neighboring mossy fiber’s activation. Consequently with TDE in CMAC, a weight cannot fit when test patterns different from the training patterns are used. Furthermore, an edge is difficult to model with relatively large, fixed-sized receptive fields. Both effects could be remedied. For the former problem, a higher number of training examples would help to adapt all weights better. For the latter problem, more encoding units could be used. We do not try to do this here, since we want to compare our approaches with exactly the same conditions. The training set consists of 1000 random samples. The test set contains 1089 samples on a regular mesh. For quality measurement, mean absolute error, mean squared error, and maximum absolute error are used. 4.1 Binary Encoding. For the binary encoding we need a smaller resolution ε to get good results. Since we want to use similar sizes %ε for the mossy fibers (and granule cells) in all experiments, we have to use a larger % = 16. However, this yields CMAC storages that use about four times more memory than comparable (A)TDE CMACs. Results with binary CMACs are shown in Table 1 and Figure 1. 4.2 Topologically Distributed Encoding. In order to get a coding-opti7 9 mal σ , we set all σ = 2− 2 , 2− 2 (two local optima of the encoding accuracy, yielding different-sized receptive fields; see Section 2). With values amin =
408
Martin Eldracher, Alexander Staller, and Ren´e Pompl
Figure 1: Binary CMAC: Approximation of EDGE (left) and remaining absolute error (right) for % = 16, ε = 2^-8, and η = 1.0.

Table 2: Approximation of EDGE with TDE in CMAC with η = 0.5.

                                                      1089 Test Patterns
                                              After First Epoch        After Tenth Epoch
  σ        %ε     %   ε      amin    |A*|    Emean   Esq     Emax     Emean   Esq     Emax
  2^-7/2   2^-4   4   2^-6   e^-1      47    5.9%    11.2%   52.7%    5.3%    10.0%   56.9%
                             e^-2      87    6.5%    11.8%   57.2%    6.1%    10.9%   55.8%
                             e^-4     168    6.3%    11.6%   56.4%    5.6%    10.5%   57.3%
           2^-5   4   2^-7   e^-1     184    5.9%    11.2%   53.1%    5.1%     9.8%   58.6%
                             e^-2     339    6.5%    11.8%   57.5%    6.1%    10.9%   56.5%
                             e^-4     651    6.3%    11.6%   56.2%    5.6%    10.5%   56.9%
  2^-9/2   2^-4   4   2^-6   e^-1      12    5.6%     9.8%   55.5%    3.8%     8.8%   60.4%
                             e^-2      23    5.0%    10.0%   55.2%    3.8%     9.1%   56.7%
                             e^-4*     47    4.9%    10.2%   58.3%    3.7%     9.1%   61.2%
           2^-5   4   2^-7   e^-1      48    5.1%     9.4%   53.1%    3.5%     8.5%   59.4%
                             e^-2      90    4.8%     9.9%   54.3%    3.8%     9.1%   55.9%
                             e^-4     184    4.9%    10.1%   55.7%    3.6%     9.1%   60.9%

*The parameters in this row were used in Figure 2.
With values amin = e^-4, e^-2, and e^-1, the circular receptive fields around µv have radii of 2√2·σ, 2σ, and √2·σ, respectively. Results with learning rate η = 0.5 are given in Table 2, which also gives the average number |A*| of active granule cells (for binary activation functions there would be |A*| = % = 4). Figure 2 shows an approximation result.
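The radii quoted above follow directly from the gaussian cutoff amin (cf. equation A.1 in the appendix). A short sketch of the computation, with σ normalized to 1:

```python
import math

def receptive_field_radius(sigma, a_min):
    """Radius at which the gaussian activation falls to the cutoff a_min:
    solve exp(-r**2 / (2 * sigma**2)) = a_min for r."""
    return sigma * math.sqrt(2.0 * math.log(1.0 / a_min))

for a_min in (math.exp(-4), math.exp(-2), math.exp(-1)):
    print(receptive_field_radius(1.0, a_min))
# prints 2.828..., 2.0, and 1.414..., i.e., 2*sqrt(2)*sigma, 2*sigma, sqrt(2)*sigma
```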
Figure 2: CMAC with TDE: Approximation of EDGE (left) and remaining absolute error (right) for σ = 2^-9/2, % = 4, ε = 2^-6, amin = e^-4, and η = 0.5.
4.3 Adaptive Topologically Distributed Encoding. We fix % = 4 for an initial Δµinit = εinit = 2^-6, 2^-7. In order to get an optimal initial σinit, we set σinit = 2^-7/2, 2^-9/2. Furthermore, we choose amin = e^-4, e^-2, e^-1 and δ = 0.5. Because there might be a different optimal number N of neighbors to be adapted together with the best matching unit, some preliminary tests were conducted. N = 40 gives the best results (the total number of mossy fibers for each dimension is 68). It seems sound to use a large N to approximate a discontinuous function, since many mossy fibers have to be concentrated near the discontinuity. A large number of adapted neighbors ensures a more global change of the overall distribution of the mossy fibers, making it easier to get mossy fibers really dense at certain positions. It is therefore not surprising that after adaptation and learning, mean and squared errors are lower for the adaptive continuous activation functions (compare Tables 2 and 3 and Figs. 2 and 3). However, the offline investigation of N was done only for academic reasons; it showed that N can be varied within large ranges without harmful results. If a small N is chosen, the adaptation process is merely slower, so slowing the cooling, and thereby prolonging the mossy fibers' position changes, allows the same results to be reached with a smaller N.

Figure 4 shows the final distribution of the mossy fibers for fixed (left) and adaptive (right) topologically distributed encoding after learning. Each line is drawn from the position of a mossy fiber. Positions of the granule cells lie on the intersection points of lines equal modulo % (cf. Section 2). Because the encoding does not model the cross-correlations between the coordinates directly, the concentration of granule cells also rises in areas where this is not needed to approximate the function. But this small drawback is
Table 3: Approximation of EDGE with ATDE in CMAC, η = 0.5, N = 40, and δ = 0.5.

                                                           1089 Test Patterns
                                                   After First Epoch        After Tenth Epoch
  σinit    %εinit   %   εinit   amin    |A*|    Emean   Esq     Emax     Emean   Esq     Emax
  2^-7/2   2^-4     4   2^-6    e^-1      24    3.2%     7.6%   44.6%    2.7%     7.1%   61.3%
                                e^-2      43    4.4%     9.1%   43.4%    3.5%     8.2%   54.3%
                                e^-4      79    4.8%     9.4%   46.3%    3.5%     8.2%   55.1%
           2^-5     4   2^-7    e^-1     138    4.9%    10.1%   62.6%    3.9%     9.4%   58.7%
                                e^-2     264    5.7%    10.9%   64.9%    4.5%     9.8%   64.5%
                                e^-4     498    5.6%    10.5%   61.1%    4.5%     9.5%   60.9%
  2^-9/2   2^-4     4   2^-6    e^-1       8    6.8%    13.8%   81.7%    2.1%     6.2%   64.4%
                                e^-2      13    4.1%     8.5%   50.4%    1.8%     6.2%   68.7%
                                e^-4*     23    3.2%     7.4%   53.0%    1.9%     6.5%   73.7%
           2^-5     4   2^-7    e^-1      34    5.7%    10.6%   58.2%    2.7%     7.4%   64.8%
                                e^-2      69    4.2%     8.4%   47.7%    2.6%     7.8%   59.4%
                                e^-4     139    4.2%     8.5%   43.6%    2.6%     7.7%   57.9%

*The parameters in this row were used in Figure 3.
Figure 3: CMAC with ATDE: Approximation of EDGE (left) and remaining absolute error (right) for σinit = 2^-9/2, % = 4, εinit = 2^-6, amin = e^-4, η = 0.5, N = 40, and δ = 0.5.
rewarded with very quick algorithms based purely on CMAC. If in some application the training patterns do not cover the whole input space (e.g., when the state space of a process is a diagonal trajectory), this drawback is reduced by implementing hash coding for the granule cells (Tolle and Ersü 1992). Hash-coding memory is allocated only for granule cells that cover areas
Figure 4: Granule cells are positioned at intersection points for fixed (left) and adaptive (right) topologically distributed encoding after training.
with training data from the process's state space. Therefore, even for high-dimensional state spaces, often only some memory for a (possibly very small) hash table is consumed. In the case of a diagonal state space, the adaptive coding scheme basically results in different distances between the granule cells along this diagonal. Similar considerations hold for other state space trajectories. Due to hash coding, the addressing of the granule cells is exactly the same as in the original CMAC. However, possible hardware implementations may become much less efficient when hash coding is used.

5 Discussion

CMAC is a function approximator: no data point is modeled exactly, as would be necessary for function interpolation. Rather, CMAC tries to minimize the overall error over all data, that is, to approximate the function. Even a learning rate of η = 1 will not yield a function interpolation; due to the generalization over close training points, the exact match obtained after training a particular input pattern is altered again by ongoing training. However, CMAC is a local function approximator, not only because the generalization is only local but also because there is no extrapolation beyond the learned granule cells, which have a certain limited area of influence.

Gaussian-shaped, n-dimensional receptive fields for the granule cells are a natural result of one-dimensional gaussian activation functions for the mossy fibers of the single input dimensions. Using TDE (see Appendix B) in CMAC therefore yields well-defined partial derivatives for the approximated function. After ten epochs of training with TDE in CMAC, the errors for smooth, continuous functions such as two-dimensional sine or cosine functions can be reduced to less than 30% of the smallest error with binary CMAC without any parameter fine tuning, while even after the first epoch,
the same result as for the best binary CMAC is reached (Eldracher et al. 1994). Also, the parameter-dependent variations of the error after training (i.e., the variations in the quality of the function approximation for different parameter settings) are much smaller than for binary encoding in CMAC (Eldracher et al. 1994). Furthermore, with TDE in CMAC we can use a smaller % for better results, which yields a linearly smaller number of mossy fibers in each input dimension. (Remember that % determines only the number and positions of the mossy fibers (see Section 2), while the generalization is now determined by the parameter σ. Keeping the size %ε of the granule cells constant while decreasing % demands an increased resolution ε, or generally larger distances Δµ between mossy fibers, thus decreasing the number of mossy fibers.) Considering the exponential growth of the number of granule cells with an increasing number of input dimensions, this effect should not be underestimated. Even considering that for each mossy fiber we now have to store new parameters for the centers µ and variances σ, we save memory, since this linear increase in memory consumption is set against the exponential decrease in the number of weights for the granule cells made possible by the changes in % and ε. The memory savings therefore become even better for (A)TDE CMAC when approximating higher-dimensional functions.

ATDE's direct moving of the positions µij of the mossy fibers is much more straightforward than moving the granule cells, as proposed in Hormel (1989). But in contrast to Hormel (1989), ATDE does not directly model the cross-correlations between the coordinates and thus may concentrate granule cells in areas where this is not needed to approximate the function. This small drawback is rewarded with very quick, purely CMAC-based algorithms. Furthermore, superfluous cells are often omitted automatically when hash coding is used for the granule cells within the standard CMAC concept (Tolle and Ersü 1992). Unlike Hormel (1989), ATDE adapts not only to the input patterns' distribution but also to the remaining output error. Furthermore, inserting more encoding units in ATDE following Fritzke (1993) is easily possible (Eldracher and Geiger 1994) if a better approximation is needed. Results concerning the insertion of mossy fibers will be reported elsewhere (but are available in an internal report written in German).

Compared to binary encoding or fixed TDE, ATDE in CMAC yields the lowest errors while preserving the advantages of fixed TDE: less computing power and memory consumed, and low parameter-dependent variations. Our approach is very quick. Ten epochs with 1000 training samples take only around half a minute (on an HP Series 9000 Model 700 in multiuser operation), and good results are obtained after just the first epoch. Although in our implementation binary CMAC is about three times quicker per epoch than ATDE CMAC (2.7 times quicker than TDE CMAC), the overall speed is better for ATDE CMAC because far fewer epochs are needed. With our absolute computation times, we are much quicker than many other adaptive coding schemes (e.g., Ritter et al. 1991; Fritzke 1993). If even faster algorithms are necessary in real-time applications, we must keep in mind that with a quick hardware
implementation (with easy addressing but without hash coding), ATDE CMAC possibly wastes a lot of memory for sparsely populated state spaces, because it cannot model the cross-correlations of the single input dimensions. However, a binary CMAC would do the same or even worse, due to its higher number of granule cells. The results show that the difficulty of modeling constant functions with TDE (see Section 4) is only a theoretical drawback: using ATDE, even constant functions are modeled better than with binary CMAC. Besides, the increase in the approximation error from binary CMAC to a CMAC with TDE is only marginal, whereas smoother functions are modeled much better with TDE in CMAC than with a binary CMAC (for detailed examples see Eldracher et al. 1994). The parameter N for the number of neighboring mossy fibers adapted simultaneously is not critical. However, if it is known that a discontinuous function is to be approximated, it is better to choose larger values for N.

As an application in control, we are currently investigating pole balancing. This standard benchmark problem is especially interesting, since a nearly diagonal state space must be approximated. Starting from the ASE/ACE model (Barto et al. 1983), we changed the linear perceptrons of ASE and ACE into CMACs and also got better results using less memory (Lin and Kim 1991). Moreover, using TDE in CMAC speeds up learning convergence: although each trial is about three times slower, we get an absolute speed-up of about three, since we need about ten times fewer trials until the pole is balanced in each epoch. For our first experiments, we used a CMAC with TDE that consumed more memory than the binary CMAC, since values for µ and σ had to be stored for each mossy fiber. Using ATDE CMAC, we can get similar results with further reduced memory consumption (due to a reduced % with equal %ε). However, these experiments are not yet finished; a thorough examination of the expected benefits of ATDE in CMAC for control applications will be reported in detail elsewhere.

Appendix A: Detailed General Model of CMAC

CMAC can be defined as a number of mappings from the input s ∈ S to the mossy fibers m ∈ M, further to the granule cells a ∈ A, and finally to the output r ∈ R (Albus 1981). To simplify the description, we assume each si to be normalized to si ∈ [0, smax[ for all i ∈ In := {1, 2, . . . , n}, and we assume a one-dimensional output space. For higher-dimensional output functions, the same mappings are used, apart from A → R, which is different for each component of the output vector.

A.1 Mapping Input to Output: S → M → A → R. Figure 5 shows the mappings described in the following subsections.

A.1.1 Mapping S → M. Each component si of the n-dimensional input vector s ∈ S is encoded separately by the set Mi = {1, 2, . . . , nM} of mossy fibers.
Figure 5: Example of the encoding in a binary-valued, original CMAC for a two-dimensional input space, % = 4, ε = 0.25, and encoding values in [0, 5.25[².
Each mossy fiber j ∈ Mi is defined through its position µij and its activation function actij. To define µij, the interval [0, smax[ is divided into subintervals of length ε, the resolution of CMAC. For each mossy fiber j ∈ Mi an interval Iij is defined, Iij := [jε − %ε, jε[, where % ∈ {1, 2, . . .}. The positions are defined as the centers of these intervals: µij := jε − %ε/2. As each number in [0, smax[ is contained in exactly % consecutive intervals, % specifies the amount of generalization. We use a normalized gaussian actij with mean µij for each mossy fiber j ∈ Mi. The variance σij defines the overlap between neighboring mossy fibers. For local generalization, actij is defined to be zero wherever it would be smaller than a constant amin:

$$
\mathrm{act}_{ij}(s_i) :=
\begin{cases}
\mathrm{gauss}_{ij}(s_i) & \text{if } \mathrm{gauss}_{ij}(s_i) \ge a_{\min} \\
0 & \text{else}
\end{cases}
\qquad \text{with} \qquad
\mathrm{gauss}_{ij}(s_i) = \exp\!\left(-\frac{1}{2}\left(\frac{s_i - \mu_{ij}}{\sigma_{ij}}\right)^{2}\right).
\tag{A.1}
$$
A mossy fiber is active if its activation is different from zero. Each input coordinate si is encoded by the set of active mossy fibers M*i := {j ∈ Mi | actij(si) > 0}. For every input component si, the activations of all mossy fibers j ∈ Mi can be collected in a row vector of nM elements. Written one beneath the other, these vectors form an n × nM matrix M. If M denotes the set of real n × nM matrices, S → M can be written as

$$
\vec{s} \mapsto M = (m_{ij})_{i \in I_n,\; j \in \{1,\dots,n_M\}}
\qquad \text{with} \qquad
m_{ij} = \mathrm{act}_{ij}(s_i).
\tag{A.2}
$$
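To make the S → M mapping concrete, here is a minimal Python sketch of equations A.1 and A.2. The function and parameter names simply mirror the symbols in the text; any particular choice of %, ε, σ, and amin is up to the experiment.

```python
import math

def act(s_i, mu_ij, sigma_ij, a_min):
    """Equation A.1: gaussian activation around the position mu_ij,
    cut off to zero below a_min so that generalization stays local."""
    g = math.exp(-0.5 * ((s_i - mu_ij) / sigma_ij) ** 2)
    return g if g >= a_min else 0.0

def encode(s, n_m, eps, rho, sigma, a_min):
    """Equation A.2: map an n-dimensional input s to the n x n_M matrix M
    of mossy fiber activations, with positions mu_ij = j*eps - rho*eps/2."""
    return [[act(s_i, j * eps - rho * eps / 2.0, sigma, a_min)
             for j in range(1, n_m + 1)]
            for s_i in s]
```

A granule cell activation (Section A.1.2) is then simply the product of one entry per row of M, taken over mossy fibers congruent modulo %.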
A.1.2 Mapping M → A. Each granule cell v ∈ A is defined by its position µv and its activation function av, and it is named uniquely by the tuple of the numbers ji of the mossy fibers connected to it in each dimension i. To generate a granule cell v, mossy fibers congruent modulo % are combined, so that v is connected to exactly one mossy fiber from each input dimension. It holds that A ⊆ M1 × · · · × Mn. The position µv of a granule cell v = (j1, . . . , jn) ∈ A is defined as the n-tuple of the positions of the mossy fibers connected to v: µv := (µ1j1, . . . , µnjn). The activation function av of a granule cell v is the product of the mossy fiber activation functions, an n-dimensional gaussian. A granule cell becomes active only if all connected mossy fibers are active. The input vector s is encoded by the set of active granule cells A* ⊆ M*1 × · · · × M*n. The mapping M → A maps the matrix M to the vector a of activations av of all granule cells v ∈ A.

A.1.3 Mapping A → R. Each granule cell v ∈ A is provided with a weight wv ∈ R. The weights wv are arranged in the weight vector w. With respect to the training indicator αv ∈ {0, 1} for wv, the output p is computed using only trained weights:

$$
p = \frac{1}{\sum_{v \in A^*} \alpha_v a_v} \sum_{v \in A^*} \alpha_v a_v w_v.
\tag{A.3}
$$
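In code, the output mapping of equation A.3 is a normalized weighted sum over the active granule cells. The sketch below assumes activations, weights, and training indicators are stored in dictionaries keyed by granule cell; the fallback when no trained weight is active is our assumption, since the text does not specify that case.

```python
def cmac_output(active_cells, a, alpha, w):
    """Equation A.3: output p as the weighted average over the active
    granule cells A*, counting only weights marked as trained (alpha = 1)."""
    norm = sum(alpha[v] * a[v] for v in active_cells)
    if norm == 0.0:
        return 0.0  # no trained weight is active (assumed fallback)
    return sum(alpha[v] * a[v] * w[v] for v in active_cells) / norm
```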
A.2 Learning Algorithm. Given the target output p* and a learning rate η (0 < η ≤ 1), the following calculations are performed for each weight wv with v ∈ A*:

$$
\Delta w_v = \frac{a_v}{a_{\mathrm{square}}}
\left( \sum_{u \in A^*} \alpha_u a_u \, p^* - \sum_{u \in A^*} \alpha_u a_u w_u \right),
\qquad
a_{\mathrm{square}} = \sum_{v \in A^*} \alpha_v a_v^2,
$$
$$
w_v^{(\mathrm{new})} =
\begin{cases}
p^* + \eta \, \Delta w_v & \text{if } w_v \text{ is not trained} \\
w_v + \eta \, \Delta w_v & \text{if } w_v \text{ is trained.}
\end{cases}
\tag{A.4}
$$

First, untrained weights are set to the target p*. Then the remaining error is evenly distributed over all weights, so that changes in already trained weights are kept as small as possible to obtain a smooth interpolation (Tolle and Ersü 1992).
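A minimal sketch of one training step, reading equation A.4 literally: the normalizing sums are computed with the indicators and weights as they stand before the update, untrained weights are initialized at the target, and all active weights then receive the evenly distributed remaining error.

```python
def cmac_train_step(active_cells, a, alpha, w, p_star, eta):
    """One application of equation A.4 for target p* and learning rate eta."""
    a_square = sum(alpha[v] * a[v] ** 2 for v in active_cells)
    weighted_target = p_star * sum(alpha[u] * a[u] for u in active_cells)
    weighted_output = sum(alpha[u] * a[u] * w[u] for u in active_cells)
    for v in active_cells:
        if a_square == 0.0:
            delta = 0.0  # nothing trained yet; assumed convention
        else:
            delta = (a[v] / a_square) * (weighted_target - weighted_output)
        if alpha[v] == 0:          # untrained weight: start from the target
            w[v] = p_star + eta * delta
            alpha[v] = 1           # mark the weight as trained
        else:                      # trained weight: distribute remaining error
            w[v] = w[v] + eta * delta
```

With η = 1 and all active weights already trained, one such step reproduces the target exactly at the training point, consistent with the smooth-interpolation goal described above.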
Figure 6: Topologically distributed encoding. x-axis: numerical value s ∈ S; y-axis: receptive fields (gaussians) and activations for a specific value s1 (squares).
Appendix B: Topologically Distributed Encoding

In topologically distributed encoding (Geiger 1990; Eldracher and Waschulzik 1993), every dimension of the input space S ∈ [0, smax[ is encoded by a number of encoding units ordered along a one-dimensional axis. To encode a value s1, the units are activated according to their overlapping, gaussian-shaped receptive fields. Since several units are always active, numerical information is encoded in a topologically distributed fashion (see Fig. 6). Each receptive field is specified by its center µ and standard deviation σ. For the highest possible accuracy of the encoding, we chose

$$
\sigma = \frac{2^{k}\,\Delta\mu}{\sqrt{2}}, \qquad k = 2, 3, \ldots,
$$

where Δµ denotes the difference between two adjacent centers of receptive fields (i.e., mossy fibers). As in topologically distributed encoding all Δµ are equal, one can write ε := Δµ to close the gap to CMAC in Section A.1.1. The resolution depends on the change in the encoding with respect to the change in the input. Thus more encoding units (with steeper sides of the gaussians, having closer centers) help to increase the achievable resolution.

References

Albus, J. S. 1981. Brains, Behaviour, and Robotics. Byte Books.

An, P. C. E., Miller III, W. T., and Parks, P. C. 1991. Design improvements in associative memories for cerebellar model articulation controllers (CMAC). In Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds. Elsevier, Amsterdam.

Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 834-846.
Eldracher, M., and Geiger, H. 1994. Adaptive topologically distributed encoding. In Proceedings of the International Conference on Artificial Neural Networks (ICANN-94), M. Marinaro and P. G. Morasso, eds., 771-774.

Eldracher, M., Staller, A., and Pompl, R. 1994. Function approximation with continuous valued activation functions in CMAC. Forschungsberichte Künstliche Intelligenz FKI-199-94, Institut für Informatik, Technische Universität München.

Eldracher, M., and Waschulzik, T. 1993. A topologically distributed encoding to facilitate learning. Journal of Systems Engineering, 110-119.

Fritzke, B. 1993. Growing cell structures—a self-organizing network for unsupervised and supervised learning. Tech. Rep. TR-93-026, International Computer Science Institute, Berkeley, CA.

Geiger, H. 1990. Storing and processing information in connectionist systems. In Advanced Neural Computers, R. Eckmiller, ed., pp. 271-277. Elsevier Science Publishers B.V., Amsterdam.

Hormel, M. 1989. A self-organizing associative memory system for control applications. In Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 332-339. Morgan Kaufmann, San Mateo, CA.

Kinder, M., and Brauer, W. 1993. Classification of trajectories—extracting invariants with a neural network. Neural Networks 6(7), 1011-1017.

Lin, C., and Kim, H. 1991. Use of CMAC neural networks in reinforcement self-learning control. In Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 1285-1288. Elsevier, Amsterdam.

Mischo, W. S. 1992. Receptive fields for CMAC: An efficient approach. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks (ICANN-92), I. Aleksander and J. G. Taylor, eds. Elsevier, Amsterdam.

Nie, J., and Linkens, D. A. 1993. A fuzzyfied CMAC self-learning controller. In Proceedings of FUZZ-IEEE '93, 500-505.

Ritter, H., Martinetz, T., and Schulten, K. 1991. Neuronale Netze: Eine Einführung in die Neuroinformatik selbstorganisierender Netzwerke. Addison-Wesley, Reading, MA.

Tolle, H., and Ersü, E. 1992. Neurocontrol: Learning Control Systems Inspired by Neuronal Architectures and Human Problem Solving Strategies. Vol. 172 of Lecture Notes in Control and Information Sciences. Springer-Verlag, Berlin.
Received January 9, 1996; accepted May 20, 1996.
Communicated by Misha Mahowald
An Analog Memory Circuit for Spiking Silicon Neurons

John G. Elias, David P. M. Northmore, Wayne Westerman
Departments of Electrical Engineering and Psychology, University of Delaware, Newark, DE 19716 USA
A simple circuit is described that functions as an analog memory whose state and dynamics are directly controlled by pulsatile inputs. The circuit has been incorporated into a silicon neuron with a spatially extensive dendritic tree as a means of controlling the spike firing threshold of an integrate-and-fire soma. Spiking activity generated by the neuron itself and by other units in a network can thereby regulate the neuron's excitability over time periods ranging from milliseconds to many minutes. Experimental results are presented showing applications to temporal edge sharpening, bistable behavior, and a network that learns in the manner of classical conditioning.

1 Introduction

Neuromorphic engineers and other researchers interested in fabricating artificial neurons have long sought a simple and area-efficient on-chip analog memory device. Various types of analog memory devices have been used to store synaptic weights (e.g., Shoemaker et al. 1988; Holler et al. 1989; Lee et al. 1991; Murray and Tarassenko 1994; Hasler et al. 1995), provide offset correction in silicon retinas (Mead 1989), represent calcium concentration in silicon neurons (Mahowald and Douglas 1991), hold state in artificial dendrites (Elias and Northmore 1995), and set conductance parameters for voltage-sensitive channels (Douglas and Mahowald 1995). Other applications of analog memory include setting the threshold and integration time constant of an integrate-and-fire soma and setting the dynamics and conductance parameters of silicon dendrites. Many of these applications may be best served by memory devices that respond rapidly to network activity and thus need to retain state for only short periods of time. As neuromorphic engineers draw closer to realizing accurate hardware models of neurons, the need for a simple analog memory device grows correspondingly.

Before discussing the state of the art in integrated analog memory devices, we provide some points of reference to which existing designs can be compared. We therefore define an ideal analog memory device as one that has the following attributes: fabricated with standard integrated circuit

Neural Computation 9, 419-440 (1997)
© 1997 Massachusetts Institute of Technology
(IC) processing methods, nonvolatile storage, high precision, wide dynamic range, rapid and easy programmability, small footprint, and operation from a single power supply. There may be other desirable characteristics, but these few represent the bulk of what is needed and what is typically difficult to achieve.

The capacitor is the basis for most IC analog memory, as well as for several forms of what is generally regarded as digital memory. An ideal capacitor, once charged, retains its state indefinitely. It is free of voltage, temperature, and frequency effects, and its state is easily changed over a wide range. Of all integrated devices, the simple capacitor comes closest to achieving ideal behavior. Various types of integrated planar capacitors are easily fabricated using the metal-oxide-semiconductor (MOS) standard fabrication process available through the MOS integration service (MOSIS). Foremost among these are the poly-oxide-poly and poly-oxide-substrate forms, with the former having more ideal characteristics. Because of the extreme purity of the oxide layer used to form the dielectric in these types of capacitors, they can hold their state for long periods of time. For poly-oxide-substrate capacitors, the charge leakage is so low that significant amounts of information can be retained for decades (Advanced Micro Devices 1995).

When integrated capacitors are connected to other IC elements, such as the drain or source terminals of a MOS field effect transistor (MOSFET), their charge retention times are greatly reduced because of unavoidable charge leakage pathways in the additional circuit elements. With MOSFETs, the charge leakage is primarily through reverse-biased PN junctions. Although small in magnitude, the leakage can still be large enough to drain a small capacitor of its charge in a matter of seconds. Capacitor-based digital memory such as dynamic random access memory (DRAM) therefore needs to be refreshed (recharged) periodically to maintain state. Analog memory may have to be refreshed as well, and since a precise voltage level might be required, the refresh process may present implementation difficulties.

1.1 CAPS-and-DAC. Of the three analog memories discussed in this paper, the simplest to understand is an array of capacitors recharged by a digital-to-analog converter (CAPS-and-DAC), whose basic architecture is shown in Figure 1a. The voltage on any one capacitor is precisely set by a digital-to-analog converter (DAC) and de-multiplexer. Address lines select the memory location to be set, and data lines to the DAC encode the desired voltage. Since there are unavoidable leakage paths, the voltage of each memory cell must be refreshed periodically using digitally encoded states, typically stored off-chip in conventional digital memory (SRAM or DRAM). The de-multiplexer circuitry is normally integrated along with the capacitor array, but the DAC could be located off-chip to save space and reduce pin count.
Figure 1: (a) Basic architecture for CAPS-and-DAC analog memory. An n-bit DAC is used to refresh capacitor voltages through a switching network. A k-bit address applied to the de-multiplexer connects a particular capacitor in the array to the DAC output, which sets its state to the desired value. (b) Plot of the minimum refresh frequency needed to maintain a desired level of precision with CAPS-and-DAC analog memory when the decay time constant is 10 sec. (c) Basic architecture for an array of floating gates. A k-bit address selects a particular floating gate for electron removal or electron addition. The direction and rate of charge flow during programming depend on the sign and magnitude of the high voltage (HV) inputs.
The rate of voltage decay due to charge leakage in CAPS-and-DAC memory depends on temperature, leakage pathway characteristics, and the capacity of each capacitor. The decay rate measured as a percentage of the initial state is independent of the initial voltage. Therefore, maintaining state to a certain precision requires refreshing frequently enough to keep the associated voltage ripple, expressed as a binary fraction of the full-scale swing, below 0.5 of a least significant bit (LSB). If we assume the decay is a single exponential, then the refresh frequency needed to maintain state with a ripple of less than 0.5 LSB is given by

$$
f = \frac{-1}{\tau \cdot \log\left[1 - \dfrac{V_{fs}}{2\,V_0 \cdot (2^{N} - 1)}\right]},
\tag{1.1}
$$

where τ is the time constant of the voltage decay, Vfs is the full-scale voltage, V0 is the desired state, and N is the number of bits of precision. The worst case (the highest required refresh frequency) occurs when the desired state is equal to Vfs. Figure 1b is a plot of the worst-case refresh frequency needed for various values of N when the time constant for decay is 10 seconds. With this time constant, a 200 Hz refresh rate is needed to maintain a precision of 10 bits, while only a 12 Hz rate is needed to maintain 6 bits of precision.
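As a quick sanity check on equation 1.1, the following sketch reproduces the worst-case numbers quoted above for a 10 sec decay time constant. The absolute value of Vfs is immaterial here, since only the ratio Vfs/V0 enters.

```python
import math

def refresh_frequency(tau, v_fs, v0, n_bits):
    """Equation 1.1: minimum refresh rate that keeps the voltage ripple
    below 0.5 LSB for a single-exponential decay with time constant tau."""
    return -1.0 / (tau * math.log(1.0 - v_fs / (2.0 * v0 * (2 ** n_bits - 1))))

# Worst case (V0 = Vfs) with tau = 10 sec, as in Figure 1b:
for n in (6, 10):
    print(n, "bits:", refresh_frequency(10.0, 1.0, 1.0, n), "Hz")
# about 12.5 Hz for 6 bits and about 204.6 Hz for 10 bits,
# matching the ~12 Hz and ~200 Hz figures quoted above.
```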
In addition to the voltage ripple caused by the continuous decay and recharge of the capacitor, there are other noise effects having to do with connecting the target capacitor to the voltage source during setting and refreshing operations. These effects, generally called clock feedthrough (e.g., Sheu and Hu 1983; Wilson et al. 1985), can be significant depending on the design of the switching circuitry, the input voltage, and the size of the memory capacitance. The noise voltage generated is due to two effects: (1) capacitive coupling between the gate and drain electrodes, which allows a portion of the gate signal to couple into the holding capacitor, and (2) mobile charge left in the channel between the drain and source electrodes when the transistor is turned off, a portion of which flows to the holding capacitor, thus changing its state.

1.2 Floating Gates. This type of analog memory (Frohman-Bentchkowsky 1971) has several desirable characteristics: its charge retention time is extremely long, a memory cell requires only a single minimum-size transistor, and it is possible to set its state precisely. In these devices, there are essentially no pathways through which charge can flow, because the capacitor serving as the memory is completely isolated from all other circuits. The capacitor is also the gate of a MOSFET and therefore directly affects the transistor's operation. The state of the capacitor can be set using Fowler-Nordheim tunneling (Lenzlinger and Snow 1969) or, in some implementations, using both tunneling and hot electron injection (Diorio et al. 1995). Both methods require elevated voltages and a relatively long time to set the desired state compared to CAPS-and-DAC memory. In general, the
charging rate of the capacitor (floating gate) is not well known. Therefore, the accuracy and precision to which the memory can be set are limited unless a feedback technique is used (Diorio et al. 1995). Floating gate memory devices were proposed for use in artificial neural networks by Alspector (1987) and were integrated as a large array of over 10,000 memory elements by Holler et al. (1989). Figure 1c shows the basic architecture of a floating gate memory array in which each memory element needs to be accessed only once to set its state, which can then be stable for years.

Despite the many positive attributes of CAPS-and-DAC and floating gate analog memories, neither is very suitable as a network-dependent short-term memory element in artificial nervous systems. (By short term, we mean time scales of tens of microseconds to hundreds of seconds.) To make CAPS-and-DAC memory devices network-dependent would require constant monitoring of network activity and computing a new state for each memory cell. This could potentially be done using a dedicated processor, but it would add considerably to system complexity. As a long-term adaptive memory element, devices based on floating gates are unequaled (Hasler et al. 1995), but they require high voltages for programming, and their dynamics depend significantly on these voltage levels. (By dynamics, we mean the rate of change of the memory state.) In practice, it may be difficult to change floating gate memory dynamics on an individual basis, since each would require a different high-voltage level. In our work, we sought a memory device that could be fabricated in arrays, that required no high-voltage supplies, and whose dynamics and state were easily set by pulsatile network activity. To this end, we developed a simple four-transistor analog memory circuit, which we call a flux capacitor.

2 Flux Capacitor

The flux capacitor is based on techniques used with CAPS-and-DAC analog memory and switched capacitors (Allen and Sanchez-Sinencio 1984). It uses a conventional MOS capacitor to hold charge, but its state is set by two opposing local synapses: an upper synapse that increases the capacitor voltage and a lower synapse that decreases it. A diagram of the circuit is shown in Figure 2a. On every upper (Su) synaptic activation, the holding capacitor, Ch, receives a small packet of charge from the upper capacitor, Cu. And with every lower (SL) synaptic activation, Ch gives up a similarly small packet of charge to the lower capacitor, CL. Therefore, setting the flux capacitor's state to any desired level requires a sequence of upper or lower synaptic activations. The effect that each activation has on the state of the flux capacitor is determined by the synaptic weight, which is set by the size of the synapse capacitors, Cu or CL, relative to the holding capacitor, Ch.

Each synapse (see Fig. 2b) has three components: two MOS transistors connected in series, drain to source, and a small capacitor that connects between their common node and ground. Lower synapses use n-channel
Figure 2: (a) Basic circuit for flux capacitor analog memory. Su and SL are the upper and lower synapses, respectively; Vu and VL are the corresponding synaptic supply potentials; state is held on Ch; Cu and CL relative to Ch set the effective synaptic weights. Each synapse is made up of two MOS transistors in series. (b) Synapse circuit schematic and VLSI layout. Only two MOS transistors make up each synapse. The physical layout is minimum size and makes use of the parasitic diffusion capacitance between the transistors to realize Cu or CL. (c) Measured flux capacitor output voltage as a function of the desired voltage computed from equation 2.4.
transistors; upper synapses use p-channel transistors. The two transistors of a particular synapse never turn on at the same time. Synaptic activation proceeds by first briefly (∼50 nsec) turning on the transistor closest to the supply voltage source (Vu or VL) to charge the synapse capacitor to that potential. This is followed by briefly turning on the transistor nearest the holding capacitor, which either adds or removes a certain amount of charge from Ch. The state and dynamics of the memory are set by the average charge flux entering and leaving the circuit through its synapses. The synapses
are directly activated by afferent pulsatile signals coming from multiple sources (neuromorphs) anywhere in a network. A flux capacitor analog memory can represent transient or sustained information concerning the system and, with the appropriate connections, can respond to a wide range of network activities. Several examples of its utility as a dynamic memory for the spike firing threshold will be presented later.

2.1 Statics. The steady-state (average) flux capacitor voltage is given by

$$
V = \frac{f_u \cdot W_u \cdot V_u + f_L \cdot W_L \cdot V_L + g \cdot V_g}{f_u \cdot W_u + f_L \cdot W_L + g},
\tag{2.1}
$$
where fu and fL are, respectively, the average frequencies at which the upper and lower synapses are activated, Wu and WL are the respective synaptic weights, and g accounts for the state decay due to charge leaking away to the voltage source Vg. Here we assume that the charge leakage between synaptic activations can be represented by a single exponential with time constant RCh, where R represents the resistance of the various unwanted charge-leakage pathways. The average period between synaptic activations is T. Equation 2.1 can also be solved for the ratio of upper to lower rates of synaptic activation given the desired voltage V (the leading minus sign makes the ratio positive, since V − Vu is negative for any reachable state):

$$
\frac{f_u}{f_L} = -\,\frac{W_L \cdot (V - V_L) + g \cdot T \cdot (V - V_g)}{W_u \cdot (V - V_u) + g \cdot T \cdot (V - V_g)}.
\tag{2.2}
$$
Definitions of the various parameters are:

$$
W_u = \frac{C_u}{C_u + C_h} \ \ \text{(upper weight)}, \qquad
W_L = \frac{C_L}{C_L + C_h} \ \ \text{(lower weight)},
$$
$$
g = \frac{1}{T}\left(1 - e^{-T/(R C_h)}\right) \ \ \text{(decay factor)}, \qquad
T = \frac{1}{f_L + f_u} \ \ \text{(average synapse activation period)}.
\tag{2.3}
$$
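The statics are easy to reproduce numerically. Below is a short sketch of equations 2.1 and 2.3, with parameter values borrowed from the Figure 3a caption as an assumed example.

```python
import math

def steady_state_voltage(fu, fl, wu, wl, vu, vl, r_ch, vg=0.0):
    """Equation 2.1: average flux capacitor voltage for upper/lower
    activation rates fu, fl and an RCh leakage time constant r_ch."""
    t = 1.0 / (fu + fl)                   # average period (equation 2.3)
    g = (1.0 - math.exp(-t / r_ch)) / t   # decay factor (equation 2.3)
    return (fu * wu * vu + fl * wl * vl + g * vg) / (fu * wu + fl * wl + g)

# Wu = WL = 0.01, RCh = 700 sec, Vu = 5, VL = Vg = 0, fu = fL = 10 Hz:
print(steady_state_voltage(10.0, 10.0, 0.01, 0.01, 5.0, 0.0, 700.0))
# about 2.48 V, close to the simplified equation 2.4: Vu / (1 + fL/fu) = 2.5 V
```

Equation 2.2 simply inverts this relation to find the activation ratio that produces a desired voltage.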
Experimentally, RCh time constants have been found to be fairly long (∼700 sec with Ch = 0.5 pF) compared to synaptic activation frequencies, which in our system typically range between 0.1 and 500 activations per second. Equation 2.1 simplifies considerably under easily arranged conditions: the synaptic weights are equal (CL = Cu), the lower supply voltage VL and leakage voltage Vg are zero, and the decay time constant RCh is much larger than the period between synaptic activations, T:

$$
V = \frac{V_u}{1 + f_L / f_u}.
\tag{2.4}
$$
Figure 2c shows the static output of a flux capacitor as a function of the desired voltage computed from equation 2.4. The desired voltage determined
the activation frequency ratio of lower to upper synapses, which was then applied to the flux capacitor. To enable measurement of the flux capacitor state, we used an on-chip source follower to buffer its output. Therefore, all reported measurements are approximately 1 volt lower than the actual flux capacitor voltage. In addition, memory voltages less than 1 volt could not be measured because of the limited lower range of the source-follower buffer.

2.2 Noise. The voltage on the holding capacitor, Ch, is always in a state of flux due to discrete synaptic events. Therefore the steady-state capacitor voltage has a corresponding noise component, whose magnitude determines the precision to which the state of the flux capacitor can be set. The change in flux capacitor state due to each synaptic activation is given by

$$
\Delta V_u = (W_u + SN_u) \cdot (V_u - V_h), \qquad
\Delta V_L = (W_L + SN_L) \cdot (V_L - V_h),
\tag{2.5}
$$
where Vh is the state before the synapse is activated, ΔVu is the change in voltage due to activating an upper synapse, ΔVL is the change due to activating a lower synapse, and SNx is a constant representing the charge injection due to switching (i.e., the clock feedthrough effect). The voltage step depends on the current state, the supply potential, the synapse weight, and the charge injection factor. Henceforth, references to flux capacitor synaptic weights include the contribution due to charge injection. As with real synapses (e.g., Shepherd and Koch 1990), the change in membrane state due to synaptic activation depends on the instantaneous voltage of the local membrane and on the synaptic conductance (weight). If the synaptic weights are equal, the maximum peak-to-peak noise due to individual synaptic activations is

$$
V_{pp} = \Delta V_u - \Delta V_L = W \cdot (V_u - V_L),
\tag{2.6}
$$
and the precision to which a particular state can be set, in bits, is given by

$$
N = -\log_2 \frac{V_{pp}}{V_u - V_L} = \log_2 \frac{1}{W}.
\tag{2.7}
$$
Therefore, for 8-bit precision, the holding capacitor must be at least 256 times larger than either synapse capacitor. Of course, the relative sizes of the capacitors will depend on the application. In our work, we have fabricated and tested flux capacitors that have Ch/Cu ratios between 100 and 2250. In all of our flux capacitor memories, the lower and upper synapse transistors are minimum size, and in most cases the synapse capacitances are equal
in value and derived solely from parasitic sources such as drain diffusions and short segments of wiring, which amounts to an effective capacitance on the order of 1-5 fF. The footprint required for the holding capacitor is not very large for low-precision applications. For example, a 6-bit flux capacitor requires a Ch footprint of about 25 × 25 square microns for a standard IC process available through MOSIS. The data shown in Figures 2, 3, and 4 were taken from a flux capacitor that had a Ch of approximately 0.5 pF, so its maximum effective precision was nearly 7 bits.

Since there is an inevitable change in state due to charge leakage, the precision to which a particular state can be set also depends on the average synaptic activation period, T. In general, the activation rate must be significantly faster than the leakage time constant to minimize the reduction in bit precision, if bit precision is important for the application. Figure 3a shows experimental data plotting flux capacitor output voltage as a function of the average activation rate when both upper and lower synapses were activated at the same rate. The experimental data are plotted along with the expected output computed using equation 2.1 with parameters set to their measured or estimated values. A constant voltage of 0.9 was added to the experimental data to compensate for the source-follower transfer function. As can be seen, the output from this flux capacitor is fairly constant for frequencies above approximately 10 Hz.

An upper bound on the reduction in bit precision can be determined by equating the peak-to-peak noise voltage (cf. equation 2.6) to an expression describing the decay in voltage due to charge leakage alone, as in equation 2.8 (left side). Thus, when the synaptic weights are equal (W = Wu = WL), the degradation in precision from the theoretical value given by equation 2.7 can be limited to 1 bit by ensuring that the average activation period, T, is no greater than the bound in equation 2.8 (right side):

$$
W \cdot (V_u - V_L) = (V_h - V_L) \cdot \left(1 - e^{-T/(R C_h)}\right),
\qquad
T \le -R \cdot C_h \cdot \log\left(1 - \frac{(V_u - V_L) \cdot W}{V_h - V_L}\right).
\tag{2.8}
$$
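A brief sketch of the precision bookkeeping in equations 2.7 and 2.8 follows; the mid-range state Vh = 2.5 V and the 700 sec leakage time constant are assumed example values consistent with the devices described above.

```python
import math

def precision_bits(w):
    """Equation 2.7: attainable precision, in bits, when both synaptic
    weights equal w (one synaptic step is the peak-to-peak ripple)."""
    return math.log2(1.0 / w)

def max_activation_period(r_ch, w, vu, vl, vh):
    """Equation 2.8 (right side): longest average activation period T that
    limits the leakage-induced degradation in precision to about 1 bit."""
    return -r_ch * math.log(1.0 - w * (vu - vl) / (vh - vl))

print(precision_bits(1.0 / 256))  # 8.0 bits when Ch is 256x the synapse capacitor
print(max_activation_period(700.0, 1.0 / 128, 5.0, 0.0, 2.5))  # about 11 sec
```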
2.3 Dynamics. Although the flux capacitor could be used as a type of static analog memory device whose bit precision is determined by the decay and synaptic activation rates, its principal intended use is as a network-dependent dynamic memory. In nervous systems, there may exist many examples of adaptive, network-dependent effects, such as membrane potentiation, modification of dendrite cable properties, or even structural changes, that may serve a short-term memory function. Such state changes could play an important role in modifying and regulating behavior. In artificial nervous systems, network-dependent short-term memories may be essential for similar purposes. Although floating gate devices have been proposed as the basis for an adaptive memory (e.g., Hasler et al. 1995), we believe that
Figure 3: (a) Flux capacitor output as a function of average activation frequency (1/T) when both upper and lower synapses are activated at the same rate. Theoretical curve computed from equation 2.1 using the following parameters: WL = Wu = 0.01, RCh = 700 sec, Vu = 5.0, fu = fL, VL = 0, and Vg = 0. (b) Measured flux capacitor voltage decay from a set state after synaptic activation ceases. Two different devices are shown: flux capacitor 9 (FC9) had a 25 × 40 square micron MOS holding capacitor; flux capacitor 3 (FC3) had a 115 × 115 square micron poly-poly capacitor. Calculated time constants are 700 sec for FC9 and 2700 sec for FC3.
the flux capacitor offers more advantages for use in short-term, network-dependent memory applications. In particular, the flux capacitor requires no elevated voltages, its memory state and dynamics are determined by direct synaptic activation, and its state can be set precisely without additional feedback or monitoring circuitry.

Equation 2.5 shows that each synaptic activation produces either an upward or a downward movement of the flux capacitor voltage. The dynamics are therefore entirely determined by the synaptic activation rate: the more slowly the flux capacitor synapses are activated, the more slowly its state changes (this assumes that the leakage time constant is long compared to the average period between synaptic activations). The fastest rate at which synapses can be activated depends on the connection system employed (Mahowald 1991; Elias 1993). With our existing virtual wire connection system, flux capacitor synapses could be activated as fast as 10^7 activations per second.

The state trajectory that the flux capacitor follows depends on the ordering of synaptic activations. The final state, however, is independent of the order of activating a fixed number of upper and a fixed number of lower synapses, provided that the supply potentials (Vu or VL) were not encountered at some point along the state trajectory. Figure 4 shows three examples of the flux capacitor state following a prescribed path. Each trajectory was generated by a particular temporal pattern of upper and lower synaptic activations applied to a flux capacitor through our virtual wire network connection circuit. The trajectories can easily be made to fit a wide range of different time scales. As an example, the state trajectory shown in Figure 4b was generated with equal fidelity over several time scales ranging from 800 µsec to 80 sec. The limits to how far the time scale can be stretched or compressed depend, respectively, on the leakage decay rate and on the maximum rate at which synapses can be activated.

Since the flux capacitor state can be rapidly changed and set by synaptic activations, the question arises of how to embed the circuit in a network in order to produce desirable (i.e., appropriately adapting) behavior. In the next section, we present several experiments that show interesting adaptation responses to specific network activity when the flux capacitor state is used as the spike firing threshold for silicon neurons.

3 Network-Dependent Spike Threshold

3.1 Hardware System. Network-dependency experiments were conducted using our silicon neuron, or neuromorph (Elias and Northmore 1995), with an integrated flux capacitor as a means of setting the spike firing threshold of its integrate-and-fire "soma" (see Fig. 5). The dendritic branches of the neuromorph are composed of a series of compartments, each with a capacitor representing a membrane capacitance and two programmable resistors representing the axial cytoplasmic and membrane resistances. Synapses at each compartment (small vertical lines in Fig. 5) are
Figure 4: State trajectories measured from a flux capacitor using an on-chip source follower. The temporal patterns of synaptic activation that produced the trajectories were delivered to flux capacitor synapses by a network connection circuit called virtual wires (Elias 1993). (a) Sine. (b) Damped sine. (c) Sum of three sine waves. The same or arbitrary trajectories can be generated over widely different time intervals. For example, data shown in (b) were collected using several different sampling periods spanning 800 µsec to 80 sec.
emulated by a pair of MOSFETs, one of which, the excitatory synapse, enables inward current, moving the potential of the membrane capacitance in a depolarizing direction. The other transistor, emulating an inhibitory synapse, enables outward current, hyperpolarizing the membrane capacitance. There are also shunting or silent inhibitory synapses that pull the membrane potential toward a potential close to the resting value. The potential appearing at the soma (Vs ; see Fig. 5) determines the output spike firing rate in conjunction with the integration time constant, RC, and the threshold, Vth . The spike firing threshold is derived directly from a flux capacitor whose state is set by network spiking: activating the upper threshold
Figure 5: A silicon neuron with an artificial dendritic tree (ADT) and integrateand-fire “soma.” Vs is “membrane voltage” at the soma; R and C determine the integration time constant; R is programmable. The spike firing threshold, Vth , is the state of a flux capacitor, which is set by network activity: activating upper threshold synapse (filled circle) raises spike firing threshold voltage; activating lower synapse reduces it.
synapse (filled circle) raises the spike firing threshold voltage; activating the lower threshold synapse (open circle) reduces it. All synapses (on the dendrites and on the threshold-setting flux capacitor) are activated by a 50 nsec impulse signal.

Experiments were performed using multiple neuromorphs having four- and eight-branched dendritic trees. The dendrite dynamics were set to give membrane time constants in the range of 10 msec. The neuromorphs were embedded in the "virtual wire system" previously described (Elias 1993). This system allows spikes generated by thousands of neuromorphs to be distributed with programmable delays to arbitrary synapses throughout a network. A host computer generated the spatiotemporal patterns of spikes used as test stimuli for the network. It also recorded the soma membrane potential, Vs, and its spiking threshold, Vth, and read the "spike history," which is part of the virtual wire system, to obtain the times of occurrence of the output spikes generated by all of the neuromorphs.

3.2 Sustained and Transient Responses. Figure 6a depicts connections to a neuromorph showing how its excitability can be controlled by network activity. In the absence of other inputs to the threshold-setting synapses, a fixed threshold for the integrate-and-fire soma is achieved by delivering tonic spike trains with frequencies fu and fL to the upper and lower threshold-setting synapses of the soma (see equation 2.1). Using the virtual wire system, these spike trains could be derived from the activity of any neuromorph in the network, but for experimental purposes they are conveniently generated by pacemaker neuromorphs whose uniform spiking rates can be easily programmed. Because of the long time constant of the flux capacitor (∼700 sec), the pacemaker frequencies need to be only a few spikes per second.
Figure 6: (a) Two-branch dendritic tree neuromorph. A spike train of 1 sec duration is input to a proximal excitatory synapse on one branch. Tonic spike trains of frequency fu and fL are delivered to the upper and lower threshold-setting synapses, along with various amounts of feedback to the upper synapse from the output. (b) Soma voltage (Vsoma) and threshold voltage (Vth) as a function of time, together with output spikes. At 0.5 sec (arrow), a 100 spikes per second input train was applied to the excitatory synapse for 1 sec while fu = 0 and fL = 10. When the output stops firing (∼1.5 sec), the threshold slowly returns to the previous level due to tonic activation of the lower threshold synapse. (c) Spike output frequency for four different combinations of fu, fL, and feedback. Curve 1: fu = 12.5, fL = 20, no feedback; curve 2: fu = 42, fL = 111, feedback = one upper activation per output spike; curve 3: fu = 0, fL = 12.5, feedback = one upper activation per output spike; curve 4: fu = 0, fL = 12.5, feedback = two upper activations per output spike.
An example of self-regulation of neuromorph excitability occurs when tonic spike trains are supplied to the threshold-setting synapses and the neuromorph's own output spikes are fed back with zero delay to its threshold-setting synapses. The tonic spike input sets the spike firing threshold according to equation 2.1, which, if below the "membrane-resting potential,"
produces output spiking in the absence of any dendritic input. The feedback connections adjust the spike firing threshold (cf. equation 2.5) every time the neuromorph fires. In the absence of any synaptic input to the dendritic tree, the neuromorph achieves an average firing rate of f0 spikes per second given by

$$
f_0 = \frac{-1}{RC \cdot \log\left(1 - \dfrac{V_u}{V_{soma} \cdot (1 + f_L/f_u)}\right)},
\tag{3.1}
$$

where Vsoma is the "membrane potential," RC is the integrate-and-fire soma time constant (see Fig. 5), and fL and fu are the activation frequencies of the lower and upper threshold synapses. The threshold synapse activation frequencies are derived from the feedback of the neuromorph's own output and from the outputs of any other neuromorphs in the network: fu = a·f0 + Σfupper and fL = b·f0 + Σflower, where Σfupper and Σflower represent the summed activation frequencies due to all other neuromorphs whose outputs connect to the threshold-setting synapses, and a and b are the numbers of feedback connections to the upper and lower threshold synapses, respectively.

Figure 6b shows the response evoked by a 100 Hz input spike train of 1 sec duration delivered to an excitatory synapse on the dendritic tree. The soma potential, Vsoma, and the threshold potential, Vth, are plotted together with a sample train of output spikes. A tonic spike train of 12.5 spikes per second was applied to the lower threshold-setting synapse, and the neuromorph's output was fed back to its upper threshold synapse, which resulted in an average output firing rate of around 10 spikes per second. During the 1 sec input train (beginning at the arrow), Vsoma is raised 240 mV, evoking a transient burst of output spikes followed by sustained firing at a slower rate. The output response accommodates because each fed-back output spike raises the soma threshold according to equation 2.5. At the end of the input train, and for more than a second after, the threshold exceeds Vsoma, and spike firing ceases. With each activation of the lower threshold synapse, Vth drops according to equation 2.5. Eventually Vth falls below Vsoma, causing the neuromorph to fire at a regular rate again. Figure 6c (curve 3) shows the instantaneous output firing frequency obtained with these connections.

The sustained response component can be emphasized relative to the transient component by increasing the rate of lower threshold synapse activation with respect to the feedback activation rate of the upper threshold synapse. With no feedback, a pure sustained response is observed (curve 1 in Fig. 6c). With a single feedback activation, an increased transient response occurs (curves 2 and 3). Increasing the number of feedback activations of the upper synapse per output spike generates sharper transient responses (curve 4). In the recordings of curves 1 and 2, an additional tonic input to the upper threshold synapse was employed to sustain approximately the same average firing rate in the absence of input as in the recording of curve 3 (see equation 3.1).
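Because the feedback terms make equation 3.1 implicit in f0, it must be solved self-consistently. The sketch below does this by fixed-point iteration; with the parameter values quoted for Figure 7b in the next section, it converges to roughly 82 spikes per second. The initial guess is our assumption and must place the effective threshold below Vsoma.

```python
import math

def f0_with_feedback(rc, vu, vsoma, a, b, f_upper, f_lower,
                     f0_init=100.0, iters=100):
    """Solve equation 3.1 with fu = a*f0 + f_upper and fL = b*f0 + f_lower
    by fixed-point iteration (requires Vu / (Vsoma*(1 + fL/fu)) < 1)."""
    f0 = f0_init
    for _ in range(iters):
        fu = a * f0 + f_upper
        fl = b * f0 + f_lower
        f0 = -1.0 / (rc * math.log(1.0 - vu / (vsoma * (1.0 + fl / fu))))
    return f0

# Parameters quoted for Figure 7b: RC = 0.005, Vu = 5, Vsoma = 1.88,
# a = 2, b = 4, tonic rates of 20 and 25 spikes per second.
print(f0_with_feedback(0.005, 5.0, 1.88, 2, 4, 20.0, 25.0))  # ~82 spikes/sec
```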
Figure 7: (a) A two-branch dendritic tree neuromorph with connections for bistable behavior. A 1-second-long input spike train of 100 spikes per second was delivered to either a strong excitatory synapse (open circle on dendrite) or to a strong inhibitory synapse (filled circle). Tonic spike trains of 20 and 25 per second activated upper and lower threshold-setting synapses, respectively. Each output spike activated the upper threshold synapse twice and the lower threshold synapse four times. (b–e) Soma voltage (Vsoma ), threshold (Vth ), and peri-stimulus time histogram of output spike firing. (b) Response to excitatory input in low threshold state. Excitatory input does not cause low-to-high threshold state transition. (c) Inhibitory input effects change from low to high threshold state. (d) Inhibitory input in high threshold state does not affect state. (e) Excitatory input changes state from high to low threshold.
3.3 Bistable Behavior. The connections to the threshold-setting synapses shown in Figure 7a cause Vth to move to one of two defined levels, allowing the neuromorph to exhibit bistable behavior. Tonic spike trains of 20 and 25 spikes per second activating the upper and lower synapses alone move Vth to ∼2.22 V (cf. equation 2.4 with Vu = 5). This threshold value prevents the neuromorph from firing spontaneously. However, when the neuromorph receives a brief burst of excitatory dendritic input, causing it to fire, the feedback activations move Vth low enough (∼1.7 V) to enable continuous firing in the absence of excitatory input. The soma firing frequency is then sufficient to hold the threshold low and to overcome the tendency of the tonic spike trains to raise the threshold. Figure 7b shows the neuromorph firing at a spontaneous rate of about 82 per second (cf. equation 3.1 with Σfupper = 20, Σflower = 25, RC = 0.005, Vu = 5, Vsoma = 1.88, a = 2, b = 4), the rate rising temporarily under the influence of a 1 sec train of spikes to an excitatory synapse on the dendrite. In Figure 7c, a strong inhibitory input
to the dendrite shuts off firing, allowing Vth to rise to its upper level under the influence of the tonic input spikes. It remains in the high-threshold state (see Fig. 7d) until a strong excitatory input on the dendrite generates sufficient output spiking to switch the neuromorph to the low-threshold state again (see Fig. 7e). The neuromorphic circuit is capable of processing synaptic inputs in either state, provided that these are not strong enough to cause a change of state. In the high-threshold state, inputs are subjected to a thresholding nonlinearity. Excitatory input trains that are brief and relatively weak in their depolarizing effects (e.g., 50 spikes per second for 100 msec activating a proximal synapse) elicit just a few output spikes without causing a state change. In the low-threshold state, there is a much more linear relationship between output and input.

3.4 Classical Conditioning. As a further illustration of the application of the flux capacitor, Figure 8a shows a neuromorphic network that exhibits the main features of classical conditioning. The unconditioned stimulus (US) and conditioned stimulus (CS) are input trains of 100 Hz, each lasting 0.5 sec. Presentation of the US alone reliably excites the Output neuromorph, leading to an unconditioned response. The CS by itself is normally unable to excite Output but may do so via the Associator neuromorph if the Associator threshold is sufficiently low. Lowering this threshold is the function of the Correlator, which fires when coincident spikes occur in the US and CS trains (Northmore and Elias 1996). Thus, during training, when CS and US are paired as shown in Figure 8b, Correlator spikes delivered to the lower synapse of the Associator's flux capacitor gradually decrement its Vth until the CS alone is able to excite Output.

A mechanism to limit threshold lowering in the Associator is required to prevent it from firing excessively and to raise its threshold when CS and US are uncorrelated (i.e., to extinguish the conditioned response). The threshold synapses of the Associator are not driven by tonic spike trains, because this would shorten the time constants of the Vth changes. The role of the Differentiator, then, is to convert trains of spikes from several sources into brief bursts that raise the Associator's threshold. This is achieved by feeding back the Differentiator's output spikes to its own threshold-setting synapses in the ratio 4:2 (upper:lower). The effect of six threshold-setting synaptic activations per output spike ensures that the Differentiator's threshold moves quickly to a higher level, which tends to shut off further firing. The same principle was used to produce the transient responses shown in Figure 6.

Figure 8b shows a typical sequence of spike trains before and during conditioning and during extinction with the CS alone. Prior to training, the output response is evoked by the US but not by the CS. As training proceeds by pairing CS and US so that they overlap in time, the Correlator fires, lowering the Associator's threshold. After a few pairings of CS and US, the Associator's response gains in vigor and comes to anticipate the US. The CS alone is then sufficient to elicit a strong output response. Then, however, continued
Figure 8: Classical conditioning in a four-neuromorph network. (a) Inputs to the network are the unconditioned stimulus (US) and the conditioned stimulus (CS), both 100 Hz trains of 0.5 sec duration. All neuromorphs have four equal dendritic branches. Open circles on the dendrites show excitatory synaptic connections; filled circles, inhibitory. Circles on the neuromorph soma are the upper and lower threshold-setting synapses. The Correlator and Output neuromorph thresholds are set by tonic spike trains to their upper and lower synapses (connections not shown). Feedback to the Differentiator threshold synapses is in the ratio 4:2, upper to lower. Spike firing activity of all four neuromorphs is shown in (b). Left column, upper: before training, US alone evokes output response; CS alone evokes no activity. Training occurs by overlapped pairing of CS and US (lower left and upper right columns). The Correlator fires during the overlap of the CS and US. The Associator gradually starts responding and, with the Output response, comes to anticipate the US. Lower right column: CS presented alone. Output fires but gradually extinguishes because the Differentiator acts to raise the Associator's threshold.
Then, however, continued firing of the Differentiator drives up the Associator's threshold, leading to extinction of the conditioned response. In the absence of input activation, the persistence of the conditioned association, once formed, depends only on the decay of flux capacitor state.

4 Discussion

Negative feedback, provided by routing output spikes to the upper threshold synapse, allows neuromorphs to adjust their own excitability, thus stabilizing the spike firing frequency of the integrate-and-fire soma. Negative feedback also minimizes the effects due to variation between individual neuromorphs. However, it is control over the flux capacitor dynamics that provides temporal processing capabilities of generally longer duration than those obtained through the dendritic tree alone (Northmore and Elias 1996). If the threshold is made to rise rapidly with output firing, the strongly transient responses shown in Figure 6 result. With the addition of interneuromorphs, feedback spike frequency can be divided down, making for much slower rates of change in excitability. Thus, on a relatively short time scale, rapidly adapting effects giving temporal edge enhancement can be obtained; with longer dynamics, habituating-type effects may be obtained. Sensory systems commonly contain neurons with different temporal signaling characteristics; some are transient in nature, responding best to stimulus onsets and offsets, while others respond in a more sustained fashion. Such differences may be evident at the level of sensory receptors (e.g., mechanoreceptors in the skin), or they may be accentuated by subsequent neural filtering. Thus, for example, high-pass filtering in the insect compound eye (McLaughlin 1981) and in the vertebrate retina (Dowling 1987) generates signals that mark only onsets or offsets of stimulation. Sensory systems generally transmit both sustained and transient information in parallel pathways (e.g., M- and P-cell systems; RA and SA in the somatosensory system), probably as a way of maximizing channel capacity. Transients convey information that commands attention and possibly rapid response, whereas sustained responses signal the continued presence of stimuli, allowing a slower, more discriminative processing. The feedback control of neuromorph excitability makes possible a range of sustained and transient signaling that could be as important for robotics as it is for biological sensory systems. A useful property of the flux capacitor is that it can be set to particular voltage levels depending on the ratios of activations of the upper and lower synapses. Moreover, the level can be set by prolonged trains or by very brief bursts of activations, the level depending almost entirely on the ratio of the number of upper to lower synapse activations (see equation 2.4). This property allows the threshold to settle at various levels depending on which spike source dominates the threshold-setting mechanism at any
given time. The resulting bistable behavior is reminiscent of the two or more modes exhibited by neurons at various sites in the central nervous system, most notably the thalamus (McCormick et al. 1992), where input spikes are processed differently according to the threshold state. In the neuromorphic circuit illustrated in Figure 7, processing is approximately linear for small signals in the low-threshold state and nonlinear in the high-threshold state. The classical conditioning network was intended as a small-scale demonstration of how flux capacitors in a network can be used for local memory storage. In this illustration, they are used either as an effective weight storage element that may be altered by training (e.g., in the Associator) or as a memory for determining the temporal characteristics of neuromorph response (as in the Differentiator). We envision that large numbers of flux capacitors, being compact, could be distributed throughout networks as short- to intermediate-term storage to control synaptic weights and other neuromorph parameters. The "virtual wire" interconnection system (Elias 1993) that made the tiny classical conditioning network possible is designed to route spikes generated by thousands of sources, including neuromorphs and external sensors, to even larger numbers of synaptic destinations, many of which will be flux capacitor storage elements.

5 Conclusions

We have presented a design for a new type of simple, four-transistor integrated analog memory circuit, the flux capacitor. Its state and dynamics are set by the average charge flux through its two synapses, which are directly activated with pulsatile signals originating from an arbitrary number of neuromorph sources. The principal use of the flux capacitor is intended to be as a network-dependent dynamic memory in artificial nervous systems. Its state may be incremented and decremented over a wide voltage range, as well as being set to particular levels, all by means of the spiking activity generated in a network. When its state is used as a threshold voltage for an integrate-and-fire neuromorph, output firing adaptation is possible, as demonstrated here. Work is underway to use arrays of flux capacitors to bring dendrite dynamics, soma integration time constants, and synaptic conductances under the control of activity in a network.

Acknowledgments

The research reported in this article was supported by National Science Foundation grants BCS-9315879 and BEF-9511674.

References

Advanced Micro Devices. 1995. Data sheet for AM29F010 Flash Digital Memory. Sunnyvale, CA.
Alspector, J. 1987. A neuromorphic VLSI learning system. Proc. 1987 Stanford Conf. Advanced Research in VLSI, 313–349.
Allen, P. E., and Sanchez-Sinencio, E. 1984. Switched Capacitor Circuits. Van Nostrand Reinhold, New York.
Diorio, C., Hasler, P., Minch, B., and Mead, C. 1995. A high-resolution non-volatile analog memory cell. IEEE Int. Symposium on Circuits and Systems, 2233–2236.
Douglas, R., and Mahowald, M. A. 1995. Personal communication.
Dowling, J. 1987. The Retina: An Approachable Part of the Brain. Harvard University Press, Cambridge, MA.
Elias, J. G. 1993. Artificial dendritic trees. Neural Comp. 5, 648–664.
Elias, J. G., and Northmore, D. P. M. 1995. Switched-capacitor neuromorphs with wide-range variable dynamics. IEEE Trans. Neural Networks 6(6), 1542–1548.
Frohman-Bentchkowsky, D. 1971. Memory behavior in a floating-gate avalanche-injection MOS (FAMOS) structure. Applied Physics Letters 18.
Hasler, P., Diorio, C., Minch, B. A., and Mead, C. 1995. Single transistor learning synapses. Advances in Neural Information Processing Systems 7, 817–824.
Holler, M., Tam, S., Castro, H., and Benson, R. 1989. An electrically trainable artificial neural network (ETANN) with 10240 "floating gate" synapses. International Joint Conference on Neural Networks, Washington, DC, II-191–196.
Lee, B. W., Sheu, B. J., and Yang, H. 1991. Analog floating-gate synapses for general-purpose VLSI neural computation. IEEE Trans. on Circuits and Systems 38(6), 654–658.
Lenzlinger, M., and Snow, E. H. 1969. Fowler-Nordheim tunneling into thermally grown SiO2. J. Appl. Phys. 40, 278–283.
Mahowald, M. A. 1991. Evolving analog VLSI neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., Chap. 15. Academic Press, San Diego, CA.
Mahowald, M. A., and Douglas, R. 1991. A silicon neuron. Nature 354, 515–518.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Mead, C. 1989. Adaptive retina. In Analog VLSI Implementations of Neural Systems, C. Mead and M. Ismail, eds., 239–246. Kluwer, Norwell, MA.
McCormick, D. A., Huguenard, J., and Strowbridge, B. W. 1992. Determination of state-dependent processing in thalamus by single neuron properties and neuromodulators. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, San Diego, CA.
McLaughlin, S. B. 1981. The role of sensory adaptation in the retina. J. Exp. Biol. 146, 39–62.
Murray, A., and Tarassenko, L. 1994. Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall, London.
Northmore, D. P. M., and Elias, J. G. 1996. Spike train processing by a silicon neuromorph: The role of sublinear summation in dendrites. Neural Comp. 8, 1245–1265.
Shepherd, G. M., and Koch, C. 1990. Dendritic electrotonus and synaptic integration. In Synaptic Organization of the Brain, G. M. Shepherd, ed., 439–473. Oxford University Press, New York.
Sheu, B. J., and Hu, C. M. 1983. Modeling the switch-induced error voltage on a switched capacitor. IEEE Trans. on Circuits and Systems CAS-30(12), 911–913.
Shoemaker, P., Lagnado, I., and Shimabukuro, R. 1988. Artificial neural network implementation with floating gate MOS devices. In Hardware Implementation of Neuron Nets and Synapses, a workshop sponsored by NSF and ONR, San Diego, CA.
Wilson, W. B., Massoud, H. Z., Swanson, E. J., George, R. T., and Fair, R. B. 1985. Measurement and modeling of charge feedthrough in n-channel MOS analog switches. IEEE J. Solid-State Circuits SC-20(6), 1206–1213.
Received January 19, 1996; accepted June 4, 1996.
Communicated by Christopher Bishop and John Platt
Average-Case Learning Curves for Radial Basis Function Networks

Sean B. Holden, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, England
Mahesan Niranjan, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England
The application of statistical physics to the study of the learning curves of feedforward connectionist networks has to date been concerned mostly with perceptron-like networks. Recent work has extended the theory to networks such as committee machines and parity machines, and an important direction for current and future research is the extension of this body of theory to further connectionist networks. In this article, we use this formalism to investigate the learning curves of gaussian radial basis function networks (RBFNs) having fixed basis functions. (These networks have also been called generalized linear regression models.) We address the problem of learning linear and nonlinear, realizable and unrealizable, target rules from noise-free training examples using a stochastic training algorithm. Expressions for the generalization error, defined as the expected error for a network with a given set of parameters, are derived for general gaussian RBFNs, for which all parameters, including centers and spread parameters, are adaptable. Specializing to the case of RBFNs with fixed basis functions (basis functions having parameters chosen without reference to the training examples), we then study the learning curves for these networks in the limit of high temperature.

1 Introduction

Statistical physics has made significant contributions to the study of learning curves for feedforward connectionist networks. (It has also been applied with great success to the analysis of recurrent networks, although we do not consider the latter here. An excellent reference is Amit 1989.) In this context, a learning curve can be defined as a graph of expected generalization performance against training set size; the term is defined in full below. Most of the research that has appeared to date has been concerned with networks that do not have hidden layers and are applied to the learning of similarly structured rules, for example, a perceptron learning another perceptron
from examples (see, for example, Seung et al. 1992). Given that the use of simple networks such as these in practice is quite rare, the study of more complex networks and rules constitutes one of the most important directions for continuing research in this area (Watkin et al. 1993). Most of the analysis to date for more complex networks has been confined to committee machines (see, for example, Schwarze et al. 1992 and Barkai et al. 1992) and parity machines (Watkin et al. 1993 and references therein), which are again not networks that are applied to any significant extent in practice. This article presents results on the extension of the theory to the class of radial basis function networks (RBFNs), which have been used with considerable success in practical applications (see, for example, Renals & Rohwer 1989 and Niranjan & Fallside 1988). The learning curves for these networks have already been studied in detail in a predominantly worst-case framework based on probably approximately correct (PAC) learning theory (Anthony & Biggs 1992); the results of these studies can be found in Holden and Rayner (1995), Anthony and Holden (1994), and Lee et al. (1994). In this article, we predominantly consider RBFNs having fixed basis functions (that is, basis functions having parameters chosen without reference to the training examples). Section 2 reviews RBFNs, defines the learning problems of interest, provides a brief review of the areas of statistical physics that are relevant to the article, and maps out the mathematical development of the rest of the article. Section 3 addresses the first step in the analysis of a network using this formalism, namely, to calculate the generalization error of the network for the tasks of interest. The generalization error in this context is the expected error, for randomly selected input vectors, of a network having a given set of parameters. Section 4 addresses the derivation of learning curves. It has not been possible to date to obtain completely general results in this part of the analysis, and consequently we investigate networks having fixed basis functions and consider learning curves for the high-temperature limit. Section 5 discusses the results obtained, along with some related work (Freeman & Saad 1995), and provides some suggestions for further work. Section 6 concludes the article.
2 Preliminaries

This section briefly reviews the theory of RBFNs and fully defines the learning problems of interest. It then provides a short introduction to the areas of statistical physics relevant here and explains the mathematical development of the rest of the article. We do not provide a full account of the statistical physics relevant to the analysis of learning curves; such an account can be found in Seung et al. (1992), Watkin et al. (1993), Hertz et al. (1991), and Landau and Lifshitz (1980).
2.1 General Notation. We consider feedforward networks having p real-valued inputs, denoted by the vector x ∈ R^p (vectors are assumed to be column vectors throughout this paper). Input vectors are assumed to be generated independently at random according to a probability density p(x). The network computes a function f(x, w): R^p × R^q → R that depends on a vector w ∈ R^q of q real-valued parameters. Note that we address networks with real-valued outputs; networks of this type are typically applied to problems such as function interpolation, time-series prediction, and so forth. We consider functions f of a particular form, which will be discussed in the next subsection. In general, networks having binary-valued inputs, outputs, or weights can also be investigated using this formalism; however, we do not address such networks here.

2.2 Radial Basis Function Networks. We focus on RBFNs, which were introduced by Broomhead and Lowe (1988) and have a strong theoretical basis in interpolation theory (see, for example, Powell 1985, 1990). A convenient way in which to write the function computed by a neural network with a single hidden layer of k units is

f(x, w) = \alpha_0 + \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i),

where w is the vector of all variable weights,

w^T = [\, \alpha_0 \;\; \cdots \;\; \alpha_k \;\; \beta_1^T \;\; \cdots \;\; \beta_k^T \,],

and φ(·) is called a basis function. In this article we assume that α_0 = 0. For the case of RBFNs, we use φ(x, β_i) = φ(σ_i, ‖x − u_i‖), where the u_i are called the centers of the basis functions, and ‖·‖ is a norm that in this article is always the Euclidean norm. Several forms are available for the function φ; however, we will use the most popular: the gaussian basis function,

\phi(x, \beta_i) = \exp\left[-\frac{1}{\sigma_i^2} \|x - u_i\|^2\right].   (2.1)
The vectors β_i of parameters can be written β_i^T = [\,\sigma_i \;\; u_i^T\,]. For gaussian basis functions, σ_i controls the width of the ith basis function. In applying these networks in their full generality, it is usual to adapt all the available parameters in response to training examples. In this case we have q = k(p + 2) variable parameters in total. However, good results can also be obtained using simplified networks in which only the parameters α_i
are adapted, and hence q = k. In the latter case we deal with networks that compute functions

f(x, \alpha) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \exp\left[-\frac{1}{\sigma_i^2} \|x - u_i\|^2\right],

where the u_i are fixed and distinct, the σ_i are fixed, and α^T = [α_1 α_2 ··· α_k] is the vector of adaptable weights. For our purposes, we assume that when only weights in α are adapted, the centers and the σ_i parameters are chosen without reference to the training examples. For example, centers can be chosen randomly but cannot be chosen using clustering techniques (Moody & Darken 1989) or such that they coincide with a subset of the training examples (Broomhead & Lowe 1988). An important observation can be made at this point. When only the parameters in α are adaptable, the networks of interest operate by mapping input vectors of dimension p to a new space of dimension k, and mapping the resulting vectors to an output using a linear network having parameter vector α. Many results are available for such linear networks (Watkin et al. 1993; Seung et al. 1992); however, such results usually apply to networks having gaussian-distributed inputs, whereas in this case such inputs are clearly non-gaussian, as the outputs of the basis functions of equation 2.1 are strictly positive. Some of the later results presented here can therefore be regarded as an extension of earlier results to certain situations where inputs are not gaussian-distributed.

2.3 Target Rules and the Learning Problem. We study the supervised learning of a rule from a sequence of P noise-free training examples. We let P = ψq where q is the number of variable parameters used by the network, and we assume that training examples are generated using a target rule f_T: R^p → R. The training sequence T_P is generated by drawing P inputs independently at random according to a fixed probability density p(x) and forming T_P = ((x_1, f_T(x_1)), . . . , (x_P, f_T(x_P))). Throughout this article we assume that p(x) is the gaussian density function,

p(x) = \frac{1}{(2\pi)^{p/2}} \exp\left[-\frac{1}{2} x^T x\right].   (2.2)
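For concreteness, the restricted network and this input model can be written out directly. The following minimal Python sketch (array shapes, seeds, and values are illustrative assumptions, not taken from the article) evaluates f(x, α) for fixed, randomly chosen centers and widths:

```python
import numpy as np

def f(x, alpha, U, sigma):
    # f(x, alpha) = (1/sqrt(k)) * sum_i alpha_i * exp(-||x - u_i||^2 / sigma_i^2)
    d2 = np.sum((U - x) ** 2, axis=1)
    return alpha @ np.exp(-d2 / sigma ** 2) / np.sqrt(len(alpha))

rng = np.random.default_rng(0)
p, k = 3, 4
U = rng.standard_normal((k, p))   # fixed centers, chosen without reference to data
sigma = np.ones(k)                # fixed widths
alpha = rng.standard_normal(k)    # the adaptable weights
x = rng.standard_normal(p)        # an input drawn from p(x) of equation 2.2
print(f(x, alpha, U, sigma))
```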
We consider two types of target rule. In the first, f_T has the same form as the RBFN of interest, and the function f_T therefore has the form

f_T(x) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \bar{\alpha}_i \exp\left[-\frac{1}{\bar{\sigma}_i^2} \|x - \bar{u}_i\|^2\right],
where \bar{\alpha}_i, \bar{\sigma}_i, and \bar{u}_i for i = 1, 2, . . . , k are fixed parameters that define f_T. Overbars are used in this manner throughout the article to denote parameters associated with the target rule. Note that if \bar{u}_i = u_i and \bar{\sigma}_i = σ_i for i = 1, 2, . . . , k, then the target rule is realizable; if the stated equalities do not hold, then this may not be the case. The second type of target rule considered is a simple linear rule,

f_T(x) = \frac{1}{\sqrt{p}} \gamma^T x,

where γ ∈ R^p is a fixed vector of parameters.

2.4 Statistical Physics. We begin with a few definitions. The error ε(x, w) of a network on a given input is, as usual, a measure of the distance between its output and the output of the target rule for the corresponding input,

\epsilon(x, w) = \frac{1}{2} [f(x, w) - f_T(x)]^2.
Further, we use the usual measure for the training error, E(w, T_P), defined as

E(w, T_P) = \sum_{i=1}^{P} \epsilon(x_i, w)

for the x_i in the training sequence T_P. We denote by

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx   (2.3)

the expected error of the network for a given weight vector, which is called the generalization error for that weight vector. The calculation of this quantity forms the first, crucial step in obtaining expressions for learning curves using this formalism, and its calculation for the target rules of interest is the subject of Section 3. A central assumption in much of the treatment of learning curves using statistical physics is that the network is trained using a noisy dynamics leading at equilibrium to the Gibbs density (Watkin et al. 1993; Seung et al. 1992; Levin et al. 1990),

p_G(w) = \frac{1}{Z}\, p(w) \exp[-\beta E(w, T_P)],

where β = 1/T; T denotes the temperature; p(w) is a prior density on the weights; and the relevant partition function is

Z = \int p(w) \exp[-\beta E(w, T_P)]\, dw.
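A minimal sketch of such a noisy training dynamics follows, under the standard assumption that a discretized Langevin update leaves the Gibbs density (here with a flat prior) approximately stationary; the step size, toy error function, and iteration counts are illustrative choices, not the article's:

```python
import numpy as np

def langevin_step(w, grad_E, beta, eta, rng):
    # Noisy gradient descent on E(w, T_P); for small eta the chain samples
    # approximately from the Gibbs density proportional to exp(-beta * E(w)).
    noise = rng.standard_normal(w.shape)
    return w - eta * grad_E(w) + np.sqrt(2 * eta / beta) * noise

# Toy quadratic error E(w) = w^2 / 2, so the Gibbs density is gaussian
# with variance 1/beta; the empirical variance should approach that.
rng = np.random.default_rng(0)
w, beta, eta = np.zeros(1), 2.0, 1e-2
samples = []
for t in range(20_000):
    w = langevin_step(w, lambda v: v, beta, eta, rng)
    samples.append(w[0])
print(np.var(samples[5000:]))   # ~1/beta = 0.5 after burn-in
```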
The training dynamics can be considered intuitively to be a form of noisy gradient descent, and the temperature, in this context, denotes the amount of noise present in the training dynamics. It is usual in this type of analysis to use the prior p(w) to impose constraints on the ranges of the weights. We will not use this approach here, but this point is discussed in more detail in Section 4. Averages with respect to the Gibbs density, referred to as thermal averages, are as usual denoted by ⟨·⟩_T, and similarly, quenched averages over P-element training sequences are denoted by ⟨⟨·⟩⟩, such that the quenched average of a random variable r is

⟨⟨r(x_1, . . . , x_P)⟩⟩ = \int r(x_1, . . . , x_P)\, p(x_1) \cdots p(x_P)\, dx_1 \cdots dx_P.   (2.4)

The primary quantity of interest in this article is the expected generalization error,

\epsilon_g(P, T) = ⟨⟨ ⟨\epsilon(w)⟩_T ⟩⟩.

Plots of ε_g(P, T) against P for fixed T are called the learning curves for the network of interest. We will also be interested in the expected training error,

\epsilon_t(P, T) = ⟨⟨ ⟨P^{-1} E(w, T_P)⟩_T ⟩⟩.

It is straightforward to verify that

\epsilon_t(P, T) = P^{-1} \frac{\partial(\beta F)}{\partial \beta},   (2.5)

where

F = -T ⟨⟨\ln Z⟩⟩   (2.6)

is the free energy.

2.5 The High-Temperature Limit. We will use a simplified version of the full theory known as the high-temperature limit (Seung et al. 1992). In the high-T limit, we let T → ∞ and ψ → ∞ while ψβ = constant. In this case the training error E(w, T_P) can be replaced by its average value ⟨⟨E(w, T_P)⟩⟩, and the expected training and generalization errors become equal. Consequently, the partition function takes the form

Z_0 = \int p(w) \exp[-q\psi\beta\, \epsilon(w)]\, dw,   (2.7)

and it is straightforward to show using equations 2.5 and 2.6 that

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta}.   (2.8)
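Equations 2.7 and 2.8 can be checked numerically on a toy one-parameter model. In the sketch below, the quadratic ε(w), the grid, and all constants are assumptions made purely for illustration; the result anticipates the ε_0 + 1/(2ψβ) behavior derived in Section 4:

```python
import numpy as np

eps0, w0, q, psi = 0.1, 0.3, 1, 50.0
P = psi * q
w = np.linspace(-5.0, 5.0, 20001)
dw = w[1] - w[0]
eps = eps0 + (w - w0) ** 2          # toy generalization error epsilon(w)

def ln_Z0(beta):
    e = -q * psi * beta * eps       # exponent of equation 2.7, flat prior
    return np.log(np.sum(np.exp(e - e.max())) * dw) + e.max()

beta, h = 1.0, 1e-4                 # finite difference for equation 2.8
eps_g = -(ln_Z0(beta + h) - ln_Z0(beta - h)) / (2 * h) / P
print(eps_g, eps0 + 1 / (2 * psi * beta))   # both ~0.11
```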
The high-T limit has been used in the analysis of other types of networks by Schwarze et al. (1992) and Sompolinsky et al. (1990).

2.6 Outline of the Remainder of This Article. The remainder of this article is structured as follows. Note that equations 2.3, 2.7, and 2.8 tell us exactly how to calculate ε_g(P, T), the quantity of interest. The first step is to calculate ε(w) using equation 2.3. This is the aim of Section 3: Section 3.1 calculates ε(w) for the case of nonlinear target rules, and Section 3.2 calculates ε(w) for the case of linear target rules. The partition function Z_0 can then be obtained by calculating the integral given in equation 2.7, and finally ε_g(P, T) can be obtained by differentiating ln Z_0 (equation 2.8). These steps are performed in Section 4.

3 The Generalization Error ε(w)

A central quantity in the analysis of neural networks using this approach, and the first thing that we must calculate, is the generalization error ε(w), defined in equation 2.3. This quantity can be derived for the networks and target rules of interest for cases in which all parameters in the network are variable (u_i and σ_i are not assumed to be fixed for i = 1, 2, . . . , k as described above), assuming that the density p(x) is the gaussian density of equation 2.2. Although later we specialize to the case of more restricted networks, we include the fully general derivation here as it is likely to prove useful in further work on the statistical physics of these networks. In the following sections we use some results for integrals having particular forms. These results are summarized in Appendix A.

3.1 Nonlinear Target Rules. The network of interest computes the function

f(x, w) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i),

where φ is the gaussian basis function of equation 2.1. In this case the target rule is

f_T(x) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \bar{\alpha}_i \phi(x, \bar{\beta}_i)

for fixed, unknown weights \bar{\alpha}_i, \bar{\beta}_i where i = 1, 2, . . . , k, and we can therefore write the error ε(x, w) of a network with weight vector w^T = [α^T β_1^T ··· β_k^T] and inputs x as

\epsilon(x, w) = \frac{1}{2}\left[f(x, w) - f_T(x)\right]^2 = \frac{1}{2k}\left[\sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) - \sum_{i=1}^{k} \bar{\alpha}_i \phi(x, \bar{\beta}_i)\right]^2.

Note that in the interest of generality, we regard for the time being the vectors β_i as containing variable weights. Introducing the notation \sum_{ij} \equiv \sum_{i=1}^{k}\sum_{j=1}^{k}, we can write

\epsilon(x, w) = \frac{1}{2k}\left[\sum_{ij} \alpha_i \alpha_j\, \phi(x, \beta_i)\phi(x, \beta_j) + \sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, \phi(x, \bar{\beta}_i)\phi(x, \bar{\beta}_j) - 2\sum_{ij} \alpha_i \bar{\alpha}_j\, \phi(x, \beta_i)\phi(x, \bar{\beta}_j)\right]

and, hence,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j \int \phi(x, \beta_i)\phi(x, \beta_j)\, p(x)\, dx + \frac{1}{2k}\sum_{ij} \bar{\alpha}_i \bar{\alpha}_j \int \phi(x, \bar{\beta}_i)\phi(x, \bar{\beta}_j)\, p(x)\, dx - \frac{1}{k}\sum_{ij} \alpha_i \bar{\alpha}_j \int \phi(x, \beta_i)\phi(x, \bar{\beta}_j)\, p(x)\, dx,   (3.1)
and we therefore need to evaluate integrals of the form

I(a, b) = \int \phi(x, a)\phi(x, b)\, p(x)\, dx,   (3.2)

for a, b ∈ R^{p+1}. We assume that the gaussian basis function,

\phi(x, a) = \exp\left[-\frac{1}{\sigma_a^2} \|x - u_a\|^2\right] = \exp\left[-\frac{1}{\sigma_a^2} (x - u_a)^T (x - u_a)\right],   (3.3)

is used, where u_a is the center of the basis function and σ_a controls its width. The parameter vectors a and b are written as a^T = [σ_a u_a^T] and b^T = [σ_b u_b^T].
Inserting equations 2.2 and 3.3 in equation 3.2, we obtain

I(a, b) = \frac{1}{(2\pi)^{p/2}} \int \exp\left[-\frac{1}{2}\left(\frac{2}{\sigma_a^2}(x - u_a)^T(x - u_a) + \frac{2}{\sigma_b^2}(x - u_b)^T(x - u_b) + x^T x\right)\right] dx
= \frac{1}{(2\pi)^{p/2}} \int \exp\left[-\frac{1}{2}\left(x^T x\left(\frac{2}{\sigma_a^2} + \frac{2}{\sigma_b^2} + 1\right) + x^T\left(-\frac{4}{\sigma_a^2} u_a - \frac{4}{\sigma_b^2} u_b\right) + \frac{2}{\sigma_a^2} u_a^T u_a + \frac{2}{\sigma_b^2} u_b^T u_b\right)\right] dx.

We now apply the identity stated in equation A.1 in order to obtain

I(a, b) = \frac{1}{\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)^{p/2}} \exp\left[-\frac{1}{2}\left(2\left(\frac{1}{\sigma_a^2} - \frac{2}{\sigma_a^4\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\right) u_a^T u_a + 2\left(\frac{1}{\sigma_b^2} - \frac{2}{\sigma_b^4\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\right) u_b^T u_b - \frac{8}{\sigma_a^2 \sigma_b^2\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\, u_a^T u_b\right)\right].   (3.4)
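As a sanity check on equation 3.4, the closed form can be compared against a Monte Carlo estimate of equation 3.2 over x drawn from the density of equation 2.2; the dimensions, parameter values, and sample size below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
ua, ub = rng.standard_normal(p), rng.standard_normal(p)
sa, sb = 1.3, 0.8

def phi(x, u, s):                        # gaussian basis, equation 3.3
    return np.exp(-np.sum((x - u) ** 2, axis=-1) / s ** 2)

x = rng.standard_normal((200_000, p))    # inputs from p(x), equation 2.2
mc = (phi(x, ua, sa) * phi(x, ub, sb)).mean()

S = 1 + 2 / sa**2 + 2 / sb**2            # 1 + 2/sigma_a^2 + 2/sigma_b^2
closed = S ** (-p / 2) * np.exp(-0.5 * (
    (2 / sa**2 - 4 / (sa**4 * S)) * ua @ ua
  + (2 / sb**2 - 4 / (sb**4 * S)) * ub @ ub
  - (8 / (sa**2 * sb**2 * S)) * ua @ ub))
print(mc, closed)                        # agree to Monte Carlo error
```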
Simplifying equations 3.1 and 3.4, we obtain the final result,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, I(u_i, u_j, \sigma_i, \sigma_j) + \frac{1}{2k}\sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, I(\bar{u}_i, \bar{u}_j, \bar{\sigma}_i, \bar{\sigma}_j) - \frac{1}{k}\sum_{ij} \alpha_i \bar{\alpha}_j\, I(u_i, \bar{u}_j, \sigma_i, \bar{\sigma}_j),

where

I(u_i, u_j, \sigma_i, \sigma_j) = (\sigma')^{-p/2} \exp\left[\frac{4}{\sigma_i^2 \sigma_j^2 \sigma'}\, u_i^T u_j - \left(\frac{1}{\sigma_i^2} - \frac{2}{\sigma_i^4 \sigma'}\right) u_i^T u_i - \left(\frac{1}{\sigma_j^2} - \frac{2}{\sigma_j^4 \sigma'}\right) u_j^T u_j\right]

and

\sigma' = 1 + \frac{2}{\sigma_i^2} + \frac{2}{\sigma_j^2}.
This is a general expression that applies when all the parameters in the network are regarded as variable weights. Consider now the restricted case in which only the parameters in α are considered to be variable weights. Let M be the symmetric matrix having elements M_{ij} = I(u_i, u_j, σ_i, σ_j). Also, let N be the matrix having elements N_{ij} = I(u_i, \bar{u}_j, σ_i, \bar{σ}_j), and let \bar{α}^T = [\bar{α}_1 \bar{α}_2 ··· \bar{α}_k]. Then we can write

\epsilon(\alpha) = \frac{1}{2k} \alpha^T M \alpha - \frac{1}{k} \alpha^T n + \frac{c_1}{k},   (3.5)

where n = N\bar{\alpha} and

c_1 = \frac{1}{2} \sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, I(\bar{u}_i, \bar{u}_j, \bar{\sigma}_i, \bar{\sigma}_j).

The weight vector α^0 that minimizes ε(α) can now be found by differentiating equation 3.5. We require that

\frac{\partial \epsilon(\alpha)}{\partial \alpha} = \frac{1}{k}\left[M\alpha - n\right] = 0,

and so

\alpha^0 = M^{-1} n,

where we assume that the matrix M has an inverse. Consequently, the minimum value for ε(α) is

\epsilon_0 = \epsilon(\alpha^0) = \frac{1}{k}\left[c_1 - \frac{1}{2} n^T M^{-1} n\right].   (3.6)

3.2 Linear Target Rules. Having obtained expressions for the generalization error of an RBFN learning a similar RBFN, we now address the case of an RBFN learning a linear target rule. In this case, the target rule is

f_T(x) = \frac{1}{\sqrt{p}} \gamma^T x,

where γ is a fixed, unknown vector of parameters. We therefore have

\epsilon(x, w) = \frac{1}{2}\left[\frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) - \frac{1}{\sqrt{p}} \gamma^T x\right]^2 = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, \phi(x, \beta_i)\phi(x, \beta_j) - \frac{1}{(kp)^{1/2}}\, \gamma^T x \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) + \frac{1}{2p}\, x^T \gamma \gamma^T x.
This leads to

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j \int \phi(x, \beta_i)\phi(x, \beta_j)\, p(x)\, dx + \frac{1}{2p}\int x^T \gamma \gamma^T x\, p(x)\, dx - \frac{1}{(kp)^{1/2}} \sum_{i=1}^{k} \alpha_i \int \gamma^T x\, \phi(x, \beta_i)\, p(x)\, dx.

For gaussian basis functions and gaussian-distributed inputs, the first term in this expression is exactly as derived above. The second term can be evaluated using equation A.4, and is given by

\frac{1}{2p}\int x^T \gamma \gamma^T x\, p(x)\, dx = \frac{1}{2p} \gamma^T \gamma.

The integral in the final term can be evaluated using equation A.2, and is given by

\int \gamma^T x\, \phi(x, \beta_i)\, p(x)\, dx = \frac{2\, \gamma^T u_i}{\left((2/\sigma_i^2) + 1\right)^{p/2} (2 + \sigma_i^2)} \exp\left[\left(\frac{2}{2\sigma_i^2 + \sigma_i^4} - \frac{1}{\sigma_i^2}\right) u_i^T u_i\right].

For linear target rules, we therefore obtain the final result,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, I(u_i, u_j, \sigma_i, \sigma_j) - \frac{1}{(kp)^{1/2}} \sum_{i=1}^{k} \alpha_i\, J(u_i, \sigma_i, \gamma) + \frac{c_2}{p},   (3.7)

where I(u_i, u_j, σ_i, σ_j) is as defined above,

J(u_i, \sigma_i, \gamma) = \frac{2\, \gamma^T u_i}{\left((2/\sigma_i^2) + 1\right)^{p/2} (2 + \sigma_i^2)} \exp\left[\left(\frac{2}{2\sigma_i^2 + \sigma_i^4} - \frac{1}{\sigma_i^2}\right) u_i^T u_i\right],

and

c_2 = \frac{1}{2} \gamma^T \gamma.
Equation 3.7 again applies when all parameters, including centers and spread parameters, are variable. Let o be the vector having elements o_i = J(u_i, σ_i, γ). Then in the restricted case, when only the parameters in α are considered to be variable weights, we can write

\epsilon(\alpha) = \frac{1}{2k} \alpha^T M \alpha - \frac{1}{(kp)^{1/2}} \alpha^T o + \frac{c_2}{p}.

Consequently, we obtain

\frac{\partial \epsilon(\alpha)}{\partial \alpha} = \frac{1}{k} M \alpha - \frac{1}{(kp)^{1/2}}\, o,

such that

\alpha^0 = \frac{k}{(kp)^{1/2}} M^{-1} o

and

\epsilon_0 = \epsilon(\alpha^0) = \frac{1}{p}\left[c_2 - \frac{1}{2}\, o^T M^{-1} o\right].
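Both restricted-case optima reduce to solving a k × k linear system. The sketch below instantiates the nonlinear-target case, α^0 = M^{-1} n and equation 3.6; the helper and variable names are our own, and M is assumed invertible:

```python
import numpy as np

def I_pair(ui, uj, si, sj, p):
    # The integral I(u_i, u_j, sigma_i, sigma_j) in the closed form above.
    S = 1 + 2 / si**2 + 2 / sj**2
    return S ** (-p / 2) * np.exp(
        4 / (si**2 * sj**2 * S) * ui @ uj
        - (1 / si**2 - 2 / (si**4 * S)) * ui @ ui
        - (1 / sj**2 - 2 / (sj**4 * S)) * uj @ uj)

def optimal_weights(U, sig, Ubar, sigbar, abar):
    k, p = U.shape
    gram = lambda A, sa, B, sb: np.array(
        [[I_pair(A[i], B[j], sa[i], sb[j], p) for j in range(k)]
         for i in range(k)])
    M = gram(U, sig, U, sig)                 # M_ij = I(u_i, u_j, sigma_i, sigma_j)
    n = gram(U, sig, Ubar, sigbar) @ abar    # n = N alpha-bar
    alpha0 = np.linalg.solve(M, n)           # alpha^0 = M^{-1} n
    c1 = 0.5 * abar @ gram(Ubar, sigbar, Ubar, sigbar) @ abar
    eps0 = (c1 - 0.5 * n @ alpha0) / k       # equation 3.6
    return alpha0, eps0
```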
4 Learning Curves

In this section we consider the learning curves for RBFNs learning the target rules considered in the previous section in the high-temperature limit. For this limit we have the partition function

Z_0 = \int p(\alpha) \exp[-k\psi\beta\, \epsilon(\alpha)]\, d\alpha,

as discussed above, and the quantity of interest is

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta}

(equation 2.8). In order to calculate Z_0, we need to specify a prior density p(α) for the weight vectors α. The usual approach here is to use this prior density to constrain the ranges of the weights, and a density is usually specified that forces weight vectors to have a specified length (see, for example, Seung et al. 1992). This is not the approach that is generally applied in practical network design, and accordingly we will use a gaussian prior,

p(\alpha) = (2\pi)^{-k/2} \sigma_\alpha^{-k} \exp\left[-\frac{1}{2\sigma_\alpha^2} \alpha^T \alpha\right].   (4.1)
We begin by considering nonlinear target rules. In this case we have

Z_0 = (2\pi)^{-k/2} \sigma_\alpha^{-k} \int \exp\left[-\psi\beta\left(\frac{1}{2}\alpha^T M \alpha - \alpha^T n + c_1\right) - \frac{1}{2\sigma_\alpha^2}\alpha^T \alpha\right] d\alpha = (2\pi)^{-k/2} \sigma_\alpha^{-k} \int \exp\left[-\frac{1}{2}\left(\alpha^T Q \alpha - 2\psi\beta\, \alpha^T n + 2\psi\beta c_1\right)\right] d\alpha,

where

Q = \psi\beta M + \sigma_\alpha^{-2} I.   (4.2)

Applying the standard integral identity given in equation A.1 and rearranging, we obtain

Z_0 = \sigma_\alpha^{-k} |Q|^{-1/2} \exp\left[-\frac{1}{2}\left(2\psi\beta c_1 - \psi^2\beta^2 n^T Q^{-1} n\right)\right],

where we assume that the matrix Q has an inverse. The learning curve for the high-T limit is now given by

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta} = -\frac{1}{P}\left[\frac{1}{2}\frac{\partial}{\partial \beta}\left(\psi^2\beta^2 n^T Q^{-1} n\right) - \frac{1}{2}\frac{\partial}{\partial \beta}\ln|Q| - \psi c_1\right].   (4.3)

At this point we need to differentiate expressions containing the inverse and determinant of the matrix Q with respect to β. It is possible to do this for Q as defined in equation 4.2, and the relevant techniques are presented in Appendix B. However, the full expression obtained for ε_g(P, T) is rather difficult to interpret, and a significantly more enlightening expression can be obtained if we consider the case in which σ_α is large (in effect, the case in which we place no constraints on the ranges of the weights), such that

Q \simeq \psi\beta M,

and hence

\epsilon_g(P, T) = -\frac{1}{P}\left[\frac{1}{2}\psi\, n^T M^{-1} n - \psi c_1 - \frac{k}{2\beta}\right] = \frac{1}{k}\left[c_1 - \frac{1}{2} n^T M^{-1} n\right] + \frac{1}{2\psi\beta}.

Comparing this expression with equation 3.6, we obtain the final result,

\epsilon_g(P, T) = \epsilon_0 + \frac{1}{2\psi\beta}.   (4.4)
For the case of an RBFN learning a linear target rule, a calculation exactly analogous to that given for nonlinear target rules leads again to equation 4.4. The learning curve described by equation 4.4 is similar to the learning curve for a standard linear perceptron. To some extent, this might be expected, as when the basis functions are fixed, training corresponds to the training of a linear perceptron in a nonlinearly "extended" space. (We are in fact training a linear perceptron using a set of inputs corresponding to the outputs of the basis functions; see, for example, Holden and Rayner 1995.) However, it is interesting in the light of the following facts: First, the network as a whole is nonlinear. Second, as mentioned in Subsection 2.2, the result can be regarded as an extension of earlier results to certain situations where inputs are not gaussian-distributed. (See also Subsection 5.2 below.)

5 Discussion

5.1 Using the Results in Practice. The initial aim of this work was to extend the use of techniques from statistical physics to networks of a type that is often applied in practice. Until Section 4, our analysis applied to completely general RBFNs for which all available parameters are adapted during training. The results obtained are therefore potentially useful in any further analysis of RBFNs using this formalism. In order to obtain actual learning curves, it was necessary in Section 4 to restrict our area of interest to more limited RBFNs, for which only the parameters of the vector α are adapted during training. It was also necessary to use the high-temperature limit. These issues are now discussed, with an emphasis on the way in which they affect the practical applicability of the results obtained.

5.1.1 The Use of Fixed Basis Functions. It would clearly be desirable to extend these results to the case in which basis function parameters are chosen with reference to the available training examples, as this is in general the case in practice. At present, the need to make this restriction limits to some extent the practical applicability of the results, although we can reasonably expect that they will be applicable to some problems. Note that this restriction is also used in related work in this area (Freeman & Saad 1995) that is discussed below.

5.1.2 The High-Temperature Limit. In using the high-temperature limit, we obtain relatively fast training but at the expense of requiring a large number of training examples. (By increasing ψ, we make the effective temperature T/ψ small. See Sompolinsky et al. 1990 for a more detailed discussion of this issue.) The results obtained are therefore most suited to situations where the number of training examples available is relatively large. Note, however, that in using the high-temperature limit, we are in fact using an approximation, rather than an exact analysis, that can fail in some
circumstances. In particular, it should be emphasized that the main results of this article require that T be large and that ψ, and hence the number of training examples, be large. (See also the comments on experimental work below.) The high-temperature limit applies when T → ∞, ψ → ∞, and ψβ is constant, and the question of how the results of Section 4 should be interpreted is therefore of interest. In Sompolinsky et al. (1990) and Schwarze et al. (1992), the high-temperature limit is used to analyze perceptrons and committee machines, and expected generalization error is plotted against the value ψβ. Experiments that are performed show that reasonable agreement with the theoretical results for the high-temperature limit can be obtained using a fixed temperature of T = 5. We can therefore consider a fixed, high T and rewrite equation 4.4 as

\epsilon_g(P, T) = \epsilon_0 + \frac{Tq}{2P}.   (5.1)
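Equation 5.1 is straightforward to evaluate; the sketch below, with purely illustrative values of T, q, and ε_0, shows the 1/P decay toward ε_0:

```python
def eps_g(P, T=5.0, q=20, eps0=0.05):
    # High-temperature learning curve, equation 5.1 (illustrative values).
    return eps0 + T * q / (2.0 * P)

for P in (100, 1000, 10000):
    print(P, eps_g(P))   # decays toward eps0 as P grows
```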
Equation 5.1 now shows us the manner in which ε_g(P, T) decreases, approaching its minimum value ε_0, as the number P of training examples is increased.¹

¹ Clearly this requires the assumption that taking a sufficiently high value of T is sufficient to allow practical results in reasonable agreement with the theory to be obtained, as was the case in the studies cited. This remains a subject for future work; however, there appears to be no reason to suppose that it is not the case.

5.1.3 The Large σ_α Assumption. In deriving the learning curve in equation 4.4, it was assumed that σ_α is large in order to simplify the form of the matrix Q. This is equivalent to assuming that no constraints are being placed on the ranges of the weights in α and consequently does not appear to affect significantly the practical applicability of our results, as weights may in practice be constrained only by the limits of the hardware on which a network is implemented. Should this assumption prove to be untenable, the more general expression for the learning curve (for general values of σ_α) can be obtained as demonstrated in Appendix B.

5.2 Related Work. Freeman and Saad (1995) have studied RBFNs within a similar framework to that used here, with similar initial aims. The most significant difference between their approach and ours is in the manner in which quenched averages are dealt with. Averages of the form of equation 2.4 can present significant difficulties in that it is not always possible to evaluate the relevant integral. In this article, this problem is dealt with as a consequence of studying the high-temperature limit, which removes the need to deal explicitly with the quenched average. Freeman and Saad, on
the other hand, address the problem by introducing some assumptions regarding the basis functions, along with a further approximation. Neither the assumptions nor the approximation is required in the approach used here. Note also that Freeman and Saad's work shares with ours the assumption that basis functions are fixed, as discussed above. Many other approaches based on techniques from statistical physics have been developed for the analysis of connectionist networks. It may be possible to obtain some of the results presented here more easily, and perhaps with greater generality, using such alternative approaches, in particular the "smooth networks" approach (see, for example, Sompolinsky et al. 1990 and Seung et al. 1992 for information regarding smooth networks and other relevant work).
5.3 Future Research. Areas for possible future research include the extension of our results to fully general RBFNs and to cases in which the training data are noisy, and experimental studies to investigate the high-temperature limit, as suggested above. Experimental studies would also be useful in clarifying the precise circumstances in which the high-temperature limit is useful in practice. A further area for research is the analysis of RBFNs using more sophisticated and general techniques, such as the annealed approximation and replica methods (see Seung et al. 1992).
6 Conclusion

Statistical physics has made significant contributions to the study of learning curves for feedforward neural networks; however, much of the research to date applies to networks that are not used in practice to any significant extent. We have used the formalism provided by statistical physics to investigate the learning curves for gaussian radial basis function networks, which have been used in many practical applications with excellent results. We have derived general expressions for the generalization error of a network with a given set of parameters learning nonlinear and linear, realizable and unrealizable rules in the absence of noise. We then obtained learning curves for a more restricted class of networks trained using a stochastic algorithm in the high-temperature limit. The assumption that basis functions are fixed, which was required in order to obtain our learning curves, limits to some extent the practical applicability of these results, and an important area for further study is the investigation of RBFNs in which the basis functions are not fixed. Three further possible areas for future research are the experimental investigation of the results obtained using the high-temperature limit, the analysis of fully general networks in the high-temperature limit, and the analysis of RBFNs using more sophisticated techniques, such as the annealed approximation and the replica methods.
Appendix A: Useful Integrals

We make extensive use in this article of the standard identity

\int \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx = (2\pi)^{p/2} |X|^{-1/2} \exp\left[-\frac{1}{2}\left(z - \frac{y^T X^{-1} y}{4}\right)\right],   (A.1)

where X is a constant, symmetric matrix; y and z are constants; and p is the dimension of x. A derivation of this result can be found in Godsill (1993). We also make use of the identity

I = \int a^T x \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx = -\frac{1}{2} (2\pi)^{p/2} |X|^{-1/2} \left(a^T X^{-1} y\right) \exp\left[-\frac{1}{2}\left(z - \frac{y^T X^{-1} y}{4}\right)\right],   (A.2)

where X is again a symmetric matrix and a is a further constant vector. This is not, as far as we are aware, a standard result. It can be derived from equation A.1 using Leibnitz's rule (see Kaplan 1981) by noting that

I = \sum_{i=1}^{p} a_i \int x_i \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx

and

\frac{\partial}{\partial y_i}\left\{\exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right]\right\} = -\frac{1}{2}\, x_i \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right].   (A.3)

Finally, we use the identity

\int x^T X x \exp\left[-x^T Y x\right] dx = \frac{\pi^{p/2}}{|Y|^{1/2}}\, \frac{\mathrm{Trace}(Y^{-1} X)}{2},   (A.4)

where Y is constant and nonsingular. This result was derived by Ó Ruanaidh and Fitzgerald (1993).
Sean B. Holden and Mahesan Niranjan
Appendix B: Obtaining the Full Expression for ²g (P, T) Consider again the equation ²g (P, T) = −
· ¸ 1 1 ∂ ³ 2 2 T −1 ´ 1 ∂ ψ β n Q n − ln | Q | −ψc1 , P 2 ∂β 2 ∂β
(B.1)
(equation 4.3), which provides an expression for ²g (P, T) that can be used to obtain the learning curve for the high-T limit for general values of σα . In this case we have, Q = ψβM + σα−2 I (equation 4.2) and the main task is to deal with the two partial differentiations. This can be done using the results, ∂ ln |Q| = (Q−1 )ij ∂(Q)ij
(B.2)
and ∂Q−1 = ∂(Q)ij
∂(Q−1 )11 ∂(Q)ij
.. .
−1
∂(Q )k1 ∂(Q)ij
··· .. .
∂(Q−1 )1k ∂(Q)ij
···
∂(Q )kk ∂(Q)ij
.. .
−1
= −Q−1 1ij Q−1 ,
(B.3)
where for any matrix A, (A)ij denotes the element at the ith row and jth column of A, and 1ij denotes the matrix having all elements zero with the exception that (1ij )ij = 1. (See Rogers 1980 for results of this type. It is assumed that relevant conditions such as the existence of Q−1 etc. are met.) Now, X ∂ ln |Q| ∂(Q)ij ∂ ln |Q| = ∂β ∂(Q)ij ∂β ij X (Q−1 )ij (M)ij . =ψ
(B.4)
ij
The remaining expression can also be differentiated as follows: ∂ h 2 2i ∂ h T −1 i ∂ h 2 2 T −1 i ψ β n Q n = ψ 2β 2 n Q n + nT Q−1 n ψ β ∂β ∂β ∂β X 2 2 ∂ −1 ni nj (Q )ij + 2ψ 2 βnT Q−1 n =ψ β ∂β ij
Average-Case Learning Curves for Radial Basis Function Networks
= ψ 2β 2
X
ni nj
ij
459
X ∂(Q−1 )ij ∂(Q)rs ∂(Q)rs ∂β rs
+2ψ βnT Q−1 n X X 2 2 −1 −1 ni nj (−Q 1rs Q )ij ψ(M)rs =ψ β 2
rs
ij 2
T
−1
+2ψ βn Q n . (B.5) Substituting equations B.4 and B.5 into equation B.1 gives the full expression for ²g (P, T). These techniques also allow ²g (P, T) to be analyzed for general values of σα for the case of linear target rules. Acknowledgments This research was supported by Science and Engineering Research Council research grant GR/H16759. We thank Marc M´ezard and Sebastian Seung for their useful comments on this work, and we also thank the anonymous referees for several helpful suggestions. In particular, we thank a referee for pointing out equations B.2 and B.3 to us. References Amit, Daniel J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press. Anthony, M., & Biggs, N. (1992). Computational learning theory. Cambridge Tracts in Theoretical Computer Science. Cambridge: Cambridge University Press. Anthony, M., & Holden, S. B. (1994). Quantifying generalization in linearly weighted neural networks. Complex Systems, 8, 91–114. Barkai, E., Hansel, D., & Sompolinsky, H. (1992, March). Broken symmetries in multilayered perceptrons. Physical Review A, 45, 4146–4161. Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355. Freeman, J. A. S., & Saad, D. (1995). Learning and generalization in radial basis function networks. Neural Computation, 7, 945–965. Godsill, S. (1993). The restoration of degraded audio signals. Unpublished doctoral dissertation, Cambridge University, Cambridge, U.K. Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Santa Fe Institute Studies in the Sciences of Complexity. AddisonWesley, Reading, MA. Holden, S. B., & Rayner, P. J. W. (1995, March). Generalization and PAC learning: Some new results for the class of generalized single layer networks. IEEE Transactions on Neural Networks, 6(2), 368–380.
Kaplan, W. (1981). Advanced mathematics for engineers. Reading, MA: Addison-Wesley.
Landau, L. D., & Lifshitz, E. M. (1980). Statistical physics: Course of theoretical physics (3rd ed.). New York: Pergamon Press.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1994). Lower bounds on the VC dimension of smoothly parameterized function classes. In Proceedings of the 7th Annual Workshop on Computational Learning Theory (pp. 362–367). New York: ACM Press.
Levin, E., Tishby, N., & Solla, S. A. (1990, October). A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78, 1568–1574.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Niranjan, M., & Fallside, F. (1990, July). Neural networks and radial basis functions in classifying static speech patterns. Computer Speech and Language, 4, 275–289. Also available as Cambridge University Engineering Department Report CUED/F-INFENG/TR 22 (1988).
Powell, M. J. D. (1985, October). Radial basis functions for multivariable interpolation: A review (Tech. Rep. DAMTP 1985/NA12). Cambridge: Department of Applied Mathematics and Theoretical Physics, University of Cambridge.
Powell, M. J. D. (1990, December). The theory of radial basis function approximation in 1990 (Tech. Rep. DAMTP 1990/NA11). Cambridge: Department of Applied Mathematics and Theoretical Physics, University of Cambridge.
Ó Ruanaidh, J. J. K., & Fitzgerald, W. J. (1993). The restoration of digital audio recordings using the Gibbs sampler (Tech. Rep. CUED/F-INFENG/TR.153). Cambridge: Engineering Department, Cambridge University.
Renals, S., & Rohwer, R. (1989, June). Phoneme classification experiments using radial basis functions. In Proceedings of the International Joint Conference on Neural Networks, pp. I-461–I-467.
Rogers, G. S. (1980). Matrix derivatives. Lecture Notes in Statistics, Vol. 2. New York: Marcel Dekker.
Schwarze, H., Opper, M., & Kinzel, W. (1992). Generalization in a two-layer neural network. Physical Review A, 46, R6185–R6188.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992, April). Statistical mechanics of learning from examples. Physical Review A, 45, 6056–6091.
Sompolinsky, H., Tishby, N., & Seung, H. S. (1990). Learning from examples in large neural networks. Physical Review Letters, 65, 1683–1686.
Watkin, T. L. H., Rau, A., & Biehl, M. (1993, April). The statistical mechanics of learning a rule. Rev. Mod. Phys., 65, 499–556.
Communicated by Scott Fahlman
A Sequential Learning Scheme for Function Approximation Using Minimal Radial Basis Function Neural Networks

Lu Yingwei, N. Sundararajan, and P. Saratchandran, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
This article presents a sequential learning algorithm for function approximation and time-series prediction using a minimal radial basis function neural network (RBFNN). The algorithm combines the growth criterion of the resource-allocating network (RAN) of Platt (1991) with a pruning strategy based on the relative contribution of each hidden unit to the overall network output. The resulting network leads toward a minimal topology for the RBFNN. The performance of the algorithm is compared with RAN and the enhanced RAN algorithm of Kadirkamanathan and Niranjan (1993) for the following benchmark problems: (1) hearta from the benchmark problems database PROBEN1, (2) Hermite polynomial, and (3) Mackey-Glass chaotic time series. For these problems, the proposed algorithm is shown to realize RBFNNs with far fewer hidden neurons with better or the same accuracy.

1 Introduction

Radial basis function neural networks (RBFNN) are well suited for function approximation and pattern recognition due to their simple topological structure and their ability to reveal how learning proceeds in an explicit manner (Tao 1993). In the classical approach to radial basis function (RBF) network implementation, the basis functions are usually chosen as gaussian and the number of hidden units (i.e., centers and widths of the gaussian functions) is fixed a priori based on the properties of input data. The weights connecting the hidden and output units are estimated by a linear least squares method, as in the least mean square (LMS) method (Moody & Darken 1989; Musavi et al. 1992). The disadvantage with this approach is that it is not suitable for sequential learning and usually results in too many hidden units (Bors & Gabbouj 1994). A significant contribution that overcomes these drawbacks was made by Platt (1991) through the development of an algorithm that adds hidden units to the network based on the novelty of the new data. The algorithm
is suitable for sequential learning and is based on the idea that the number of hidden units should correspond to the complexity of the underlying function as reflected in the observed data. The resulting network, called a resource-allocating network (RAN), starts with no hidden units and grows by allocating new hidden units based on the novelty in the observations that arrive sequentially. If an observation has no novelty, then the existing parameters of the network are adjusted by an LMS algorithm to fit that observation. An interpretation of Platt's RAN from a function space approach was provided by Kadirkamanathan and Niranjan (1993), who also enhanced its performance by adopting an Extended Kalman Filter (EKF) instead of the LMS method for adjusting network parameters. They called the resulting network a Resource-Allocating Network via Extended Kalman Filter (RANEKF) and showed its superior performance to the RAN for the tasks of function approximation and time-series prediction (Kadirkamanathan & Niranjan 1993). One drawback of both RAN and RANEKF is that once a hidden unit is created, it can never be removed. Because of this, both RAN and RANEKF could produce networks in which some hidden units, although active initially, may subsequently end up contributing little to the network output. If such inactive hidden units can be detected and removed as learning progresses, then a more parsimonious network topology can be realized. This has been the motivation for us to combine pruning with the RANEKF scheme. The following function approximation problem clearly illustrates how RANEKF could produce more hidden units than required.

1.1 Function Approximation Problem. In this problem, the RANEKF has to approximate a nonlinear function (Chen et al. 1993), which is a linear combination of six gaussian functions:

y(x) = \exp\left[-\frac{(x_1 - 0.3)^2 + (x_2 - 0.2)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.7)^2 + (x_2 - 0.2)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.1)^2 + (x_2 - 0.5)^2}{0.02}\right] + \exp\left[-\frac{(x_1 - 0.9)^2 + (x_2 - 0.5)^2}{0.02}\right] + \exp\left[-\frac{(x_1 - 0.3)^2 + (x_2 - 0.8)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.7)^2 + (x_2 - 0.8)^2}{0.01}\right].   (1.1)
(1.1)
Theoretically we can say that to approximate this function, the RANEKF network should need only six hidden neurons. To test this, training samples are randomly chosen in the interval (0, 1) and the observations sequentially presented to RANEKF. Figure 1 shows the results of applying the RANEKF algorithm to this problem. From Figure 1a, we can see the number of hidden neurons in the RANEKF network increasing with the number of observations and the network’s finally settling to 15 hidden units instead of the desired 6 hidden units. Thus the RANEKF overfits and so the network realized by this method is not minimal. Figure 1b shows the desired 6 centers (+) and the 15 centers produced by RANEKF (◦).
A Sequential Learning Scheme for Function Approximation
463
Figure 1: Number and positions of hidden unit centers achieved by RANEKF for the static function approximation problem.
464
Lu Yingwei, N. Sundararajan, and P. Saratchandran
Those hidden units in the network, which are far away from the desired centers, will have a significant output value only when the input is near their center. But for approximating the function of equation 1.1, a zero output is required for all inputs that are far from the desired centers. Therefore, the circles in Figure 1b, which are far from the desired centers, end up making little contribution to the network’s outputs and should be removed from the network. We notice that corresponding to the desired centers (0.3,0.8), (0.7,0.8), (0.9,0.5), and (0.3,0.2), there are two overlapping circles. The two hidden units with the same center will provide no more information than one does, so one of them is superfluous and ought to be pruned. Generally, all those hidden units that contribute little to the network output are superfluous. In this article we propose an algorithm that adopts the basic idea of RANEKF and augments it with a pruning strategy to obtain a minimal RBFNN network. The pruning strategy removes those hidden neurons that consistently make little contribution to the network output and together with an additional growth criterion ensures that the transition in the number of hidden neurons is smooth. Although theoretical proofs showing that the resulting RBFNN topology is in some sense minimal is still under investigation, we have referred it as a minimal RAN (M-RAN) based on detailed simulation studies carried out on a number of benchmark problems (Yingwei et al. 1995). For all those problems, the proposed approach produced a network with fewer hidden neurons than both the RAN and the RANEKF with the same or smaller approximation errors. The article is organized as follows. Section 2 describes the the M-RAN algorithm. Sections 2.1 and 2.2 present the basic strategy for growth and pruning. The performance of M-RAN for the static function approximation problem of equation 1.1 is given in Section 2.3. An additional growth criterion based on the output error over a sliding window is introduced in Section 2.4 to smooth the transitions in the growth and pruning. Section 2.5 gives the final algorithm and its performance for the problem of equation 1.1. In Section 3 the performance of M-RAN is compared with RAN and RANEKF for two static function approximation problems (PROBEN1hearta and Hermite polynomial) and the dynamic Mackey-Glass chaotic time-series prediction problem. Finally, conclusions are summarized in Section 4. 2 Proposed M-RAN Algorithm In this section, a description of the M-RAN algorithm is presented. The notations and symbols used here are the same as those in Kadirkamanathan and Niranjan (1993). 2.1 Basic Growth Strategy. The basic growth strategy used in our algorithm is the same as that of RAN and RANEKF. The output of the network
for all three algorithms (RAN, RANEKF, and M-RAN) has the following form:

\[ f(x) = \alpha_0 + \sum_{k=1}^{K} \alpha_k \phi_k(x) , \tag{2.1} \]
where φ_k(x) is the response of the kth hidden neuron to the input x and α_k is the weight connecting the kth hidden unit to the output unit. α_0 is the bias term. Here, K represents the number of hidden neurons in the network. φ_k(x) is a gaussian function given by

\[ \phi_k(x) = \exp\!\left( -\frac{1}{\sigma_k^2} \lVert x - \mu_k \rVert^2 \right), \tag{2.2} \]

where µ_k is the center and σ_k is the width of the gaussian function, and ‖·‖ denotes the Euclidean norm. The learning process of M-RAN involves the allocation of new hidden units as well as the adjustment of network parameters. The network begins with no hidden units. As observations are received, the network grows by using some of them as new hidden units. The following two criteria must be met for an observation (x_n, y_n) to be used to add a new hidden unit to the network:

\[ \lVert x_n - \mu_{nr} \rVert > \epsilon_n , \tag{2.3} \]
\[ e_n = y_n - f(x_n) > e_{\min} , \tag{2.4} \]

where µ_nr is the center (of the hidden unit) closest to x_n, and ε_n and e_min are thresholds to be selected appropriately. When a new hidden unit is added to the network, the parameters associated with it are α_{K+1} = e_n, µ_{K+1} = x_n, and σ_{K+1} = κ‖x_n − µ_nr‖. κ is an overlap factor, which determines the overlap of the responses of the hidden units in the input space. When the observation (x_n, y_n) does not meet the criteria for adding a new hidden unit, the network parameters w = [α_0, α_1, µ_1^T, σ_1, …, α_K, µ_K^T, σ_K]^T are adapted using the EKF as follows:

\[ w_n = w_{n-1} + e_n k_n . \tag{2.5} \]

k_n is the Kalman gain vector given by

\[ k_n = [R_n + a_n^T P_{n-1} a_n]^{-1} P_{n-1} a_n . \tag{2.6} \]
a_n is the gradient vector and has the following form:

\[ a_n = \left[ 1,\; \phi_1(x_n),\; \phi_1(x_n)\,\frac{2\alpha_1}{\sigma_1^2}(x_n - \mu_1)^T,\; \phi_1(x_n)\,\frac{2\alpha_1}{\sigma_1^3}\lVert x_n - \mu_1 \rVert^2,\; \ldots,\; \phi_K(x_n),\; \phi_K(x_n)\,\frac{2\alpha_K}{\sigma_K^2}(x_n - \mu_K)^T,\; \phi_K(x_n)\,\frac{2\alpha_K}{\sigma_K^3}\lVert x_n - \mu_K \rVert^2 \right]^T . \tag{2.7} \]

R_n is the variance of the measurement noise. P_n is the error covariance matrix, which is updated by

\[ P_n = [I - k_n a_n^T]\, P_{n-1} + Q I , \tag{2.8} \]

where Q is a scalar that determines the allowed random step in the direction of the gradient vector.¹ If the number of parameters to be adjusted is N, P_n is an N × N positive definite symmetric matrix. When a new hidden unit is allocated, the dimensionality of P_n increases to

\[ P_n = \begin{pmatrix} P_{n-1} & 0 \\ 0 & P_0 I \end{pmatrix}, \tag{2.9} \]
and the new rows and columns are initialized by P_0, which is an estimate of the uncertainty in the initial values assigned to the parameters. The dimension of the identity matrix I is equal to the number of new parameters introduced by the new hidden unit.

2.2 Pruning Strategy. This section presents a scheme to prune superfluous hidden units in the M-RAN network. To remove the hidden units that make little contribution to the network output, first consider the output o_k of hidden unit k:

\[ o_k = \alpha_k \exp\!\left( -\frac{\lVert x - \mu_k \rVert^2}{\sigma_k^2} \right). \tag{2.10} \]

If α_k or σ_k in equation 2.10 is small, o_k might become small. Alternately, if ‖x − µ_k‖ is large, which means the input is far from the center of this hidden unit, the output becomes small. In order to decide whether a hidden neuron should be removed, the output values o_k of the hidden units are examined continuously. If the output of a hidden unit is less than a threshold for M consecutive inputs, then that hidden unit is removed from the network.²

¹ According to Kadirkamanathan and Niranjan (1993), the rapid convergence of the EKF algorithm may prevent the model from adapting to future data, and QI is introduced to avoid this problem.
Since the use of absolute values can cause inconsistency in pruning, the outputs of hidden units are normalized, and the normalized output values are used in the pruning criterion. This pruning strategy is expressed as follows:

• For every observation (x_n, y_n), compute the outputs o_k^n (k = 1, …, K) of all hidden units, using equation 2.10.
• Find the largest absolute hidden unit output value ‖o_max^n‖, and compute the normalized output values r_k^n = ‖o_k^n‖ / ‖o_max^n‖ (k = 1, …, K) for the hidden units.
• Remove the hidden units for which the normalized output is less than a threshold δ for M consecutive observations.
• Adjust the dimensionality of the EKF to suit the reduced network.

2.3 Performance of the Pruning Strategy for Function Approximation. The pruning strategy is added to the basic strategy of Section 2.1, and the resulting algorithm is tested on the function approximation problem of equation 1.1. Figure 2 shows the results. It can be clearly seen that the proposed pruning strategy helps to reduce the number of hidden neurons to 6 after 4500 samples. It can also be observed that the number of hidden neurons first increases to 13 (0–600 samples), after which pruning begins to take effect and brings the number of hidden units to 7 at around 1000 samples, where it stays until 4500 samples before reaching the correct value of 6 hidden units. This shows that pruning helps to realize a minimal RBFNN, although the transitions in the number of hidden neurons are not always smooth. Also, the minimal network was realized only after 4500 samples, which is rather long. This motivated us to include an extra criterion for growth to smooth the transitions.

2.4 Sliding Window RMS Criterion for Adding Hidden Neurons. Besides the novelty criteria given by equations 2.3 and 2.4, we include a third criterion, which uses the root mean square (RMS) value of the output error over a sliding data window before adding a hidden unit. The RMS value of the network output error at the nth observation, erms_n, is given by

\[ e_{\mathrm{rms}_n} = \sqrt{ \frac{ \sum_{i=n-(M-1)}^{n} e_i^2 }{ M } } , \tag{2.11} \]

² For a detailed analysis of this pruning criterion, refer to Yingwei et al. (1995).
Figure 2: Number and positions of hidden unit centers obtained with pruning strategy on static function approximation problem.
where e_i = y_i − f(x_i). In equation 2.11, for each observation, a sliding data window is employed to include a certain number of output errors based on y_n, y_{n−1}, …, y_{n−(M−1)}. When a new observation arrives, the data in this window are updated by including the latest datum and eliminating the oldest. For example, at the nth observation, the data window includes e_n, e_{n−1}, …, e_{n−(M−1)}; when the (n + 1)th observation arrives, the data window contains e_{n+1}, e_n, …, e_{n−(M−2)}. In this sense, the window slides along with the data obtained on-line; thus we call it a sliding data window. The size of the window is determined by the value of M, that is, the number of samples used to calculate the RMS error. Thus, the extra growth criterion to be satisfied is erms_n > e'_min, where e'_min is a threshold value to be selected.

2.5 The Final M-RAN Algorithm. In this section, the final M-RAN sequential learning algorithm is presented.

Algorithm. For each input x_n, compute:

\[ \phi_k(x_n) = \exp\!\left( -\frac{1}{\sigma_k^2} \lVert x_n - \mu_k \rVert^2 \right), \qquad f(x_n) = \alpha_0 + \sum_{k=1}^{K} \alpha_k \phi_k(x_n), \]
\[ \epsilon_n = \max\{ \epsilon_{\max}\, \gamma^n,\; \epsilon_{\min} \}, \quad \text{where } 0 < \gamma < 1 \text{ is a decay constant,}^3 \]
\[ e_n = y_n - f(x_n), \qquad e_{\mathrm{rms}_n} = \sqrt{ \frac{ \sum_{i=n-(M-1)}^{n} [y_i - f(x_i)]^2 }{ M } } . \]

If e_n > e_min, and ‖x_n − µ_nr‖ > ε_n, and erms_n > e'_min, then allocate a new hidden unit with α_{K+1} = e_n, µ_{K+1} = x_n, and σ_{K+1} = κ‖x_n − µ_nr‖.

Else, update the network parameters with the EKF:

\[ w_n = w_{n-1} + k_n e_n , \qquad k_n = [R_n + a_n^T P_{n-1} a_n]^{-1} P_{n-1} a_n , \qquad P_n = [I - k_n a_n^T]\, P_{n-1} + Q I . \]

Compute the outputs of all hidden units o_k^n (k = 1, …, K). Find the largest absolute hidden unit output value ‖o_max^n‖ and calculate the normalized value for each hidden unit,

\[ r_k^n = \left\lVert \frac{o_k^n}{o_{\max}^n} \right\rVert, \quad (k = 1, \ldots, K) . \]

If r_k^n < δ for M consecutive observations, then prune the kth hidden neuron and reduce the dimensionality of P_n to fit the requirements of the EKF.

³ The value of ε_n is decayed until it reaches ε_min.
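To make the flow of the algorithm concrete, the following is a minimal Python sketch of one M-RAN observation step under the equations above, specialized to a scalar input and output for brevity. The class layout, variable names, and edge-case choices (e.g., the width assigned to the very first unit) are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class MRAN:
    """Sketch of sequential M-RAN learning (scalar input/output for brevity)."""

    def __init__(self, eps_max=2.0, eps_min=0.2, gamma=0.977, e_min=0.02,
                 kappa=0.87, p0=1.0, r_n=1.0, q=0.02, delta=0.01,
                 M=25, e_rms_min=0.3):
        self.alpha0 = 0.0                       # bias term alpha_0
        self.alpha, self.mu, self.sigma = [], [], []
        self.P = np.array([[p0]])               # EKF covariance (bias only at start)
        self.prune_count = []                   # consecutive low-output counts
        self.errors, self.n = [], 0
        (self.eps_max, self.eps_min, self.gamma, self.e_min, self.kappa,
         self.p0, self.r_n, self.q, self.delta, self.M, self.e_rms_min) = (
            eps_max, eps_min, gamma, e_min, kappa, p0, r_n, q, delta, M, e_rms_min)

    def phi(self, x):                           # gaussian responses, equation 2.2
        return np.array([np.exp(-(x - m) ** 2 / s ** 2)
                         for m, s in zip(self.mu, self.sigma)])

    def predict(self, x):                       # network output, equation 2.1
        return self.alpha0 + float(np.dot(self.alpha, self.phi(x)))

    def step(self, x, y):
        self.n += 1
        e = y - self.predict(x)
        self.errors = (self.errors + [e])[-self.M:]
        e_rms = np.sqrt(np.mean(np.square(self.errors)))   # equation 2.11
        eps_n = max(self.eps_max * self.gamma ** self.n, self.eps_min)
        d_nr = min((abs(x - m) for m in self.mu), default=np.inf)

        if e > self.e_min and d_nr > eps_n and e_rms > self.e_rms_min:
            # growth criteria met (eqs. 2.3, 2.4, and the sliding-window RMS test)
            self.alpha.append(e)
            self.mu.append(x)
            self.sigma.append(self.kappa * d_nr if np.isfinite(d_nr)
                              else self.kappa)  # arbitrary width for the first unit
            self.prune_count.append(0)
            old = self.P                        # expand covariance, equation 2.9
            self.P = self.p0 * np.eye(old.shape[0] + 3)
            self.P[:old.shape[0], :old.shape[0]] = old
        else:
            # EKF update, equations 2.5-2.8, with the gradient of equation 2.7
            p = self.phi(x)
            a = [1.0]
            for k in range(len(self.mu)):
                c = 2.0 * self.alpha[k] * p[k]
                a += [p[k], c * (x - self.mu[k]) / self.sigma[k] ** 2,
                      c * (x - self.mu[k]) ** 2 / self.sigma[k] ** 3]
            a = np.array(a)
            k_gain = self.P @ a / (self.r_n + a @ self.P @ a)
            w = np.concatenate(([self.alpha0],
                                np.ravel(list(zip(self.alpha, self.mu, self.sigma)))))
            w = w + e * k_gain
            self.alpha0, rest = w[0], w[1:].reshape(-1, 3)
            self.alpha, self.mu, self.sigma = (list(rest[:, 0]), list(rest[:, 1]),
                                               list(rest[:, 2]))
            I = np.eye(len(w))
            self.P = (I - np.outer(k_gain, a)) @ self.P + self.q * I

        self.prune(x)

    def prune(self, x):
        # remove units whose normalized output stays below delta for M observations
        if not self.mu:
            return
        o = np.abs(np.array(self.alpha) * self.phi(x))     # equation 2.10
        r = o / max(o.max(), 1e-12)
        self.prune_count = [c + 1 if rk < self.delta else 0
                            for c, rk in zip(self.prune_count, r)]
        drop = {k for k, c in enumerate(self.prune_count) if c >= self.M}
        if drop:
            keep_cols = [0] + [1 + 3 * k + j for k in range(len(self.mu))
                               if k not in drop for j in range(3)]
            self.P = self.P[np.ix_(keep_cols, keep_cols)]  # shrink covariance
            for name in ("alpha", "mu", "sigma", "prune_count"):
                setattr(self, name, [v for k, v in enumerate(getattr(self, name))
                                     if k not in drop])
```

Driving this sketch with a stream of (x, y) pairs reproduces the grow/update/prune cycle whose behavior is reported in the sections that follow.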
This final algorithm is tested on the function approximation problem of equation 1.1, and the results are presented in Figure 3. From Figure 3, it can be seen that the 6 hidden units at the desired positions are obtained within 600 samples. Also, the transitions during the build-up and drop-off of the hidden neurons are smooth compared to Figure 2.

3 Performance of the Algorithm on Benchmark Problems

In this section, the sequential learning strategy described in Section 2 is applied to three benchmark problems. The first two are static function approximation problems, and the third is a time-series prediction problem. The first problem is taken from the benchmark set PROBEN1 (Prechelt 1994), set up recently and available from http://www.ipd.ira.uka.de/~prechelt/NIPS bench.html. The problem selected for this study is the hearta data set, which is based on the "heart disease" diagnosis data sets from the University of California, Irvine, machine learning repository. The second problem is to approximate the Hermite polynomial, and the data in this case are randomly sampled consistent with the underlying function and presented sequentially to the network. The third example is on-line prediction of a chaotic time series given by the Mackey-Glass equation, where an underlying function for the data does not exist. For all these problems, the performance of our algorithm is compared with the RANEKF and the RAN.

3.1 Function Approximation Problem: PROBEN1-hearta Benchmark. The hearta data set consists of 920 examples (690 training and 230 test examples), with each example described by 13 input attributes and 1 output attribute. Of the 920 examples, only 299 are complete; the rest have missing values. This makes hearta a difficult function approximation problem. Our networks for RAN, RANEKF, and M-RAN used 13 input units and a single continuous output to represent the five different levels of the output attribute. Following the guidelines given in Prechelt (1994), the missing values were replaced with the mean of the nonmissing values for that attribute. The training data for the networks consist of both the training set and the validation set, and the approximation error is calculated over the test data provided in the data set. The performance of the three algorithms is evaluated on the three different permutations of the hearta data set available in PROBEN1: hearta1, hearta2, and hearta3. The values for the various parameters used in this experiment are as follows: ε_max = 4.0, ε_min = 1.5, γ = 0.99, e_min = 0.1, κ = 0.4, P_0 = R_n = 1.0, Q = 0.00001, and η = 0.01. The threshold value δ = 0.00001, the size of the sliding data window M = 60, and the threshold value of the RMS error in the additional growth criterion e'_min = 0.25. Figure 4a shows the growth pattern for the three networks as they learn sequentially from the training data. Figure 4b shows the approximation error, measured by the squared error percentage (E) defined in Prechelt (1994), over the test data.
Figure 3: Number and positions of hidden unit centers achieved with M-RAN.
Figure 4: Performance of RAN, RANEKF, and M-RAN for hearta1.
Table 1: Performance of Different Networks for hearta1, hearta2, and hearta3.

Benchmark data | Network | Number of hidden neurons | Squared error percentage for testing set | Number of training epochs
hearta1 | M-RAN  | 8  | 4.14 | 1
hearta1 | RANEKF | 13 | 4.21 | 1
hearta1 | RAN    | 13 | 5.20 | 1
hearta1 | BP     | 32 | 4.55 | 47
hearta2 | M-RAN  | 9  | 3.81 | 1
hearta2 | RANEKF | 11 | 3.81 | 1
hearta2 | RAN    | 11 | 5.71 | 1
hearta2 | BP     | 16 | 4.33 | 54
hearta3 | M-RAN  | 7  | 3.80 | 1
hearta3 | RANEKF | 11 | 3.91 | 1
hearta3 | RAN    | 14 | 5.63 | 1
hearta3 | BP     | 32 | 4.89 | 46
E is expressed as follows:

\[ E = 100 \, \frac{o_{\max} - o_{\min}}{N \cdot P} \sum_{p=1}^{P} \sum_{i=1}^{N} (o_{pi} - t_{pi})^2 , \tag{3.1} \]
where o_min and o_max are the minimum and the maximum values of the output coefficients in the problem, N is the number of output nodes of the network, and P is the number of patterns (examples) in the data set considered. o_pi and t_pi are the actual and desired output values at the ith output node for the pth pattern, respectively. Table 1 shows the results from the three algorithms together with the performance of multilayer backpropagation (BP) networks for pivot architectures (Prechelt 1994). It can be seen from the table that our algorithm achieves the same or a lower approximation error than RAN, RANEKF, and BP with fewer hidden units. The values quoted for pivot architectures are from Table 11 of Prechelt (1994).

3.2 Function Approximation Problem: Hermite Polynomial. In this section, results for approximating the Hermite polynomial are presented. The Hermite polynomial is given by

\[ f(x) = 1.1\,(1 - x + 2x^2)\, \exp\!\left( -\frac{x^2}{2} \right). \tag{3.2} \]
A random sampling of the interval [−4, 4] is used to obtain 40 input-output data points for the training set.
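As a concrete illustration, the following Python fragment generates such a training set from equation 3.2; the sample count and interval follow the text, while the random seed is an arbitrary choice of ours.

```python
import numpy as np

def hermite(x):
    """Target function of equation 3.2."""
    return 1.1 * (1.0 - x + 2.0 * x ** 2) * np.exp(-0.5 * x ** 2)

rng = np.random.default_rng(0)          # seed is an arbitrary choice
x_train = rng.uniform(-4.0, 4.0, 40)    # 40 random samples in [-4, 4]
y_train = hermite(x_train)              # noise-free targets, as in Figure 5
```

Presented one pair at a time, these samples drive the sequential grow/update/prune cycle sketched earlier.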
All parameter values are chosen the same as those in Kadirkamanathan and Niranjan (1993), that is, ε_max = 2.0, ε_min = 0.2, γ = 0.977, e_min = 0.02, κ = 0.87, P_0 = R_n = 1.0, Q = 0.02, η = 0.05. The additional parameters required for our algorithm are as follows: the threshold value δ = 0.01, the size of the sliding data window M = 25, and the threshold value of the RMS error in the additional growth criterion e'_min = 0.3. Figure 5 shows the results from the three algorithms. Our algorithm achieves a slightly lower approximation error, measured by the RMS error⁴ calculated over 200 uniformly sampled data, than RANEKF, with only 7 hidden units compared to RANEKF's 13. RAN needed 18 hidden units and achieved an RMS error nearly an order of magnitude larger than both our algorithm and RANEKF. With no noise in the training set, M-RAN needed only about half the number of hidden units of RANEKF. The effect of noise on the performance of the three algorithms was analyzed by adding different levels of gaussian white noise to the training patterns, and here too M-RAN always realized more compact networks with smaller approximation errors (Yingwei et al. 1995). As an example, when the noise variance was 0.07, the number of hidden units was 11 for M-RAN, 40 for RANEKF, and 20 for RAN.

3.3 Prediction of Chaotic Time Series. The chaotic time series used is the Mackey-Glass series, which is governed by the time-delay ordinary differential equation

\[ \frac{ds(t)}{dt} = -b\,s(t) + a\,\frac{s(t-\tau)}{1 + s(t-\tau)^{10}} , \tag{3.3} \]

with a = 0.2, b = 0.1, and τ = 17. Integrating the equation over the time interval [t, t + Δt] by the trapezoidal rule yields

\[ x(t+\Delta t) = \frac{2 - b\,\Delta t}{2 + b\,\Delta t}\, x(t) + \frac{a\,\Delta t}{2 + b\,\Delta t} \left[ \frac{x(t+\Delta t-\tau)}{1 + x^{10}(t+\Delta t-\tau)} + \frac{x(t-\tau)}{1 + x^{10}(t-\tau)} \right]. \tag{3.4} \]

Setting Δt = 1, the time series is generated under the condition x(t − τ) = 0.3 for 0 ≤ t ≤ τ (τ = 17). The initial 4000 data points are discarded to let the initialization transients decay.⁵ The series is predicted ν = 50 sample steps ahead using four past samples: s_{n−ν}, s_{n−ν−6}, s_{n−ν−12}, s_{n−ν−18}.
⁴ The RMS error is used for comparison to be consistent with Kadirkamanathan and Niranjan (1993).
⁵ These conditions are the same as those in Platt (1991).
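The series itself is easy to reproduce; the following is a minimal sketch of the recursion of equation 3.4 with Δt = 1 and the stated initial condition (the burn-in handling and array layout are ours).

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, burn_in=4000):
    """Generate the Mackey-Glass series via the trapezoidal recursion (eq. 3.4), dt = 1."""
    x = np.zeros(burn_in + n_samples + tau + 1)
    x[:tau + 1] = 0.3                      # x(t - tau) = 0.3 for 0 <= t <= tau
    g = lambda v: v / (1.0 + v ** 10)
    for t in range(tau, len(x) - 1):
        x[t + 1] = ((2 - b) / (2 + b)) * x[t] \
                 + (a / (2 + b)) * (g(x[t + 1 - tau]) + g(x[t - tau]))
    return x[burn_in:burn_in + n_samples]  # discard initialization transients
```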
Figure 5: Performance of RAN, RANEKF, and M-RAN for Hermite Polynomial without noise.
Hence, the nth input-output data for the network to learn are

\[ x_n = [s_{n-\nu},\, s_{n-\nu-6},\, s_{n-\nu-12},\, s_{n-\nu-18}]^T , \qquad y_n = s_n , \]

whereas the step-ahead predicted value at time n − ν is given by

\[ z_n = f^{(n-\nu)}(x_n) , \tag{3.5} \]

where f^{(n−ν)} is the network at time (n − ν). The ν-step-ahead prediction error is

\[ \epsilon_n = s_n - z_n . \tag{3.6} \]

The exponentially weighted prediction error (WPE) is chosen as the measure of the network's prediction performance. WPE at time n can be computed recursively as

\[ \mathrm{WPE}^2(n) = \lambda\, \mathrm{WPE}^2(n-1) + (1 - \lambda)\, \lVert \epsilon_n \rVert^2 , \tag{3.7} \]
where 0 < λ < 1; here λ = 0.95. The parameter values are selected as follows: ε_max = 0.7, ε_min = 0.07, γ = 0.999, e_min = 0.05, κ = 0.87, P_0 = R_n = 1.0, Q = 0.0002, η = 0.02, δ = 0.0001, M = 90, and e'_min = 0.04. Figure 6c displays the approximation curve of M-RAN during the sample time between 4500 and 5000. The desired output curve is plotted with a dash-dotted line, while the output of the minimal network is plotted with a solid line. The figure shows that the approximation curve of our M-RAN network fits the desired curve in most places. The number of hidden units and the network performance are shown in Figures 6a and 6b, respectively. In this case, the value of WPE is almost the same for all three algorithms, while M-RAN needed only 29 hidden units, compared with 39 hidden units for RANEKF and 81 for RAN.

4 Conclusions

In this article, a sequential learning algorithm for realizing a minimal RBF neural network is developed. The algorithm augments the growth criteria of RANEKF and combines with them a scheme to remove those hidden units that make insignificant contributions to the output of the network. Using benchmark problems covering both static (hearta, Hermite polynomial) and dynamic (chaotic time-series prediction) approximations, the algorithm is shown to produce an RBF neural network with smaller complexity than both RANEKF and RAN, with equal or better approximation accuracy.
Figure 6: Performance of RAN, RANEKF, and M-RAN for the Mackey-Glass time-series prediction problem.
Acknowledgments

We thank the anonymous reviewers for their constructive comments and also for pointing out the PROBEN1 benchmark database.

References

Bors, A. G., & Gabbouj, M. (1994). Minimal topology for a radial basis functions neural network for pattern classification. Digital Signal Processing, 4, 173–188.
Chen, C., Chen, W., & Cheng, F. (1993). Hybrid learning algorithm for Gaussian potential function network. IEE Proceedings-D, 140, 442–448.
Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation approach to sequential learning with neural networks. Neural Computation, 5, 954–975.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Musavi, M., Ahmed, W., Chan, K., Faris, K., & Hummels, D. (1992). On training of radial basis function classifiers. Neural Networks, 5, 595–603.
Platt, J. (1991). A resource allocating network for function interpolation. Neural Computation, 3, 213–225.
Prechelt, L. (1994). Proben1—a set of neural network benchmark problems and benchmarking rules (Tech. Rep. 21/94). Karlsruhe: Fakultät für Informatik, Universität Karlsruhe.
Tao, K. (1993). A closer look at the radial basis function (RBF) networks. In A. L. A. Singh (Ed.), Conference Record of the 27th Asilomar Conference on Signals, Systems and Computers (pp. 401–405). New York: IEEE.
Yingwei, L., Sundararajan, N., & Saratchandran, P. (1995). A sequential learning scheme for function approximation using minimal radial basis function (RBF) neural networks (Tech. Rep. CSP/9505). Singapore: Center for Signal Processing, School of Electrical and Electronic Engineering, Nanyang Technological University.
Received December 11, 1995; accepted June 21, 1996.
ARTICLE
Communicated by Andrew Barto and Steven Nowlan
A Neuro-Mimetic Dynamic Scheduling Algorithm for Control: Analysis and Applications

Harpreet S. Kwatra
Francis J. Doyle III
School of Chemical Engineering, Purdue University, West Lafayette, IN 47907-1283 USA

Ilya A. Rybak
James S. Schwaber
Neural Computation Program, E. I. duPont de Nemours & Co., Wilmington, DE 19880-0328 USA
A simple neuronal network model of the baroreceptor reflex is analyzed. From a control perspective, the analysis suggests a dynamic scheduled control mechanism by which the baroreflex may perform regulation of the blood pressure. The main objectives of this work are to investigate the static and dynamic response characteristics of the single neurons and the network, to analyze the neuromimetic dynamic scheduled control function of the model, and to apply the algorithm to nonlinear process control problems. The dynamic scheduling activity of the network is exploited in two control architectures. Control structure I is drawn directly from the present model of the baroreceptor reflex. An application of this structure for level control in a conical tank is described. Control structure II employs an explicit set point to determine the feedback error. The performance of this control structure is illustrated on a nonlinear continuous stirred tank reactor with van de Vusse kinetics. The two case studies validate the dynamic scheduled control approach for nonlinear process control applications.
1 Introduction

A neuronal network architecture for scheduled control of nonlinear process systems was presented recently by Doyle III et al. (1994). The network was originally employed in a simple model of the mammalian baroreceptor reflex (Doyle III et al. 1997). This neuronal network was observed to exhibit a scheduled control activity. Therefore, it was proposed to apply the network in a dynamic scheduled control framework for applications in nonlinear process control. This article presents new results in two main categories: (1) qualitative and quantitative analyses of the dynamics of both the single
neuron model and the network and (2) case studies of control of a conical tank and a nonlinear continuous stirred tank reactor (CSTR) with van de Vusse kinetics. The baroreceptor vagal reflex is responsible for the regulation of blood pressure and displays a rich range of controlled dynamic behavior. The components of the reflex include pressure sensors in the major blood vessels, a neuronal processor, and several classes of motor actuators of the heart. The pressure signal is encoded by first-order neurons. This information is projected onto second-order neurons in the nucleus tractus solitarius (NTS) (Schwaber 1987), which process this information and modulate actuator signals to the heart to change cardiac rate, contractile timing and pattern, and contractile strength. The closed-loop behavior is fairly intuitive: as pressure rises above "normal," a control signal is generated in the NTS, which lowers the heart rate. The first-order sensory neurons are well described as highly sensitive, rapidly adapting neurons that transduce and encode each pulse of pressure with a train of spikes on the rising phase of pressure. Because of the distribution of pressure thresholds, the dynamic first-order neurons have a distributed sensitivity to the measured variable (blood pressure) and its rate of change. Data in the literature (Donoghue et al. 1982; Czachurski et al. 1988; Bradd et al. 1989; Rogers et al. 1993) suggest that these neurons map the spatial pressure distribution onto the inputs to the second-order neurons and "schedule" their activities during dynamic changes of the pressure. This observation and general principles of sensory system organization suggest that second-order neurons have a "competitive" response to the input from the first-order neurons. It is plausible that a lateral inhibition organization provides both the competition between the second-order neurons and their scheduled response. This competition may be simulated by a simple neuronal network model in which each second-order neuron responds and provides a control signal in a definite static (dependent on the thresholds of the corresponding first-order neurons) and dynamic (dependent on the velocity of pressure increase) range of pressure changes. Such behavior can be exploited in the formulation of scheduling algorithms for use in controller design. Just as competition between second-order neurons leads to a selective dynamic response, this attribute can be utilized to apply the appropriate level of nonlinear feedback compensation for a control system. One particularly attractive feature of this biological algorithm is the graceful transition exhibited between operating regimes: there does not appear to be a discrete set of operating levels but rather a continuum. In addition, the transition is dynamic; it depends not only on the magnitude of the blood pressure but on its rate of change as well. These two characteristics of the algorithm are in direct contrast to traditional scheduling approaches, which utilize lookup tables for a discrete set of operating levels (Åström & Wittenmark 1989).
Figure 1: Simplified architecture of the baroreceptor vagal reflex. The lateral inhibitory function proposed for the inhibitory interneurons is implemented by the first-order neurons themselves for the sake of simplicity. Solid lines represent synaptic excitation; dashed ones signify inhibition. A 9 × 6 network of nine first-order and six second-order neurons is shown.
2 A Model of the Baroreceptor Reflex

An analysis of the experimental results (Rogers et al. 1993; Doyle III et al. 1997) reveals the following dynamic properties of the baroreceptor reflex control structure:

1. The second-order NTS neurons respond to blood pressure changes with bursts of activity that have a characteristic frequency much lower than the frequency of the cardiac cycle.
2. The responses suggest that NTS neurons are inhibited immediately after and perhaps immediately before the bursts.
3. It is reasonable to assume that this bursting activity is the source of regulatory signals that cause the compensatory changes in pressure.
4. It is plausible that each NTS neuron provides this regulation in a defined static and dynamic range of pressure.

These observations, combined with other physiological data and general principles of sensory system organization, support the plausibility of a simple baroreflex model. A diagram of the proposed network model for the closed-loop baroreflex is shown in Figure 1 (note that the focus of this model is restricted to the vagal baroreflex, which affects the cardiac output; the effects of the sympathetic system on the heart and the peripheral resistance in the vascular beds have been ignored). The first-order neurons, which are arranged in increasing order of pressure threshold, receive an excitatory input signal that is proportional to the blood pressure. The second-order neurons
receive synaptic excitation from the first-order neurons, as depicted in Figure 1 by the solid lines. While the synaptic inhibition of the second-order neurons probably occurs via inhibitory neurons, a simplified model that does not use interneurons is employed, and their function is implemented using first-order neurons only. This is achieved by direct synaptic inhibition from the neighboring first-order neurons (i.e., the periphery of the receptive field; Hubel & Wiesel 1962), as depicted in Figure 1 by the dashed lines. A more realistic mechanism could employ inhibitory interneurons and reciprocal inhibition between the second-order neurons.

2.1 Model of a Single Neuron. Following the Hodgkin-Huxley formalism, the nonlinear dynamics of the membrane potential of a neuron can be described by the following differential equation (Doyle III et al. 1997):

\[ c\,\dot{V} = g_0 (V_r - V) + \sum_i g_i (E_i - V) + I , \tag{2.1} \]
where c is the membrane capacitance, V is the membrane potential, V_r is the resting membrane potential, g_0 is the generalized conductance, g_i is the deviation in the conductance of the ith ionic channel, E_i is the reversal potential of the ith ionic channel, and I is the input current. Three types of conductances (g_i) are used in the current model: conductances for excitatory and inhibitory synaptic channels that are opened by action potentials (AP) coming from other neurons, and a conductance for the potassium channel opened by AP generation in the neuron itself. The potassium channel conductance, threshold dynamics, and neuronal output are modeled by additional nonlinear differential equations. A detailed description, along with the parameter values, is available in Doyle III et al. (1997). Although the model represents a considerable simplification of the more extensive multiple channel neuron kinetic models (Schwaber et al. 1993), it describes the behavior of the baroreflex neurons with sufficient accuracy for the purpose of network modeling.

Analysis of Dynamics of Single Neurons. The neuron model above is composed of several complex nonlinear differential equations. It is extremely difficult to interpret the input-output behavior of the model directly from these differential equations. This motivates an analysis of the dynamics of the neuron model in terms of its steady-state locus and dynamic response. Such an analysis validates the model against known experimental results and also facilitates a system-level interpretation of the model. This interpretation is useful when the neuron model and its network are employed in process control architectures. The response of a single first-order neuron model to several types of input was examined by computer simulation.
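Before turning to those simulations, the following Python fragment sketches how a membrane trace can be obtained from equation 2.1 by forward Euler integration. The parameter values and the single lumped synaptic channel are illustrative placeholders of ours; the actual model adds state equations for the potassium conductance, threshold dynamics, and spike output (Doyle III et al. 1997).

```python
import numpy as np

def simulate_membrane(I, dt=0.1, c=1.0, g0=0.05, v_rest=-60.0,
                      g_syn=0.0, e_syn=0.0):
    """Forward-Euler integration of equation 2.1 (illustrative parameters).

    I: array of input current samples; g_syn/e_syn: one lumped synaptic
    channel standing in for the model's several conductances.
    """
    v = np.empty(len(I))
    v[0] = v_rest
    for t in range(len(I) - 1):
        dv = (g0 * (v_rest - v[t]) + g_syn * (e_syn - v[t]) + I[t]) / c
        v[t + 1] = v[t] + dt * dv
    return v

# e.g., a pressure-proportional current step, I = kP * P
trace = simulate_membrane(np.r_[np.zeros(200), 2.0 * np.ones(800)])
```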
Figure 2: (a) Nonlinear steady-state behavior of a first-order neuron (baroreceptor). Spiking frequency is calculated as the inverse of the interspike interval. (b) Dynamic response of a first-order neuron to positive and negative step changes in blood pressure. The upper graph is the trace of the spiking frequency calculated as the inverse of the interspike interval; the lower graph shows the blood pressure signal. (c) Response of a first-order neuron to a naturalistic blood pressure trace.
The first simulation, shown in Figure 2a, demonstrates the well-known nonlinear steady-state behavior of the baroreceptor neurons. Below the threshold blood pressure of 105 mmHg, there is no response; the neuron starts spiking once the blood pressure exceeds the threshold value. Note that physiologically, there are about a hundred baroreceptors per nerve, with pressure thresholds distributed from well below (35 mmHg) to well above (170 mmHg) the range of resting mean arterial pressure (Kirchheim 1976; Kumada et al. 1990). It is this distributed sensitivity to pressure that lends itself to the scheduling response of the neuronal network described in section 2.2. Figure 2b shows the response of the neuron to positive and negative step changes in blood pressure. The dynamic nature of the neuron is evident here, as the sudden rise in blood pressure results in a large increase in the spiking frequency, which then decays to a lower value (adaptation). Correspondingly, a drop in blood pressure causes the neuron to stop spiking altogether and then recover to the steady-state spiking frequency (postexcitatory depression). These characteristics validate the neuron model against published experimental data (Brown et al. 1976; Kirchheim 1976). In a typical electrophysiological experiment on a rat, a 10 mmHg step in blood pressure was observed to raise the baroreceptor discharge from around 35 Hz to 75 Hz, which then adapted to around 50 Hz (Brown 1980). Similarly, the negative step in pressure caused a marked depression in discharge with subsequent recovery to the original steady-state spiking frequency.
Figure 2c depicts the response of the neuron to an input signal that is similar to the physiological blood pressure trace. The pressure signal has a sharp rise, a slower fall, and a time period of around 150 ms (in the range of the cardiac period of a rat; Doyle III et al. 1997). The figure shows that the neuron fires action potentials mostly on the rising phase of the pressure pulse, which is consistent with observed behavior (Kirchheim 1976; Karemaker 1987). This sensitivity to the rate of change of pressure (dP/dt) critically contributes to the dynamic nature of the scheduling observed for the neuronal network described in the next section.

2.2 A Network of First-Order and Second-Order Neurons. The general architecture of a neuronal network with excitation and inhibition was discussed at the beginning of section 2. Specifically, a 4 × 1 neuronal structure, in which four first-order neurons converge onto one second-order neuron, was employed as a basic block for the network. Allowing for overlap between the first-order neurons, six of these blocks were combined to yield the neuronal network shown in Figure 1. Note that in this network, the second-order neurons "share" the inputs from the first-order neurons, so as to provide a narrower pressure sensitivity. As will be seen below, such a network architecture gives rise to a scheduled response from the second-order neurons in specific narrow ranges of the input signal.

Analysis of Dynamics of the Network. Just as for the individual neuron models, the response of the network is nonlinear and complex. In fact, the inhibition-excitation architecture of the network makes its response even more complicated and difficult to predict. Once again, this motivates an analysis of the dynamics of the network before it is employed in a simulation of the baroreceptor reflex.

Static Response of the Network. First, the static response of the network is analyzed. By static, it is meant that the rate of change of the input signal is small. To implement this analysis, the input (blood pressure) is ramped up very slowly (at the rate of 0.6 mmHg/sec), and the action potentials for each of the second-order neurons are recorded. Note that the input signal drives the nine first-order neurons of Figure 1, and the output signals from these first-order neurons in turn drive the six second-order neurons. Figure 3a shows that the static response of the neurons is distributed in pressure space. The second-order neurons of the network fire in specific ranges (receptive fields) of the blood pressure input signal. This aspect of the network enables it to "schedule" the network input signal to the output of the network and may be exploited in algorithms for control. In fact, the shapes of the response graphs resemble one-dimensional radial basis functions (Hunt et al. 1992) and could be employed in a similar manner for modeling and control.

Dynamic Response of the Network. By virtue of the fact that the first-order neurons in the network are intrinsically dynamic (see section 2.1), the outputs of the network exhibit significant dynamic behavior as well. This is demonstrated in Figure 3b, in which the responses of the second-order
Figure 3: (a) Static receptive fields of the six second-order neurons of Figure 1. (b) Dynamic receptive fields of the six second-order neurons of Figure 1, at 10 Hz.
neurons are plotted for the case when the network is driven by a sinusoid of frequency 10 Hz (close to the heart rate of a rat), added to a slowly ramping input signal (at the rate of 0.6 mmHg/sec). This input excites the dynamics of the network at the frequency of the sinusoid. The dynamic responses
appear sparser than the corresponding static ones seen in Figure 3a, indicating decreased overall network activity for inputs with high-frequency content. More interestingly, the figure shows that the response from each of the second-order neurons spreads beyond the static response range. For instance, the static receptive field for the lowermost neuron ranges from 3.0 to 6.0 units of input level, while the dynamic receptive field is spread from 3.0 to 7.2 units. Due to the overlap in the dynamic receptive fields observed in Figure 3b, more than one second-order neuron may be active at any time, depending on the frequency content of the input signal. If the input level is rising, a second-order neuron may fire even when the input value is below its static range of response. This may happen because the frequency content in the input signal might be sufficient to excite its dynamics. Note that while the scheduling functionality of the network is due to its excitation-inhibition architecture (see Figure 1), the sensitivity to the dynamic component of the input is attributable to the dP/dt sensitivity of the first-order neurons. From the point of view of control, this dynamic nature of the network is especially relevant. The scheduling function of the network will have a dynamic component as well. In traditional control theory, gain scheduling is restricted to slowly time-varying scheduling variables, as changes in these are not explicitly taken into account in the control design. The dynamic network above would be expected to have interesting dynamic scheduling properties that could be exploited to optimize control performance in process system applications.

2.3 A Simple Model of the Cardiovascular System. The previous sections described the models of the first- and second-order neurons and the neuronal network. Two more blocks have to be specified to complete the model of the baroreceptor reflex. First, the sum of the outputs from the second-order neurons forms the input for an intermediate subsystem (see Figure 1), which is modeled as a simple linear filter. This dynamic system captures the effect of the neural circuitry that lies between the second-order neurons and the heart. This includes the interneurons in the NTS (Spyer 1994) and the motor neurons in the nucleus ambiguus (Loewy & Spyer 1990). Next, a first-order approximation of the cardiovascular dynamics is described. The blood pressure decays exponentially from its current level to the level P_0 with time constant τ_P. At selected time points, the pressure responds with a jump to the level P_m in response to the pumping action of the heart. One of the driving forces for this subsystem is the disturbance P_i, which represents the effects of an external agent (e.g., a drug infusion). The second input is the feedback signal from the intermediate subsystem. A more detailed description is available in Doyle III et al. (1997). On the afferent side, the input signal to the first-order neurons depends on the pressure P via the amount of stretch in the blood vessels, modeled here by a simple linear relationship, I = k_P P, where k_P is a "tuning" coefficient.
Figure 4: Closed-loop response of a simple network model of the baroreceptor reflex (time in ms).
2.4 Computer Simulation of the Baroreceptor Reflex. The analyses of the dynamics of the single neuron and the network provide a reasonable starting point for a simulation of the simple model of the baroreceptor reflex. For the model described, a network with four first-order neurons and one second-order neuron is employed. In the general case of multiple second-order neurons, the sum of the outputs from the second-order neurons may be used. Figure 4 shows the responses of the four first-order neurons (the top two rows) and one second-order neuron (the lower left trace) to a fluctuating pressure signal (the bottom right trace). The cardiovascular system experiences a positive disturbance at time 500 ms, which causes the mean pressure to rise. Due to the barotopical distribution of thresholds, the first-order neurons respond sequentially to increasing mean blood pressure. Hence, the first-order neuron with the lowest threshold (number I) displays the greatest amount of activity. The middle pair of first-order neurons (numbers II and III) excites the second-order neuron, while the other two first-order neurons (numbers I and IV) inhibit it. When the feedback loop is open (not shown here), the persistent disturbance drives the mean pressure to levels much higher than the ones seen in Figure 4. When the feedback loop is closed, the second-order neuron provides the pressure control. As the pressure enters the sensitive range of the second-order neuron, a signal burst is sent to the intermediate block. This block drives the heart with a negative feedback signal, leading to a temporary decrease in the pressure level.
The persistent external signal drives the pressure up again, and the trend is repeated. The low-frequency bursts exhibited by the second-order neuron are similar to the electrophysiological recordings for NTS neurons (Rogers et al. 1993). While the network behavior of the proposed baroreflex model is a reasonable first approximation of the experimentally recorded neuronal behavior (Doyle III et al. 1997), there is ample scope for refinement. However, the simulation successfully illustrates the dynamic scheduled response of a network of complex neurons and shows how the neurons in this specific network can be employed to model and predict the behavior of the baroreceptor reflex. This network architecture can be exploited in process control algorithms, the subject of discussion in the second half of the article.
3 Applications to Nonlinear Process Control

From a control perspective, an interesting feature of the network model of section 2.2 is that individual second-order neurons are active in a narrow static and dynamic range of pressure changes. While the scheduling activity is a result of the network architecture, its dynamic nature is due to the intrinsic behavior of the neurons themselves. These static and dynamic components of the scheduling activity were analyzed in section 2.2. The scheduling functionality of the neuronal network described by the receptive fields in Figures 3a and 3b is of a discrete nature. While the overlapping of the dynamic receptive fields may provide smoother transitions among the scheduled outputs or channels, the range in which a channel is active is nevertheless discrete (e.g., the lowermost neuron in Figure 3b is active from 3.0 to 7.2 units of input). This discrete nature can be traced back to the first-order neurons (baroreceptors), which are active only above their pressure thresholds. Motivated by this insight, a discrete gain scheduling approach is adopted in the current control design. Note, however, that the baroreceptor reflex employs a large number of baroreceptors and NTS neurons, such that the discrete scheduling would approach a continuous nonlinear scheduling function. Based on this last observation, a dynamic gain scheduling controller of the continuous nonlinear form is being researched (Kwatra & Doyle III 1995). Gain scheduling, which has been in use in the chemical process industry for more than a decade, is a powerful method for compensating for process nonlinearities when some measurable indicator of changing plant dynamics is available (Åström & Wittenmark 1989). It may then be possible to eliminate the influence of varying dynamics by changing the parameters of the regulator as a function of the measured scheduling variable. Typical applications include distillation column control (Shinskey 1977), pH control (Gulaian & Lane 1990), and batch process control (Luyben 1990). Experience has shown gain scheduling to be both effective and easy to use for process
Figure 5: Schematic of control structure I.
systems (Åström & Hägglund 1990). Given its success, any improvements that build on its strengths would be clearly motivated. In classical gain scheduling, the gains (or other controller parameters) are scheduled according to the instantaneous values of the scheduling variable, and no convolution over time is involved. Even if the scheduling variable measures dynamic aspects of the system, variations in it are not explicitly dealt with during control design. For instance, although aircraft velocity (Mach number) is often used as the exogenous scheduling variable for the control of aircraft acceleration (Reichert 1992), the converse is not true; that is, acceleration is not normally employed as a scheduling variable for the control of aircraft velocity. Typically, each controller is obtained for a fixed value of the scheduling variable, independent of past values and time. This feature of the traditional gain scheduling design methodology is directly responsible for its limitation to slow variations in the scheduling parameter (Shamma & Athans 1992). By contrast, the scheduling activity of the network of neurons of the earlier sections has a distinct dynamic component. The dynamic scheduling activity of the network holds promise for exploitation in scheduling algorithms for the control of nonlinear process systems, as shown below. Two control architectures are described. First, a control scheme inspired by the baroreceptor reflex model is used to implement disturbance rejection in a conical tank. The shortcomings of this control scheme are addressed in a second control architecture that uses error feedback and a set point. This control scheme is demonstrated on a nonlinear CSTR with van de Vusse kinetics.

3.1 Control Structure I. The first control structure (see Figure 5) is based on the network control scheme used in the simulation of the baroreceptor reflex (see section 2.4). The cardiovascular system of Figure 1 is replaced
by the nonlinear process being controlled, with the output scaled to match the pressure variable that is fed to the network. The neuronal network is identical to that used in the baroreflex simulation, except that the outputs of the second-order neurons are passed through proportional-integral (PI) controllers before being summed. The PI controllers facilitate additional tuning specific to the nonlinear process being controlled. The intermediate subsystem is the same as before: a linear first-order filter. This control design was tested on a simple nonlinear process system. The system under consideration is a first-order process with a gain that varies nonlinearly as a function of the process output. A physical example of such a system is a level-control problem in a tank with a cross-sectional area that varies with height (e.g., a conical or spherical holding tank).

Case Study: Level Control in a Conical Tank. Holding tanks whose cross-sectional area varies as a function of height are common in the chemical process industry. A conical tank model demonstrates control structure I. The differential equation of a conical tank can be written as

\[ \tau \dot{y} = -y^{-3/2} + k\, y^{-2} u + d , \tag{3.1} \]
where y (the controlled variable) is the level of the fluid in the tank, u (the manipulated variable) is the volumetric feed rate to the tank, d is a disturbance input, and τ and k are constants. This model was obtained from a material balance on the tank, assuming that the discharge coefficient is proportional to √y. Note that the first two terms on the right-hand side of equation 3.1 are nonlinear in y. Regulator control of the conical tank was implemented with control structure I. The gains of the PI controllers were tuned so as to yield identical local responses around several operating points, as is done in input-output linearization techniques (Isidori 1989). The integral action of all the controllers was identical and was chosen such that the disturbance was rejected within 60 sec. The simulation of the control system for a large negative disturbance is shown in Figure 6. The control for the same disturbance using a linear controller obtained by internal model control (IMC) design techniques (Morari & Zafiriou 1989) for the steady-state operating point is also shown. The graph suggests that the dynamic network controller adjusts for the rising fluid level in the tank and prevents the overshoot that occurs for the linear control. This dynamic aspect of the scheduling activity is of interest in scheduling control algorithms.
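A minimal sketch of the plant side of this case study follows; the Euler discretization, parameter values, and the single crude PI loop are illustrative assumptions of ours, not the paper's tuned, network-scheduled design.

```python
def tank_step(y, u, d, dt=0.1, tau=1.0, k=1.0):
    """One Euler step of the conical-tank model, equation 3.1 (illustrative parameters)."""
    ydot = (-y ** -1.5 + k * y ** -2 * u + d) / tau
    return max(y + dt * ydot, 1e-3)        # keep the level positive

# crude single PI loop around one operating level (stand-in for the scheduled bank)
y, integral, y_sp, kp, ki = 1.0, 0.0, 1.0, 2.0, 0.5
for step in range(600):                    # 60 s of simulated time
    d = -0.3 if step > 100 else 0.0        # large negative disturbance
    e = y_sp - y
    integral += e * 0.1
    u = max(kp * e + ki * integral, 0.0)   # feed rate cannot be negative
    y = tank_step(y, u, d)
```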
Figure 6: Regulator response of a conical tank controlled by a bank of PI controllers, which was dynamically scheduled by a 6 × 3 network.
Figure 7: Schematic of control structure II.
Control structure I suffers from several shortcomings. First, although it demonstrated reasonable regulator control (disturbance rejection) in this example, it has no facility to implement servo control (set point tracking), a necessary component of most chemical process control applications, because grade changeovers and set point adjustments are common. Second, the individual controllers in the feedback loop are driven by the outputs of the second-order neurons, as opposed to the usual error feedback signal. The implication is that zero offset is not guaranteed in this control design. A second control structure was developed to overcome these problems. While control structure I was consistent with the neurobiological bases in the baroreceptor reflex model, control structure II was modified to meet the requirements of typical process control problems.

3.2 Control Structure II. Figure 7 shows the block diagram of control structure II. The scheduling network is grouped together in a block labeled "scheduler," and the control elements are placed in the "regulator." For instance, in the level control problem, the regulator block would simply contain the PI controllers, and the scheduler would have the network of
neurons. For the general control problem, the regulator block would contain an array of controllers of an appropriate type, such as PID or IMC controllers, and the scheduler block would have a suitably sized network of first-order and second-order neurons. A second, and perhaps debatable, modification is the introduction of a set point signal in the control loop. This is necessary from the process control point of view, because almost all process control problems involve setting and adjusting the set point to achieve the desired output from the process. From the point of view of neurophysiology, this is less justifiable because there is no known explicit set point for the baroreceptor reflex. However, since it is known that the central nervous system does have signals converging onto the crNTS (cardiorespiratory NTS) and the medulla (Spyer 1994), it may be reasonable to speculate that some type of implicit set point signal is conveyed to the network of second-order and higher-order neurons of the reflex. The auxiliary or scheduling variable from the process is fed back to the scheduler block, which contains the dynamic scheduling network of neurons. In Figure 7, the process output itself serves as the scheduling variable. Within the scheduler block, this signal forms the input to the first-order neurons. The outputs of the second-order neurons are passed on to the regulator block. The values of these outputs or channels vary between zero and around unity: zero denotes that the channel is inactive, and unity indicates that the channel is fully active. This scheduling activity of the network is dynamic, as illustrated in section 2.2. As a result, two neighboring channels may be partially active as the scheduling variable makes the transition from one level to another. In addition, there is sensitivity to the rate of change of the scheduling variable, which makes the scheduling functionality truly dynamic. The feedback error signal is obtained by subtracting the process output from the set point and is then fed into the regulator block along with the scheduler channel outputs. Within the regulator block, each of the channel outputs is simply multiplied with the error signal and fed into the controller corresponding to that channel (see Figure 8). This activates the controllers corresponding to the active scheduler channels, in proportion to their level of activity. Hence, if only one channel is active, the error signal gets multiplied by unity and fed into the controller corresponding to that channel. If two neighboring channels are active with, say, outputs of 0.5 unit each, the error signal gets equally divided between the two corresponding controllers. The outputs of all the controllers are added to give the control input for the process. This feedback control loop is applied to a nonlinear chemical reactor in the following section.
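A minimal sketch of the regulator computation just described, with hypothetical channel activations and simple PI controllers standing in for the paper's controller bank:

```python
class PI:
    """Textbook PI controller for one scheduler channel."""
    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt, self.integral = kp, ki, dt, 0.0

    def update(self, e):
        self.integral += e * self.dt
        return self.kp * e + self.ki * self.integral

def regulator(error, channels, controllers):
    """Split the feedback error across channels and sum the controller outputs.

    channels: second-order neuron activations in [0, 1] (see Figure 8);
    each controller sees the error weighted by its channel's activity.
    """
    return sum(ctrl.update(a * error) for a, ctrl in zip(channels, controllers))

# e.g., two neighboring channels half active during a transition
dt = 0.1
bank = [PI(kp=1.0, ki=0.2, dt=dt) for _ in range(6)]  # illustrative gains
u = regulator(error=0.05, channels=[0, 0.5, 0.5, 0, 0, 0], controllers=bank)
```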
Figure 8: Exploded view of the regulator block in control structure II.
Case Study: Dynamic Scheduled Control of a Nonlinear CSTR. A Reaction with van de Vusse Kinetics. van de Vusse (1964) described a chemical reaction mechanism that has recently been used by several researchers as a benchmark for nonlinear studies (Kantor 1986; Doyle III & Morari 1992; Engell & Klatt 1993). It involves both a consecutive and a parallel reaction that give rise to undesired by-products. The reaction equations can be written as

\[ A \xrightarrow{k_1} B \xrightarrow{k_2} C , \qquad 2A \xrightarrow{k_3} D . \tag{3.2} \]
The control objective, then, is to maximize the rate of production of the desired product (B) and to minimize the rates of production of the by-products (C and D) by manipulating the catalyst and the reaction conditions (Kantor 1986). Engell and Klatt (1993) describe a specific example of the above chemical reaction in which cyclopentenol (B) is obtained from cyclopentadiene (A) by acid-catalyzed electrophilic addition of water in dilute solution. The undesired by-products are dicyclopentadiene (D) and cyclopentanediol (C). The reaction kinetics employ rate expressions for k1, k2, and k3 that obey the Arrhenius equation. The coolant dynamics can be neglected, leading to a third-order nonlinear dynamic system. This gives rise to a single-input, single-output (SISO) system in which the concentration of the desired product B is controlled by manipulating the inflow of A to the reactor. The differential equations describing the system and the particular parameters used in the simulations here are given in Engell and Klatt (1993).
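For orientation, the isothermal core of the van de Vusse mass balances has the following standard form; treating the system as two states (the full benchmark adds an energy balance and Arrhenius rate laws) and the rate-constant values shown are simplifying assumptions of ours, not the parameters of Engell and Klatt (1993).

```python
def vandevusse_rhs(c_a, c_b, dilution, c_a0, k1=50.0, k2=100.0, k3=10.0):
    """Isothermal van de Vusse mass balances (illustrative rate constants, 1/h).

    dilution: feed flow over reactor volume F/V (the manipulated input);
    c_a0: inlet concentration of A (the disturbance in the regulator case).
    """
    dca = dilution * (c_a0 - c_a) - k1 * c_a - k3 * c_a ** 2
    dcb = -dilution * c_b + k1 * c_a - k2 * c_b
    return dca, dcb
```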
Table 1: Transfer Function Gains, Zeros, and Poles.

Transfer function | K | z1 | z2 | p1 | p2 | p3
I   | −2.8711e-05 | −0.0074 | 0.0235 | −0.0357 | −0.0136 + 0.0056i | −0.0136 − 0.0056i
II  | −2.7708e-05 | −0.0059 | 0.0282 | −0.0333 | −0.0122 + 0.0047i | −0.0122 − 0.0047i
III | −2.6467e-05 | −0.0044 | 0.0328 | −0.0304 | −0.0107 + 0.0037i | −0.0107 − 0.0037i
IV  | −2.5000e-05 | −0.0029 | 0.0365 | −0.0267 | −0.0091 + 0.0026i | −0.0091 − 0.0026i
V   | −2.3453e-05 | −0.0014 | 0.0368 | −0.0219 | −0.0072 + 0.0015i | −0.0072 − 0.0015i
VI  | −2.2592e-05 | −0.0002 | 0.0307 | −0.0165 | −0.0053 + 0.0009i | −0.0053 − 0.0009i
IMC Controller Design. The IMC design methodology was chosen to obtain the bank of controllers due to its relative ease of derivation and tuning (Morari & Zafiriou 1989). The CSTR dynamics were linearized around six equally spaced points in the operating region of 0.8 to 1.0 mol/l product concentration CB (Engell & Klatt 1993). The resulting six transfer functions were of the type

\[ G(s) = K\, \frac{(s - z_1)(s - z_2)}{(s - p_1)(s - p_2)(s - p_3)} . \tag{3.3} \]
The zero z_2 is in the right half of the s-plane, while the other zero and the poles are all in the left half-plane (see Table 1). Integral square error (ISE) optimal IMC controllers were derived for the six transfer functions, and first-order filters were appended to them. These are equivalent to classical feedback controllers of the form

\[ C(s) = -\frac{1}{K}\, \frac{(s - p_1)(s - p_2)(s - p_3)}{s\,(s - z_1)\,(\lambda s + \lambda z_2 + 2)} . \tag{3.4} \]
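As an illustration, the following fragment assembles the C(s) of equation 3.4 for channel I of Table 1 as numerator/denominator polynomial coefficients; numpy's poly/polymul helpers and the λ value quoted just below are the only assumptions beyond the table.

```python
import numpy as np

# channel I of Table 1
K = -2.8711e-05
z1, z2 = -0.0074, 0.0235
poles = [-0.0357, -0.0136 + 0.0056j, -0.0136 - 0.0056j]
lam = 20.0  # IMC filter time constant (seconds), as chosen in the text

# C(s) = -(1/K) (s-p1)(s-p2)(s-p3) / [ s (s-z1) (lam*s + lam*z2 + 2) ]
num = -np.real(np.poly(poles)) / K                  # (s-p1)(s-p2)(s-p3) scaled by -1/K
den = np.polymul(np.poly([0.0, z1]),                # s (s - z1)
                 np.array([lam, lam * z2 + 2.0]))   # lam*s + lam*z2 + 2
```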
The six IMC-type controllers were placed in the "regulator" block of the feedback control loop of Figure 7. The IMC filter time constant λ was set to 20 seconds.

Simulation Results. The IMC-type controllers in the regulator block are dynamically scheduled by a 9 × 6 network of nine first-order and six second-order neurons placed in the scheduler block. The simulation is implemented using MATLAB software. Four simulation runs are presented here: three with set point changes in the concentration of B (servo problem) and one with a disturbance input in the inlet concentration of A (regulator problem). The parameters for the CSTR model, the set point changes, and the disturbance input were chosen to be identical to those used by Engell and Klatt (1993) to facilitate a direct comparison. Their approach was based on traditional static gain scheduling. Their simulation results are reproduced here and are characterized by overshoots of up to 60% and response times of about 0.15 to 0.30 hour.
Figure 9: Servo control for a step set point change in CB from 0.85 to 0.95 mol/l; CAO is 5.1 mol/l.
The first dynamic scheduled control simulation using the network of neurons and IMC-type controllers is for a step change in the set point of CB from 0.85 to 0.95 mol/l. Figure 9 shows that CB rises to 0.95 mol/l with a response time of about 0.1 hour and with no significant overshoot. This is distinctly better than the static scheduling approach, which had an overshoot of about 40%. The second simulation is for a set point change in CB from 0.85 to 0.86 mol/l. The response time was around 0.1 hour, with no appreciable overshoot (see Figure 10). This compares favorably with the static scheduling approach, which had a response time of 0.3 hour and around 60% overshoot. In addition, the static scheduling control displayed excessive oscillations. Figure 11 shows the case of a set point change in CB from 0.95 to 0.85 mol/l. It shows an inverse response, which is larger than that of the static scheduling approach; however, it is improved in the sense that it has no overshoot and a smaller response time (see Table 2). Finally, Figure 12 shows the simulation of the regulator problem, in which the control loop attempts to reject the step disturbance in CAO from 4.5 to 5.7 mol/l. The result is better in the dynamic scheduled control case; there is no overshoot again, and the response times are comparable.

Discussion. A characteristic feature of the dynamic scheduled control simulations is that the controlled output trajectories are not smooth. This is a direct outcome of employing neurons in the scheduling network.
Figure 10: Servo control for a step set point change in CB from 0.85 to 0.86 mol/ℓ; CAO is 5.7 mol/ℓ.
Figure 11: Servo control for a step set point change in CB from 0.95 to 0.85 mol/ℓ; CAO is 4.5 mol/ℓ.
Table 2: Summary of Results for the Nonlinear CSTR Case Study.

Magnitude of input    Settling time (h)              Overshoot (%)
                      Static       Dynamic           Static       Dynamic
                      scheduling   scheduling        scheduling   scheduling
0.10 set point        0.16         0.1               40           0
0.01 set point        0.30         0.1               60           0
−0.10 set point       0.15         0.1               3            0
0.12 disturbance      0.15         0.15              —            0
Figure 12: Regulator control for a step disturbance in CAO from 4.5 to 5.7 mol/ℓ; set point CB is 0.9 mol/ℓ.
of the neurons is not smooth but is composed of spikes and intermediate durations of inactivity. Smooth process curves are desirable, and future work will incorporate a low-pass filter to accomplish this. The individual linear controllers employed in the regulator block are themselves guaranteed to be stable due to the IMC methodology used. Around the steady states for which each of them is designed, they provide stable closed-loop control, optimal in the integral square error sense. However, away from the steady states and during transitions, stability and performance would need to be inferred from extensive simulations or robustness analysis.
Overall, the dynamic scheduled control scheme performed better than the traditional static gain scheduled approach employed by Engell and Klatt (1993), especially in terms of reduced process overshoot. Conceptually, this is understandable, because the dynamic scheduler compensates for a controlled variable that is changing quickly and therefore prevents an overshoot. The simulations demonstrate the viability of the dynamic scheduled control scheme for process control problems. The next step would be to apply the algorithm to a complex nonlinear control problem of industrial relevance.

4 Conclusion and Future Work

This article consists of two main parts. The first part describes models of neurons and networks and their application in simulation of the baroreceptor reflex. Dynamics of the single neuron model were analyzed by simulations, which revealed a remarkable similarity of the model responses to the static and dynamic characteristics of the mammalian baroreceptors. A network of first-order and second-order neurons was simulated, and the static and dynamic receptive fields of the second-order neurons were traced. This analysis provided a basis for building a simple network model of the baroreceptor reflex. The proposed neuronal network model exhibits control of blood pressure disturbance in a fashion similar to that observed experimentally (Doyle III et al. 1997). These simulation and analysis results are the first of their kind in modeling efforts of the baroreceptor reflex.

The second part of the article reports on the application of the network model of the baroreflex to nonlinear process control problems. The dynamic scheduling activity of the network is exploited in two control architectures. Control structure I is directly abstracted from the baroreceptor reflex model and is demonstrated on a nonlinear tank level control problem. Control structure II is developed to provide tools that are necessary for meaningful application in process control. This includes the introduction of a set point and use of the feedback error signal. The control structure is illustrated on a nonlinear CSTR with van de Vusse kinetics. The simulation results show better control than traditional gain scheduling, especially in terms of reduced process overshoot. The two case studies validate the usefulness of the dynamic scheduled control approach.

A few comments on the issue of closed-loop stability for the dynamic scheduled control algorithm are relevant at this point. As is the case with classical gain scheduling, only local stability can be easily guaranteed, and global stability is inferred by extensive simulations. In general, global stability can be guaranteed under the condition of sufficiently slow variations in the scheduling variable (Shamma & Athans 1992), but these bounds are difficult to obtain and usually very conservative. In light of the fact that the dynamic scheduled control approach implicitly accounts for fast changes in the scheduling variable, it is plausible that the related bounds on the
scheduling variable rate of change would be less conservative. Also note that for nonlocal operation of the controllers, their net contribution is a weighted (between 0 and 1) sum of the individual control actions. Thus, the weights themselves will not lead to unbounded action and the associated instability. A rigorous proof of stability during these transitions shall be obtained with potentially conservative conic sector bounding (Doyle III et al. 1989) in future work. The formulation of such bounds will be facilitated by the structured form of the dynamic scheduling controller, which is a weighted contribution of several linear controllers. The nonlinearity arises from the dynamic nature of the weighting function. Since each weighting function w_i is bounded (between 0 and 1), it can be cone bounded using a time-varying gain Δ_i as:

\dot{x}_i = A_{0i} x_i + \Delta_i A_{1i} x_i + b_i u_i
y_i = c_i x_i.   (4.1)
If the process being controlled can be similarly conic sector bounded using a time-varying gain, say, Δ_p, then it would be possible to combine all the time-varying gains in the closed loop in the standard M–Δ structured uncertainty form (Doyle 1982). Using extensions for time-varying and nonlinear uncertainty (Doyle III et al. 1989), conservative stability conditions could be derived.

The success of the dynamic gain scheduling algorithm is primarily due to the fact that while the traditional scheduled control approaches are built on the slowly varying scheduling variable assumption, the current approach employs an explicitly dynamic scheduling mapping. In the neuronal network implementation of the dynamic scheduled control algorithm, the dynamic scheduling is realized by the dynamic nature of the neurons, especially the dP/dt sensitivity of the first-order neurons. Note that although neural net–based scheduling schemes (such as a radial basis function neural net) would lead to a simpler controller by avoiding the complex neuron dynamics, these would typically implement a static scheduling controller. Although dynamics could be introduced in the net by using feedback, the nodes are nevertheless static operators. The neuronal network used in this article employs dynamic neuronal blocks or nodes and is therefore inherently dynamic. On the other hand, not all of the complex neuron dynamics may contribute to the performance of the network controller. For the actual control of nonlinear process systems, the use of such neuronal networks may be unrealistic. Therefore, the underlying principles of the dynamic scheduled control exhibited by the above neuronal network need to be extracted and implemented in a nonlinear control framework (Kwatra & Doyle III 1995). Through continued collaborative research activity, future work will seek to (1) increase the level of complexity in the neuron models to include multiple channel kinetics, (2) refine the system level models such as the model of the cardiovascular system, (3) analyze robustness properties of the pro-
posed dynamic scheduled control architectures, and (4) extract the underlying principles of the dynamic scheduled control exhibited by the neuronal network and apply them to nonlinear control theory.

Acknowledgments

The authors gratefully acknowledge support from the National Science Foundation (grant BCS-9315738).

References

Åström, K. J., & Hägglund, T. (1990). Practical experiences of adaptive techniques. Proc. of American Control Conference, San Diego (pp. 1599–1606).
Åström, K. J., & Wittenmark, B. (1989). Adaptive control (2nd ed.). Reading, MA: Addison-Wesley.
Bradd, J., Dubin, J., Due, B., Miselis, R. R., Monitor, S., Rogers, W. T., Spyer, K. M., & Schwaber, J. S. (1989). Mapping of carotid sinus inputs and vagal cardiac outputs in the rat. Soc. Neurosci. Abstr., 15, 593.
Brown, A. M. (1980). Receptors under pressure: An update on baroreceptors. Circulation Research, 46(1), 1–10.
Brown, A., Saum, W., & Tuley, F. (1976). A comparison of aortic baroreceptor discharge in normotensive and spontaneously hypertensive rats. Circulation Research, 39, 488–496.
Czachurski, J., Dembowsky, K., Seller, H., Nobiling, R., & Taugner, R. (1988). Morphology of electrophysiologically identified baroreceptor afferents and second order neurons in the brainstem of the cat. Arch. Ital. Biol., 126, 129–144.
Donoghue, S., Garcia, M., Jordan, D., & Spyer, K. (1982). Identification and brainstem projections of aortic baroreceptor afferent neurons in nodose ganglia of cats and rabbits. J. Physiol. Lond., 322, 337–352.
Doyle III, F. J., & Morari, M. (1992). A survey of approximate linearization approaches for chemical reactor control. DYCORD+ 92, IFAC, pp. 207–212.
Doyle III, F. J., Henson, M. A., Ogunnaike, B. A., Schwaber, J. S., & Rybak, I. (1997). Neuronal modeling of the baroreceptor reflex with applications in process modeling and control. In D. Elliott (Ed.), Neural networks for control. New York: Springer-Verlag.
Doyle III, F. J., Kwatra, H., Rybak, I., & Schwaber, J. S. (1994). A biologically-motivated dynamic nonlinear scheduling algorithm for control. Proc. of the American Control Conference, Baltimore (pp. 92–96).
Doyle III, F., Packard, A., & Morari, M. (1989). Robust controller design for a nonlinear CSTR. Chem. Eng. Sci., 44, 1929–1947.
Doyle, J. C. (1982). Analysis of feedback systems with structured uncertainties. IEE Proc. D, 129(6), 242–250.
Engell, S., & Klatt, K. (1993). Nonlinear control of a non-minimum-phase CSTR. Proc. of the American Control Conference, San Francisco (pp. 2941–2945).
Gulaian, M., & Lane, J. (1990). Titration curve estimation for adaptive pH control. Proc. of American Control Conference, San Diego (p. 1414).
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hunt, K., Sbarbaro, D., Zbikowski, R., & Gawthrop, P. (1992). Neural networks for control systems—a survey. Automatica, 28(6), 1083–1112.
Isidori, A. (1989). Nonlinear control systems—An introduction (2nd ed.). New York: Springer-Verlag.
Kantor, J. C. (1986). Stability of state feedback transformations for nonlinear systems—some practical considerations. Proc. of the American Control Conference, Seattle (pp. 1014–1016).
Karemaker, J. M. (1987). Neurophysiology of the baroreceptor reflex. In R. I. Kitney & O. Rompelman (Eds.), The beat-by-beat investigation of cardiovascular function (pp. 27–49). Oxford: Clarendon Press.
Kirchheim, H. (1976). Systemic arterial baroreceptor reflexes. Physiological Reviews, 56(1), 100–176.
Kumada, M., Terui, N., & Kuwaki, T. (1990). Arterial baroreceptor reflex: Its central and peripheral neural mechanisms. Prog. Neurobiol., 35, 331–361.
Kwatra, H. S., & Doyle III, F. J. (1995). Dynamic gain scheduled control. AIChE Annual Meeting, Miami Beach. Manuscript submitted.
Loewy, A., & Spyer, K. (1990). Vagal preganglionic neurons. In A. D. Loewy & K. M. Spyer (Eds.), Central regulation of autonomic functions (pp. 68–87). New York: Oxford University Press.
Luyben, W. L. (1990). Process modeling, simulation, and control for chemical engineers. New York: McGraw-Hill.
Morari, M., & Zafiriou, E. (1989). Robust process control. Englewood Cliffs, NJ: Prentice-Hall.
Reichert, R. (1992). Dynamic scheduling of modern-robust-control autopilot designs for missiles. IEEE Control Systems Magazine, 12, 35–42.
Rogers, R. F., Paton, J. F. R., & Schwaber, J. S. (1993). NTS neuronal responses to arterial pressure and pressure changes in the rat. Am. J. Physiol., 265, R1355–R1368.
Schwaber, J. (1987). Neuroanatomical substrates of cardiovascular and emotional-autonomic regulation. In A. Magro, W. Osswald, D. Reis, & P. Vanhoutte (Eds.), Central and peripheral mechanisms in cardiovascular regulation (pp. 353–384). New York: Plenum Press.
Schwaber, J., Graves, E., & Paton, J. (1993). Computational modeling of neuronal dynamics for systems analysis: Application to neurons of the cardiorespiratory NTS in the rat. Brain Research, 604, 126–141.
Shamma, J. S., & Athans, M. (1992). Gain scheduling: Potential hazards and possible remedies. IEEE Control Systems Magazine, 12, 101–107.
Shinskey, F. G. (1977). Distillation control: For productivity and energy conservation. New York: McGraw-Hill.
Spyer, K. (1994). Central nervous mechanisms contributing to cardiovascular control. J. Physiol., 474.1, 1–19.
van de Vusse, J. (1964). Plug-flow type reactor versus tank reactor. Chem. Eng. Sci., 19, 994–997.
Received August 7, 1995; accepted May 28, 1996.
NOTE
Communicated by Michael Hines
Conductance-Based Integrate-and-Fire Models

Alain Destexhe
Department of Physiology, Laval University School of Medicine, Québec, G1K 7P4, Canada
A conductance-based model of Na+ and K+ currents underlying action potential generation is introduced by simplifying the quantitative model of Hodgkin and Huxley (HH). If the time course of the rate constants can be approximated by a pulse, the HH equations can be solved analytically. Pulse-based (PB) models generate action potentials very similar to those of the HH model but are computationally faster. Unlike the classical integrate-and-fire (IAF) approach, they take into account the changes of conductances during and after the spike, which have a determinant influence in shaping neuronal responses. Similarities and differences among PB, IAF, and HH models are illustrated for three cases: high-frequency repetitive firing, spike timing following random synaptic inputs, and network behavior in the presence of intrinsic currents.
1 Introduction

Over 40 years ago, Hodgkin and Huxley introduced a remarkably successful quantitative description of action potentials (Hodgkin & Huxley, 1952). More recent techniques revealed that transmembrane ionophores underlie ionic currents and that the assumptions of the Hodgkin-Huxley (HH) model were essentially correct if reinterpreted in the light of current knowledge on the activation of ion channels (see Hille, 1992). Although Markov kinetic models describe the properties of sodium channels more adequately (Vandenberg & Bezanilla, 1991), the success of the HH model is due to its "observability," as its parameters are easily evaluated from voltage-clamp experiments. It remains the most widely used description of voltage-dependent conductances.

The drawback of using HH models is that they require solving a set of several first-order differential equations in addition to cable equations. Moreover, numerical integration of these equations can require relatively small integration steps due to the nonlinearity and fast kinetics of the Na+ and K+ currents underlying action potentials. This fine integration step is then imposed on the whole system, which may have slower currents that do not require such a high temporal resolution, and may result in considerable computation time for simulating large-scale neuronal networks.
An alternative approach was to simplify the process of action potential generation by representing the neuron as an integrate-and-fire (IAF) device (see Arbib, 1995). In this case, the neuron produces a spike when its membrane potential crosses a given threshold value, and the membrane is instantaneously reset to its resting level. Although this type of model has been extensively used for representing large-scale neuronal networks, an important drawback is that it neglects the variations of Na+ and K+ conductances, which may have important effects in spike generation (Mainen, 1995). Moreover, central neurons have been shown to contain a large spectrum of other intrinsic currents, which have a determinant effect on firing behavior (Llinás, 1988). In this case, the IAF paradigm is clearly inappropriate.

In this article, we propose a simplified model of action potentials based on HH equations. Besides its similarity with IAF models, this method is shown to capture the conductance changes due to action potentials, which have a determinant influence on electrophysiological behavior.

2 Pulse-Based Models for Voltage-Dependent Currents

We start from the equations of Hodgkin and Huxley (1952):

C_m \dot{V} = -g_{leak} (V - E_{leak}) - \bar{g}_{Na}\, m^3 h\, (V - E_{Na}) - \bar{g}_{Kd}\, n^4 (V - E_K)
\dot{m} = \alpha_m(V)\,(1 - m) - \beta_m(V)\, m
\dot{h} = \alpha_h(V)\,(1 - h) - \beta_h(V)\, h
\dot{n} = \alpha_n(V)\,(1 - n) - \beta_n(V)\, n   (2.1)
where V is the membrane potential, C_m = 1 µF/cm² is the membrane capacitance, and g_leak = 0.1 mS/cm² and E_leak = −70 mV are the leak conductance and reversal potential, respectively. \bar{g}_{Na} = 100 mS/cm² and \bar{g}_{Kd} = 80 mS/cm² are the maximal conductances of the sodium current and delayed rectifier, with reversal potentials of E_Na = 50 mV and E_K = −90 mV. m, h, and n are the activation variables, whose time evolution depends on the voltage-dependent rate constants α_m, β_m, α_h, β_h, α_n, and β_n. The voltage-dependent expressions of the rate constants were as in the version described by Traub and Miles (1991).

The behavior of the HH model in a single compartment cell during a train of action potentials is shown in Figure 1 (left panel). As a consequence of the rapid changes in membrane voltage during action potentials, the time course of the rate constants exhibits sharp transitions. This is best illustrated by the behavior of α_m: the function α_m(V) is approximately zero for the whole range of membrane potentials below ∼ −50 mV, then increases progressively with depolarization; consequently, before and after the action potential, α_m ∼ 0, and α_m takes positive values only during the spike (see Fig. 1).
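The pulse-like time course of the rate constants is easy to reproduce numerically. The sketch below integrates equation 2.1 with forward Euler and tracks α_m(V) along the trajectory. Note the hedges: it uses the classic Hodgkin-Huxley (1952) rate expressions (in the modern membrane-potential convention) rather than the Traub-Miles variant used in the article, and the injected current is an arbitrary choice.

```python
import numpy as np

# Forward-Euler integration of equation 2.1. The rate functions are the
# classic Hodgkin-Huxley (1952) expressions, not the Traub-Miles variant
# used in the article; passive parameters follow the text.

def vtrap(x, y):
    # avoids the 0/0 singularity of x / (1 - exp(-x/y)) at x = 0
    return y if abs(x) < 1e-9 else x / (1.0 - np.exp(-x / y))

def alpha_m(V): return 0.1 * vtrap(V + 40.0, 10.0)
def beta_m(V):  return 4.0 * np.exp(-(V + 65.0) / 18.0)
def alpha_h(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
def beta_h(V):  return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
def alpha_n(V): return 0.01 * vtrap(V + 55.0, 10.0)
def beta_n(V):  return 0.125 * np.exp(-(V + 65.0) / 80.0)

Cm, g_leak, E_leak = 1.0, 0.1, -70.0        # uF/cm^2, mS/cm^2, mV
gNa, ENa, gKd, EK = 100.0, 50.0, 80.0, -90.0
dt, I_inj = 0.01, 10.0                      # ms, uA/cm^2 (arbitrary drive)
V, m, h, n = -70.0, 0.05, 0.6, 0.3
for step in range(int(50.0 / dt)):          # 50 ms of activity
    I_ion = (g_leak * (V - E_leak) + gNa * m**3 * h * (V - ENa)
             + gKd * n**4 * (V - EK))
    V += dt * (I_inj - I_ion) / Cm
    m += dt * (alpha_m(V) * (1.0 - m) - beta_m(V) * m)
    h += dt * (alpha_h(V) * (1.0 - h) - beta_h(V) * h)
    n += dt * (alpha_n(V) * (1.0 - n) - beta_n(V) * n)
    # alpha_m(V) stays near zero below about -50 mV and pulses during spikes
```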
Figure 1: Comparison of action potential generation in Hodgkin-Huxley and pulse-based models. A train of spikes was generated by injecting a depolarizing current pulse (2 nA) into a single compartment model (area of 15,000 µm²; \bar{g}_{Kd} = 30 mS/cm²; other parameters as in equation 2.1). The time course of the rate constants (α_m, β_m, α_h, β_h, α_n, and β_n) and state variables (m, h, and n) is shown for each panel. In the Hodgkin-Huxley model (left panel), rate constants undergo sharp transitions when a spike occurs. Approximating these transitions by pulses (right panel) led to a simplified representation of spike-generating mechanisms (threshold of −53 mV, pulse duration of 0.6 ms, rate values as in equation 2.4).
The basis of the simplified model is to approximate the time course of the rate constants by a pulse triggered when the membrane potential crosses a
given threshold. For example, α_m will be assigned a positive value during the pulse and will be zero otherwise. Generalizing for other rate constants, before and after the pulse, we have:

\alpha_m = 0, \quad \beta_m = \beta_M
\alpha_h = \alpha_H, \quad \beta_h = 0
\alpha_n = 0, \quad \beta_n = \beta_N   (2.2)

whereas during the pulse:

\alpha_m = \alpha_M, \quad \beta_m = 0
\alpha_h = 0, \quad \beta_h = \beta_H
\alpha_n = \alpha_N, \quad \beta_n = 0.   (2.3)
Here, α_M, β_M, α_H, β_H, α_N, and β_N are constant values. They are estimated from the value of the voltage-dependent expressions at hyperpolarized and depolarized potentials. Choosing −70 mV and +20 mV and using Traub and Miles's formulation gives:

\alpha_M = \alpha_m(20) \simeq 22~\mathrm{ms}^{-1}
\beta_M = \beta_m(-70) \simeq 13~\mathrm{ms}^{-1}
\alpha_H = \alpha_h(-70) \simeq 0.5~\mathrm{ms}^{-1}
\beta_H = \beta_h(20) \simeq 4~\mathrm{ms}^{-1}
\alpha_N = \alpha_n(20) \simeq 2.2~\mathrm{ms}^{-1}
\beta_N = \beta_n(-70) \simeq 0.76~\mathrm{ms}^{-1}.   (2.4)
It is then straightforward to solve equation 2.1, leading to the following expressions. Before and after the pulse, the state variables are given by:

m(t) = m_0 \exp[-\beta_M (t - t_0)]
h(t) = 1 + (h_0 - 1) \exp[-\alpha_H (t - t_0)]
n(t) = n_0 \exp[-\beta_N (t - t_0)]   (2.5)

where t_0 is the time at which the last pulse ended, and m_0, h_0, and n_0 are the values of m, h, and n at that time. During the pulse, the state variables are given by:

m(t) = 1 + (m_0 - 1) \exp[-\alpha_M (t - t_0)]
h(t) = h_0 \exp[-\beta_H (t - t_0)]
n(t) = 1 + (n_0 - 1) \exp[-\alpha_N (t - t_0)].   (2.6)
Here, t_0 is the time at which the pulse began, and m_0, h_0, and n_0 are the values of m, h, and n at that time. Using this procedure, the simplified model generated action potentials similar to the HH model (see Fig. 1, right panel). The time course of the activation variables m, h, and n was also comparable in both models. The optimal values of parameters were estimated by fitting expressions 2.5 and 2.6 directly to the original HH model. Using a simplex procedure (Press et al., 1986), and starting from different initial values of the parameters, the fitting led to a unique estimate of the pulse duration (0.6 ± 0.08 ms) and of the membrane threshold (−50.1 ± 0.3 mV; other parameters were as in Fig. 1).

The computational efficiency of the pulse-based mechanism is essentially due to the direct estimation of the variables m, h, and n from the membrane potential (see equations 2.5 and 2.6). Another factor providing further acceleration is that outside the pulse, an expression similar to equation 2.5 can be written for m³ and n⁴, which greatly reduces the number of multiplications since no exponentiation is calculated.

3 Comparison of Pulse-Based Models with Other Mechanisms

The important point of pulse-based (PB) models is that they provide a good approximation of the dynamical properties of firing behavior described by the HH mechanism. These properties were tested in three ways: repetitive firing, random firing, and network behavior.

First, the optimized PB model was compared to HH and IAF mechanisms following injection of current pulses of increasing amplitudes (see Fig. 2). Pulse-based models gave an excellent approximation of the interspike intervals at different firing frequencies, the spike shape, and the rise and decay of the membrane potential. On the other hand, the IAF model taken in the same conditions gave a poor approximation (see Fig. 2). For IAF models, the threshold or the absolute refractory period affected the frequency of firing, but variations of these parameters failed to generate the correct firing frequencies.

Second, PB models were tested in the case of random synaptic bombardment in a single-compartment cell having AMPA and GABA_A postsynaptic receptors (see Fig. 3). The conductance of synaptic currents was chosen such that most postsynaptic potentials were subthreshold, occasionally leading to firing. This subthreshold dynamics is surprisingly irrelevant for spiking, as PB and IAF mechanisms behaved very much like the HH model (see Fig. 3), with spike timing differences of 1.1 ± 1.8 ms (average ± standard deviation) and rare occurrences of nonmatching spikes. The precision in spike timing was very sensitive to the value of the threshold and slightly better for PB models.

Third, PB models were tested in network simulations having both intrinsic and synaptic currents, besides I_Na and I_Kd. The network considered was a model of spindle oscillations in interconnected thalamocortical (TC) and
Figure 2: Comparison of repetitive firing with three different models. A single compartment cell (same parameters as in equation 2.1) was simulated with injection of three depolarizing current pulses of increasing amplitudes (1 nA, 2 nA, and 4 nA). Top trace: Simulation using the original Hodgkin-Huxley model. Middle trace: Simulation using pulse-based equations (identical parameters as in Figure 1, right). Bottom trace: Simple integrate-and-fire model (the spike consisted of a sudden rise to 50 mV, immediately followed by a reset to −90 mV and an absolute refractory period of 1.5 ms; same threshold as the pulse-based mechanism). All models had identical passive properties.
thalamic reticular (RE) neurons (Destexhe, Bal, McCormick, & Sejnowski, 1996). These neurons have a set of intrinsic sodium, potassium, calcium, and cationic currents, allowing the cell to produce different types of bursts of action potentials (see Steriade, McCormick, & Sejnowski, 1993). Although single TC and RE cells do not necessarily generate sustained oscillations, interconnected TC and RE cells with excitatory (AMPA-mediated) and inhibitory (GABAergic) synapses can display spindle oscillations, as seen from
Figure 3: Comparison of pulse-based and original Hodgkin-Huxley models using random synaptic bombardment. Top panel: Single compartment model (same parameters as in equation 2.1) with 100 excitatory (AMPA-mediated) and 10 inhibitory (GABA_A-mediated) synaptic inputs. Synaptic currents were described by kinetic models (Destexhe et al., 1997) and had maximal conductances of 1.5 and 6 nS for each simulated AMPA- and GABA_A-mediated contact, respectively. Each synaptic input was random (Poisson distributed) with a mean rate of 50 Hz. Middle panel: Same simulation with action potentials generated using pulse-based models. Bottom panel: Same simulation using integrate-and-fire mechanisms.
in vivo and in vitro recordings (Steriade et al., 1993). This behavior was successfully reproduced by the model using either HH equations or PB models (see Fig. 4). As in the case of synaptic bombardment, action potential timing
Figure 4: Comparison of simplified and original Hodgkin-Huxley models using simulations of circuits of neurons. Left panel: Model of spindle oscillations in a network of 16 thalamic neurons. Eight thalamocortical (TC) cells projected to 8 thalamic reticular (RE) neurons using AMPA-mediated contacts, whereas RE cells contacted themselves using GABA_A receptors and TC cells using a mixture of GABA_A and GABA_B receptors. Each neuron had Na+, K+, Ca2+, and cationic voltage-dependent currents described by Hodgkin-Huxley–type equations, whose parameters were obtained from voltage-clamp and current-clamp data (Destexhe et al., 1996). One cell of each type is represented during a spindle sequence. Middle panel: Same simulation using pulse-based mechanisms for action potentials. Right panel: Same simulation using integrate-and-fire models.
was not rigorously identical in the two models, but spindle oscillations had the correct frequency and phase relationships between cells. On the other hand, the high-frequency firing of TC and RE cells was not well captured by IAF mechanisms, which gave rise to different oscillation frequencies at the network level (see the right panel of Fig. 4). Finally, the computational performance was compared for different mechanisms of spike generation in a single compartment cell. Table 1 shows that the PB method is significantly faster than differential equations (DE). According to the algorithms available with the NEURON simulator (Hines, 1993), the integration of HH equations can be optimized (DEo) by using precalculated rate constants, tabulated as a function of V. Although these optimized HH equations led to substantial acceleration (DEo vs. DE), they were not as efficient as PB models, which estimate m, h, and n directly from V. Simple IAF models showed the highest performance.
Table 1: Computational Performance of Different Methods for Generating Action Potentials.

Method   NEURON (relative CPU time)   Minimal C code (relative CPU time)
DE             1                            1
DEo            0.32                         0.45
PB             0.25                         0.17
IAF            0.23a                        0.08

Note: A single-compartment cell comprising passive currents, 1 nA injected current, and a model for action potentials was simulated. The different models for action potentials were: DE, Hodgkin-Huxley model described by differential equations; DEo, the same equations solved using optimized algorithms; PB, pulse-based model; IAF, simple integrate-and-fire model. A backward Euler integration scheme was used with a time step of 0.025 ms over a total simulated time of 50 seconds. The equations were solved using either the NEURON simulator (Hines, 1993) or directly in C code (minimal code for solving the equations without graphic interface; the exact same algorithms and optimizations as in NEURON were used). Codes are available on request.
a. IAF models could not be implemented optimally in NEURON.
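For concreteness, a minimal sketch of the PB update follows. It uses the rate values of equation 2.4 and the threshold (−53 mV) and pulse duration (0.6 ms) of Figure 1; the test waveform driving the threshold logic is an arbitrary stand-in for a membrane equation, so this is an illustration of the gate update, not a complete neuron model.

```python
import numpy as np

# Pulse-based gate update (equations 2.2-2.6): m, h, n are obtained in
# closed form, so no differential equation is solved for the gates.
aM, bM, aH, bH, aN, bN = 22.0, 13.0, 0.5, 4.0, 2.2, 0.76  # ms^-1 (eq. 2.4)
THRESH, PULSE = -53.0, 0.6                                # mV, ms (Fig. 1)

def pb_gates(t, t0, m0, h0, n0, in_pulse):
    """m, h, n at time t; t0 is when the pulse began (or ended) and
    m0, h0, n0 are the gate values at that instant."""
    dt = t - t0
    if in_pulse:                                   # equation 2.6
        return (1.0 + (m0 - 1.0) * np.exp(-aM * dt),
                h0 * np.exp(-bH * dt),
                1.0 + (n0 - 1.0) * np.exp(-aN * dt))
    else:                                          # equation 2.5
        return (m0 * np.exp(-bM * dt),
                1.0 + (h0 - 1.0) * np.exp(-aH * dt),
                n0 * np.exp(-bN * dt))

# Driver with an assumed test waveform for V(t); a full model would impose
# the dead time mentioned in Section 4 and couple the conductances back
# into the membrane equation.
t0, gates, in_pulse, dt = 0.0, (0.0, 1.0, 0.0), False, 0.01
for k in range(5000):
    t = k * dt
    V = -70.0 + 30.0 * ((t / 5.0) % 1.0)           # illustrative V(t), mV
    if (not in_pulse and V >= THRESH) or (in_pulse and t - t0 >= PULSE):
        gates, t0, in_pulse = pb_gates(t, t0, *gates, in_pulse), t, not in_pulse
    m, h, n = pb_gates(t, t0, *gates, in_pulse)
    gNa, gKd = 100.0 * m**3 * h, 80.0 * n**4       # conductances, mS/cm^2
```

The only state carried between steps is (t0, m0, h0, n0) and the pulse flag, which is the source of the speedup reported in Table 1.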
4 Discussion

This article has presented a novel method to approximate the dynamics of action potentials. The model was based on the HH equations in which the time course of rate constants was approximated by a short pulse. In a previous article (Destexhe, Mainen, & Sejnowski, 1994a), similar PB mechanisms were introduced to model synaptic currents. The basis of the synaptic model was that artificially applied pulses of agonist gave rise to postsynaptic currents with a time course very similar to the intact synapse (e.g., Colquhoun, Jonas, & Sakmann, 1992). PB models of synaptic currents offered the advantages of being simple and fast, and they could be easily fit to experimental data. In the case reported in this article, the pulse approximation of HH equations was based on the observation that rate constants undergo sharp transitions during action potentials (see Fig. 1), also leading to analytic resolution and therefore fast algorithms that can be fit easily to any template. In principle, the same approximation could be considered for other types of channels, under the important condition that rate constants display sharp transitions. A possible example would be Ca2+-dependent channels, assuming that an action potential triggers a short pulse of intracellular Ca2+ concentration due to high-threshold Ca2+ currents.

Are PB models different from conventional IAF mechanisms? Both methods share a fixed voltage threshold as well as a dead time following spikes to represent the absolute refractory period. PB models, however, significantly
differ from IAF models because they take into account the decay of the variables m, h, and n in between spikes, which has a determinant influence on repetitive firing at high frequencies. In IAF models, the membrane is reset instantaneously after a spike to a hyperpolarized level. In PB models, the reset occurs through the activation of the delayed-rectifier current, which has an effect over a longer period of time due to the relatively slow decay of the variable n. This is the main factor that affected the frequency of firing in Figure 2. However, we cannot exclude the possibility that more sophisticated IAF models (with a relative refractory period or a variable threshold) would reproduce these properties, although this remains to be demonstrated.

Are PB models different from precalculated conductance waveforms? The model presented here is more sophisticated than simply pasting a conductance waveform, as the changes in the variables m, h, and n depend on their initial values at the time of the spike. Conductance changes are different during low- and high-frequency firing, and PB models propose simplified equations to capture these changes. This is similar to comparing alpha functions with PB models of synaptic currents: alpha functions do not provide correct postsynaptic summation, whereas it is well captured by PB models (see Destexhe, Mainen, & Sejnowski, 1994b).

When do PB models fail? It was shown here that PB models approximate HH equations well for high-frequency firing during injection of current pulses (see Fig. 2), following random synaptic inputs (see Fig. 3), or in network simulations where intrinsic and synaptic currents were present (see Fig. 4). However, as the spike is defined from a fixed threshold, PB models would not apply to cases with a significant subthreshold contribution of the HH mechanism in determining spike timing or if there are dynamical changes in the threshold. PB models cannot be applied to model various properties of sodium channels, such as the progressive spike inactivation during bursting, anode-break excitation, voltage-dependent recovery from inactivation, and more generally the response of these channels during voltage-clamp experiments. In these cases, spike generation should be modeled with the original HH equations or a more accurate mechanism.

How do PB models compare to other types of simplifications of HH equations? Several models were proposed to reduce HH equations to fewer variables (e.g., FitzHugh, 1961; Krinskii & Kokoz, 1973; Hindmarsh & Rose, 1982; Rinzel, 1985; Kepler, Abbott, & Marder, 1992). Another type of simplification was to start from detailed kinetic models for the gating of Na+ or K+ channels and then simplify this scheme until reaching the minimal number of states needed to account for the behavior of the current (Destexhe et al., 1994b). Pulse-based models are different from these previous approaches; their simplifying assumptions lead to a complete analytical resolution of HH equations, and therefore no differential equation must be solved (see also Miller, 1979).

How large is the increase of computational efficiency? PB models provide a substantial acceleration compared to HH equations, approaching the per-
formance of IAF models (see Table 1). However, this increase of efficiency could be minimal in the case of network simulations where action potentials represent only a small fraction of the computation time. On the other hand, if PB models could be used in conjunction with pulse-based mechanisms applied to other types of currents, such as synaptic currents (Destexhe et al., 1994a; Lytton, 1996), then this approach could provide a considerable acceleration of computation time.

How can PB models be improved? An obvious improvement would be to apply the PB approximation to a simplified model of action potentials, such as that of Hindmarsh and Rose (1982), leading to PB models with fewer variables. Another improvement would be to add a relative refractory period or a variable threshold, for example, by making the threshold depend on the value of the inactivation variable h. More optimal values of rate constants than those in equation 2.4 could also be obtained by fitting the model directly to HH equations using several templates, such as those of Figures 2 and 3.

In conclusion, PB models capture important features of the firing behavior as described by HH mechanisms, but they do not apply to all cases. This approach is intermediate between simple IAF and more detailed biophysical models and may facilitate building large-scale network simulations that incorporate the intrinsic electrophysiological properties of single cells.

Acknowledgments

All simulations were run on a Sun Sparc 20 workstation using the NEURON simulator (Hines, 1993) or using programs written in C. Research was supported by the MRC (Medical Research Council of Canada), and start-up funds came from the FRSQ (Fonds de la Recherche en Santé du Québec).

References

Arbib, M. (1995). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurons of rat hippocampal slices. J. Physiol., 458, 261–287.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1997). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Computational Neurosci., 1, 195–230.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophys. J., 1, 445–466.
Hille, B. (1992). Ionic channels of excitable membranes. Sunderland, MA: Sinauer Associates.
Hindmarsh, J. L., & Rose, R. M. (1982). A model of the nerve impulse using two first-order differential equations. Nature, 296, 162–164.
Hines, M. (1993). NEURON—A program for simulation of nerve equations. In F. H. Eeckman (Ed.), Neural systems: Analysis and modeling (pp. 127–136). Boston: Kluwer Academic Publishers.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Kepler, T. B., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biol. Cybernetics, 66, 381–387.
Krinskii, V. I., & Kokoz, Y. M. (1973). Analysis of the equations of excitable membranes—1. Reduction of the Hodgkin-Huxley equations to a second-order system. Biofizika, 18, 506–511.
Llinás, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: A new insight into CNS function. Science, 242, 1654–1664.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Computation, 8, 501–509.
Mainen, Z. F. (1995). Mechanisms of spike generation in neocortical neurons. Unpublished doctoral dissertation, University of California, San Diego.
Miller, R. M. (1979). A simple model of delay, block and one way conduction in Purkinje fibers. J. Math. Biol., 7, 385–398.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Fed. Proc., 44, 2944–2946.
Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Vandenberg, C. A., & Bezanilla, F. (1991). A model of sodium channel gating based on single channel, macroscopic ionic, and gating currents in the squid giant axon. Biophys. J., 60, 1511–1533.
Received January 25, 1995; accepted June 21, 1996.
NOTE
Communicated by William Lytton
A Simple Model of Transmitter Release and Facilitation

Richard Bertram
Pennsylvania State University, Division of Science, Erie, PA 16563 USA
We describe a model of synaptic transmitter release and presynaptic facilitation that is based on activation of release sites by single Ca2+ microdomains. Facilitation is due to Ca2+ that remains bound to release sites between impulses. This model is inherently stochastic, but deterministic equations can be derived for the mean release. The number of equations required to describe the mean release is small, so it is practical to use the model with models of neuronal electrical activity to investigate the effects of different input spike patterns on presynaptic facilitation. We use it in conjunction with a model of dopamine-secreting neurons of the basal ganglia to demonstrate that transmitter release is greater when the neuron bursts than when it spikes continuously, due to the greater facilitation generated by the bursting impulse pattern. Finally, a minimal form of the model is described that is coupled to simple models of postsynaptic receptors and passive membrane to compute the postsynaptic voltage response to a train of presynaptic stimuli. This form of the model is appropriate for neural network simulations.
1 Introduction

Several models have been proposed for synaptic transmitter release and presynaptic facilitation, the process whereby impulse-evoked release is enhanced for up to several hundred milliseconds if preceded by one or more conditioning impulses (Magleby, 1987). In the models of Zucker and Fogelson (1986) and Yamada and Zucker (1992), release sites and Ca2+ channels are spatially separated, and facilitation is due primarily to residual free Ca2+ left over from previous impulse-induced Ca2+ channel openings, so the diffusion of free Ca2+ is important. In an alternative model (Stanley, 1986; Bertram, Sherman, & Stanley, 1996), each transmitter release site is colocalized with a single Ca2+ channel. Release is then activated by the microdomain of Ca2+ surrounding the channel mouth, eliminating the need to perform Ca2+ diffusion calculations. This model is intended to describe release evoked by short depolarizations, such as action potentials, during which the opening of multiple Ca2+ channels in the neighborhood of a release site is unlikely.
In the single-domain model, each release site has four independent Ca2+ binding sites or gates, each of which must be bound for release to occur. Kinetics of the gates are graded from slow unbinding, high affinity to fast unbinding, low affinity. Facilitation is due to the accumulation of Ca2+ bound to one or more of the gates rather than to residual free Ca2+. This model was motivated by the findings that release sites can be activated by the Ca2+ microdomain under a single channel (Stanley, 1993) and that there is a steplike frequency dependence of facilitation associated with a steplike decline of Ca2+ cooperativity (Stanley, 1986). There is also considerable evidence that residual free Ca2+ does not play an important role in facilitation (Stanley, 1986; Blundon, Wright, Brodwick, & Bittner, 1993; Winslow, Duffy, & Charlton, 1994), although it does appear to play a role in other forms of enhancement that last for several seconds (augmentation) or several minutes (posttetanic potentiation) following a high-frequency conditioning tetanus (Delaney, Zucker, & Tank, 1989; Delaney & Tank, 1994).

2 The Mathematical Model

The binding of Ca2+ to the jth gate obeys a first-order kinetic scheme:

U_j + \mathrm{Ca} \;\overset{k_j^+}{\underset{k_j^-}{\rightleftharpoons}}\; B_j, \quad j = 1, 2, 3, 4,   (2.1)
where B_j and U_j are the normalized (B_j + U_j = 1) concentrations or probabilities of bound and unbound gates and Ca is the concentration of free Ca2+ at the release site. The binding rates (k_j^+, in msec⁻¹ µM⁻¹) and unbinding rates (k_j^-, in msec⁻¹) were chosen to produce steps in the facilitation curve: k_1^+ = 3.75 × 10⁻³, k_1^- = 4 × 10⁻⁴; k_2^+ = 2.5 × 10⁻³, k_2^- = 1 × 10⁻³; k_3^+ = 5 × 10⁻⁴, k_3^- = 0.1; k_4^+ = 7.5 × 10⁻³, k_4^- = 10. The k_j^- vary from small to large, resulting in a wide range of unbinding time constants (1/k_j^-): 2.5 sec, 1 sec, 10 msec, and 0.1 msec, respectively. The temporal evolution of B_j is described by

\frac{dB_j}{dt} = -(k_j^- + k_j^+ \mathrm{Ca})\, B_j + k_j^+ \mathrm{Ca}, \quad j = 1, 2, 3, 4.   (2.2)
The opening and closing of a Ca2+ channel is a stochastic process, with transition rates that depend on membrane voltage (V). The quantity Ca is therefore a random variable. In Bertram et al. (1996), a Monte Carlo procedure was used to handle the effects of stochastic channel activity on release. Although effective, this approach is computationally demanding, and a better approach was subsequently developed (Bertram & Sherman, in press). Here, Fokker-Planck-type partial differential equations were derived for the probability distributions of bound gates under open and closed
Ca2+ channels. Ordinary differential equations for the expected values of these distributions—that is, the mean concentrations of bound gates of each type—were then derived. Although the four gates are physically independent, they are statistically correlated through their common dependence on the domain Ca2+ concentration. Therefore, expected or mean release is not simply the product of the mean bound concentrations of each gate type; it depends on the mean concentrations of all gate configurations, resulting in a system of 30 differential equations. However, it was shown that the correlation among gates is removed and the number of equations necessary to compute release reduced to 4 if it is assumed that the gates respond to the average domain Ca2+ concentration—the domain Ca2+ concentration averaged over the entire population of Ca2+ channels. Using perturbation analysis, it was shown that the error introduced by this approximation is typically small. Taking the expected values of both sides of equation 2.2, and replacing E[Ca B_j] by E[Ca] E[B_j] (the average domain Ca2+ approximation), we get

\frac{d\sigma_j}{dt} = -(k_j^- + k_j^+ \overline{\mathrm{Ca}})\, \sigma_j + k_j^+ \overline{\mathrm{Ca}}, \quad j = 1, 2, 3, 4,   (2.3)
where σ_j is the expected value of B_j and \overline{\mathrm{Ca}} = E[Ca] is the average domain Ca2+ concentration, equal to the product of the fraction of open channels (m) and the domain Ca2+ concentration at an open channel (Ca_open). The latter is assumed to be proportional to the Ca2+ influx through an open channel (i(V)): Ca_open = −A i(V). Assuming that the Ca2+ channel has a single open and a single closed state, the fraction of open channels satisfies

\frac{dm}{dt} = \alpha_m (1 - m) - \beta_m m   (2.4)
where αm , βm are the channel opening and closing rates, respectively. Expected release is then E[R] = σ1 σ2 σ3 σ4 .
(2.5)
The fourth gate, unlike the others, has fast unbinding kinetics, so that σ4 changes rapidly with changes in Ca. Thus, the number of differential equations needed to compute E[R] is further reduced by one by assuming that σ4 is in equilibrium with Ca.
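As a sketch of how equations 2.3 through 2.5 are used in practice, the fragment below advances the three slow gates with forward Euler for a given average domain Ca2+ concentration and sets σ4 to its equilibrium value, as described above. The driving concentration Ca_bar is left as an input; in the article it is the product of m and Ca_open.

```python
import numpy as np

# Mean-release computation (equations 2.3-2.5) with sigma_4 at equilibrium.
# Rate constants are those of Section 2 (msec^-1 uM^-1 and msec^-1).
kp = np.array([3.75e-3, 2.5e-3, 5e-4, 7.5e-3])   # k_j^+
km = np.array([4e-4, 1e-3, 0.1, 10.0])           # k_j^-

def step_sigma(sigma, Ca_bar, dt):
    """Forward-Euler step of equation 2.3 for the three slow gates."""
    return sigma + dt * (-(km[:3] + kp[:3] * Ca_bar) * sigma + kp[:3] * Ca_bar)

def expected_release(sigma, Ca_bar):
    """E[R] = sigma_1 sigma_2 sigma_3 sigma_4 (equation 2.5); the fast
    fourth gate is assumed equilibrated with Ca_bar."""
    sigma4 = kp[3] * Ca_bar / (km[3] + kp[3] * Ca_bar)
    return float(np.prod(sigma)) * sigma4
```

Coupling this to a voltage model only requires supplying Ca_bar(t) = m(t) · Ca_open(V(t)) from equation 2.4 and the single-channel current.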
3 Example: The Dopamine Neuron

Our release model can be used with any model of membrane voltage or even with experimental voltage recordings. As an example, we use it in
conjunction with a model of in vitro electrical activity of the dopamine neuron of the basal ganglia (Li, Bertram, & Rinzel, 1996). A periodic bursting pattern, high-frequency spiking followed by a long period of silence, is produced (see Fig. 1B) in the presence of the glutamate agonist N-methyl-D-aspartate (NMDA). In the absence of NMDA, the cell spikes continuously at a lower frequency (see Fig. 1A). Mean release evoked by the continuous spiking pattern shows some facilitation (see Fig. 1E), due to the slow growth of σ1 and σ2 (see Fig. 1C). However, facilitation generated by the bursting pattern is much greater (see Fig. 1F), consistent with in vivo experimental findings (Gonon, 1988). In this case σ3 increases during the first portion of the active phase, while σ1 and σ2 increase throughout the active phase (see Fig. 1D), facilitating release within an active phase. Between active phases σ3 decays back to its baseline value, but the burst period is short enough (800 msec) that the active-phase increases in σ1 and σ2 are not entirely lost. Therefore, σ1 and σ2 accumulate over several bursts, and release evoked by later bursts is greater than that evoked by earlier ones.

4 A Minimal Model

To demonstrate how the transmitter release model can be used in neural network simulations, it is simplified to a form that can be used with integrate-and-fire neurons. This model is coupled to simple models of postsynaptic transmitter receptors and passive membrane to generate a postsynaptic voltage response. Following Destexhe, Mainen, & Sejnowski (1994a), we assume that each impulse leads to the release of a square pulse of transmitter of 1 msec duration. In contrast to Destexhe et al., we assume that the size of the transmitter pulse evoked by each stimulus is not constant. The magnitude of the initial pulse is T(1), while the magnitude of each subsequent pulse is the product of T(1) and the presynaptic facilitation. Facilitation is computed by solving equation 2.3 for σ1, σ2, and σ3 (or just one of the three for the simplest model that exhibits facilitation), assuming that each presynaptic stimulus leads to a square pulse of Ca lasting 1 msec. (With 10 mM external Ca2+, Ca summed over an impulse is 63 µM, using the Hodgkin-Huxley equations [Hodgkin & Huxley, 1952] to generate the impulse. This value may be used as the magnitude of the Ca pulse, scaled linearly to reflect any changes in the external Ca2+ concentration.) Facilitation contributed by the jth gate is then the ratio of σj at the end of the nth pulse to the value at the end of the first pulse.

We assume that the facilitating transmitter pulses act on a simple two-state postsynaptic receptor (Destexhe et al., 1994a), with the fraction of bound receptors (a) given by

\frac{da}{dt} = \alpha_a T (1 - a) - \beta_a a,   (4.1)
Figure 1: Expected release (see equations 2.3–2.5) evoked by a dopamine neuron model. In the presence of NMDA, the cell generates bursts of action potentials (B); otherwise, it spikes continuously (A). Release is much greater when the cell bursts (E, F; notice the different scales) due to the greater accumulation of bound gates (C, D; σ3 has been multiplied by 30). The single-channel Ca2+ current is given by a Goldman-Hodgkin-Katz expression, i(V) = 1.45V[Ca_ex/(1 − exp(0.07V))], where Ca_ex = 1 mM is the external Ca2+ concentration. We use A = 0.1 µM fA⁻¹ and α_m = 3.1 e^{V/10}, β_m = e^{−V/26.7} to compute \overline{\mathrm{Ca}}. Both α_m and β_m are based on squid giant synapse data (Llinás, Steinberg, & Walton, 1981), modified for the higher temperature at which dopamine neurons typically operate (37°C).
where T is the transmitter concentration and α_a and β_a are the binding and unbinding rates, respectively. Postsynaptic membrane voltage (V_post) is computed with the passive membrane equation

\frac{dV_{post}}{dt} = -(I_{mem} + I_{syn}) / C_{mem}   (4.2)
where C_mem is the membrane capacitance, I_mem = \bar{g}_{mem} (V_post − V_mem) is the passive membrane current, and I_syn = \bar{g}_{syn} a (V_post − V_syn) is the synaptic current. Note that since both Ca and T appear as square pulses in equations 2.3 and 4.1, the equations have piecewise exponential solutions (see Destexhe et al., 1994a).

In Figure 2 this minimal model is used to compute the postsynaptic voltage response to a short 100-Hz stimulus train. The synapse facilitates throughout the train, producing consistently larger transmitter pulses (see Fig. 2A) and postsynaptic voltage depolarizations (see Fig. 2B). Combined with the relatively slow membrane dynamics, this leads to a substantial rise in the average postsynaptic voltage (see Fig. 2B, solid curve). In contrast, when facilitation is excluded by assuming that the size of each transmitter pulse is the same, the average postsynaptic voltage exhibits only a small rise, associated with the slow membrane dynamics (see Fig. 2B, dashed curve).
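Because Ca arrives as fixed 1-msec square pulses, the facilitation of each transmitter pulse can be computed event by event from the closed-form (piecewise exponential) solution of equation 2.3, without numerical integration. The sketch below does this for a ten-impulse 100-Hz train; the 63 µM pulse amplitude and 0.1 mM initial transmitter pulse follow the text and Figure 2, while the printout is purely illustrative.

```python
import numpy as np

# Event-driven facilitation in the minimal model (Section 4): each stimulus
# delivers a 1-msec square Ca pulse of 63 uM; sigma_1..sigma_3 follow the
# piecewise exponential solution of equation 2.3, and the nth transmitter
# pulse is T(n) = T(1) * facilitation(n).
kp = np.array([3.75e-3, 2.5e-3, 5e-4])        # k_j^+ (msec^-1 uM^-1)
km = np.array([4e-4, 1e-3, 0.1])              # k_j^- (msec^-1)
CA, WIDTH, PERIOD, T1 = 63.0, 1.0, 10.0, 0.1  # uM, ms, ms (100 Hz), mM

rate = km + kp * CA                 # relaxation rate during a pulse
sig_inf = kp * CA / rate            # gate equilibrium during a pulse
sig, ref = np.zeros(3), None
for n in range(1, 11):
    sig = sig_inf + (sig - sig_inf) * np.exp(-rate * WIDTH)  # end of pulse n
    ref = sig.copy() if ref is None else ref   # values after the first pulse
    fac = float(np.prod(sig / ref))            # product of per-gate ratios
    print(f"pulse {n:2d}: facilitation {fac:5.2f}, T = {T1 * fac:.3f} mM")
    sig *= np.exp(-km * (PERIOD - WIDTH))      # free decay to next stimulus
```

The slow-unbinding gates (small k_j^-) barely decay between stimuli, so the facilitation grows over the train, which is what produces the rising transmitter pulses of Figure 2A.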
5 Discussion

Computational models of synaptic transmission typically concentrate on postsynaptic mechanisms, assuming explicitly or implicitly that the amount of neurotransmitter released by each impulse is the same (Rall, 1967; Destexhe et al., 1994a). However, a key feature of the transmission process is that the number of transmitter molecules released by an impulse depends on the history of presynaptic electrical activity. The model discussed here is one of several intended to describe presynaptic transmitter release and short-term presynaptic enhancement (Zucker & Fogelson, 1986; Parnas, Dudel, & Parnas, 1986; Yamada & Zucker, 1992). In addition to the differences discussed earlier, these other release models differ from the present one in the structure of their release sites. In Zucker and Fogelson (1986), it was assumed that the kinetics of Ca2+ binding to a release site are instantaneous, so that the rate of release is proportional to some power of the Ca2+ concentration at the release site. This model was later modified to include finite kinetic rates (Yamada & Zucker, 1992), but because the modified model includes only one slow Ca2+ binding site, facilitation depends largely on residual free Ca2+, contrary to a growing body of experimental data (Stanley, 1986; Blundon et al., 1993; Winslow et al., 1994). In addition, this model does not account for the multiple time scales of facilitation observed experimentally (Stanley, 1986; Magleby, 1987). Finally, the model by Parnas et al. (1986)
Figure 2: The minimal model of presynaptic facilitation is combined with simple models of postsynaptic binding and passive membrane to describe the postsynaptic voltage response to a short 100-Hz impulse train (see equations 2.3, 4.1, and 4.2). (A) Each stimulus elicits a 1-msec square pulse of transmitter (T). The magnitude of the first pulse is assumed to be 0.1 mM. The magnitude of subsequent pulses grows as the synapse facilitates. (B) The increased magnitude of T results in an increased postsynaptic voltage response with each stimulus, leading to a significant rise in the average postsynaptic voltage (solid line). In the absence of facilitation, the average voltage equilibrates at a lower level (dashed line). Postsynaptic parameter values are: α_a = 2 msec⁻¹ mM⁻¹, β_a = 1 msec⁻¹, \bar{g}_{mem} = 0.1, \bar{g}_{syn} = 0.2 (mS cm⁻²), V_mem = −70, V_syn = 0 (mV), and C_mem = 1 µF cm⁻².
postulates an explicit voltage dependence in the release mechanism, a feature that has not been supported by experimental data (Landò & Zucker, 1994).
The present model also belongs to the family of kinetic models used to describe a wide variety of neuronal mechanisms (see Destexhe, Mainen, & Sejnowski, 1994b, for an overview). These include voltage-gated and ligand-gated ion channels (Hodgkin & Huxley, 1952; Destexhe et al., 1994b); IP3-sensitive Ca2+ channels in the membrane of the endoplasmic reticulum (De Young & Keizer, 1992); postsynaptic receptors and channels (Standley, Ramsey, & Usherwood, 1993; Destexhe et al., 1994a,b); and presynaptic transmitter release sites (Parnas et al., 1986; Yamada & Zucker, 1992; Destexhe et al., 1994b; Bertram et al., 1996). In Destexhe et al. (1994b) a nonfacilitating kinetic model of transmitter release was coupled to kinetic models of postsynaptic receptor binding to provide a description of the complete synaptic signal transduction process. As demonstrated in Figure 2, the release model discussed in this article can be applied in a similar manner, introducing the key feature of presynaptic facilitation to the signal transduction scheme.
Acknowledgments

I thank Arthur Sherman for many helpful discussions and for a careful reading of the manuscript. This work was performed while I was at the Mathematical Research Branch (NIDDK, NIH) in Bethesda, MD.
References

Bertram, R., and Sherman, A. (In press). Population dynamics of synaptic release sites. SIAM J. Appl. Math.
Bertram, R., Sherman, A., and Stanley, E. F. (1996). Single-domain/bound calcium hypothesis of transmitter release and facilitation. J. Neurophysiol., 75, 1919–1931.
Blundon, J. A., Wright, S. N., Brodwick, M. S., and Bittner, G. D. (1993). Residual free calcium is not responsible for facilitation of neurotransmitter release. Proc. Natl. Acad. Sci. USA, 90, 9388–9392.
De Young, G., and Keizer, J. (1992). A single pool IP3-receptor based model for agonist stimulated Ca2+ oscillations. Proc. Natl. Acad. Sci. USA, 89, 9895–9899.
Delaney, K. R., and Tank, D. W. (1994). A quantitative measurement of the dependence of short-term synaptic enhancement on presynaptic residual calcium. J. Neurosci., 14, 5885–5902.
Delaney, K. R., Zucker, R. S., and Tank, D. W. (1989). Calcium in motor nerve terminals associated with posttetanic potentiation. J. Neurosci., 9, 3558–3567.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1, 195–230.
Gonon, F. G. (1988). Nonlinear relationship between impulse flow and dopamine release by rat midbrain dopaminergic neurons as studied by in vivo electrochemistry. Neuroscience, 24, 19–28.
Hodgkin, A. L., and Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Landò, L., and Zucker, R. S. (1994). Ca2+ cooperativity in neurosecretion measured using photolabile Ca2+ chelators. J. Neurophysiol., 72, 825–830.
Li, Y.-X., Bertram, R., and Rinzel, J. (1996). Modeling N-methyl-D-aspartate-induced bursting in dopamine neurons. Neuroscience, 71, 397–410.
Llinás, R., Steinberg, I. Z., and Walton, K. (1981). Presynaptic calcium currents in squid giant synapse. Biophys. J., 33, 289–322.
Magleby, K. (1987). Short-term changes in synaptic efficacy. In G. M. Edelman, W. E. Gall, and W. M. Cowan (Eds.), Synaptic function (pp. 21–56). New York: Wiley.
Parnas, H., Dudel, J., and Parnas, I. (1986). Neurotransmitter release and its facilitation in the crayfish. VII. Another voltage dependent process besides Ca entry controls the time course of phasic release. Pflügers Arch., 406, 121–130.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different somadendritic distributions of synaptic inputs. J. Neurophysiol., 30, 1138–1168.
Standley, C., Ramsey, R. L., and Usherwood, P. N. R. (1993). Gating kinetics of the quisqualate-sensitive glutamate receptor of locust muscle studied using agonist concentration jumps and computer simulations. Biophys. J., 65, 1379–1386.
Stanley, E. F. (1986). Decline in calcium cooperativity as the basis of facilitation at the squid giant synapse. J. Neurosci., 6, 782–789.
Stanley, E. F. (1993). Single calcium channels and acetylcholine release at a presynaptic nerve terminal. Neuron, 11, 1007–1011.
Winslow, J. L., Duffy, S. N., and Charlton, M. P. (1994). Homosynaptic facilitation of transmitter release in crayfish is not affected by mobile calcium chelators: Implications for the residual ionized calcium hypothesis from electrophysiological and computational analyses. J. Neurophysiol., 72, 1769–1793.
Yamada, W. M., and Zucker, R. S. (1992). Time course of transmitter release calculated from simulations of a calcium diffusion model. Biophys. J., 61, 671–682.
Zucker, R. S., and Fogelson, A. L. (1986). Relationship between transmitter release and presynaptic calcium influx when calcium enters through discrete channels. Proc. Natl. Acad. Sci. USA, 83, 3032–3036.
Received April 10, 1996; accepted August 9, 1996.
NOTE
Communicated by Klaus Pawelzik
Functional Periodic Intracortical Couplings Induced by Structured Lateral Inhibition in a Linear Cortical Network

S. P. Sabatini
G. M. Bisio
PSPC Research Group, Department of Biophysical and Electronic Engineering, University of Genova, Genova, Italy
L. Raffo
Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
The spatial organization of cortical axon and dendritic fields could be an interesting structural paradigm to obtain a functional specificity without postulating highly specific feedforward connections. In this article, we investigate the functional implications of recurrent intracortical inhibition when it occurs through clustered medium-range interconnection schemes (Wörgötter & Koch, 1991; Somogyi, 1989; Kritzer, Cowey, & Somogyi, 1992). Moreover, the interaction between the inhibitory schemes and visual orientation maps is explored. Assuming linearity, we show that clustered inhibitory mechanisms can trigger a propagation process that allows the development of extra (i.e., induced) interactions among the cortical sites involved in the recurrent loops. In addition, we point out how these interactions functionally modify the response of cortical simple cells and yield highly structured Gabor-like receptive fields. This study should be considered not as a realistic biological model of the primary visual cortex but as an attempt to explain possible computational principles related to intracortical connectivity and to the underlying single-cell properties.
1 The Model

Assuming an "averaged model" of the visual cortex (Amari, 1977; Krone, Mallot, Palm, & Schüz, 1986; Chernjavsky & Moody, 1990), the cortical surface ($\mathbf{x} = (x_1, x_2)$) can be viewed as a two-dimensional homogeneous distribution of cortical elements (nodes) characterized by specific orientation selectivities $\theta(\mathbf{x})$. In this context, a node represents a neural population, and interconnections represent functionally the overlap of axonal and dendritic clouds. Under linear steady-state conditions, the distribution of cortical
excitation $e(\mathbf{x})$ can be modeled as
$$e(\mathbf{x}) = e_0(\mathbf{x}) - b \int k_{fb}(\mathbf{x};\boldsymbol{\xi})\, e(\boldsymbol{\xi})\, d\boldsymbol{\xi} = e_0(\mathbf{x}) - b \int k_{ff}(\mathbf{x};\boldsymbol{\xi};b)\, e_0(\boldsymbol{\xi})\, d\boldsymbol{\xi}, \tag{1.1}$$
where $e_0(\mathbf{x})$ is the feedforward distribution of excitation, $b$ is the strength of the inhibition ($b > 0$), and $k_{fb}(\mathbf{x};\boldsymbol{\xi})$ is the intracortical coupling function. The feedforward resolvent kernel of the integral equation, $k_{ff}(\mathbf{x};\boldsymbol{\xi};b)$ (Winter, 1990), represents an interaction among cortical sites, telling how the total afferent drive at one site affects activity at a second site. Equation 1.1 can thus be interpreted as a two-dimensional iterative filter that detects specific characteristics present in the input pattern of excitation $e_0(\mathbf{x})$; the spatial extension over which these characteristics are detected is possibly larger than that of $k_{fb}(\mathbf{x};\cdot)$. This occurs both directly, by physical local interactions, and indirectly, through the propagation property of iterative filters, which asserts that the output values of the filter can be recurrently affected by a larger neighboring region on the cortical plane. In this way, one can speak of "induced" functional couplings not directly related to the presence of the corresponding specific wirings.

In the case of a translation-invariant kernel ($k_{fb}(\mathbf{x};\boldsymbol{\xi}) = k_{fb}(\mathbf{x}-\boldsymbol{\xi})$), the exact solution of equation 1.1 can be obtained straightforwardly by Fourier transform techniques, and the resulting cortical activity patterns tend to have a spatial frequency corresponding to the peak of the function $(1 + bK_{fb}(\mathbf{f}))^{-1}$, where uppercase letters refer to Fourier transforms and $\mathbf{f} = (f_1, f_2)$ is the spatial frequency vector. In particular, when inhibition arises from clusters of cells spatially distributed at a distance $\pm d$ from the target cell (see Fig. 1a), the spectrum of the feedforward resolvent kernel, $K_{ff} = K_{fb}(1 + bK_{fb})^{-1}$, shrinks, and the energy peak shifts approximately to $\mathbf{f}^* = (\pm 1/2d,\, 0)$ (see Fig. 1b). This result can be motivated by the following expansion:
$$[1 + bK_{fb}(\mathbf{f})]^{-1} = 1 - bK_{fb}(\mathbf{f}) + b^2K_{fb}^2(\mathbf{f}) - b^3K_{fb}^3(\mathbf{f}) + \cdots = 1 - bK_{ff}(\mathbf{f}). \tag{1.2}$$
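As a quick numerical illustration of this spectral shift (our own sketch, not part of the original article; the parameter values echo those quoted later in the Figure 2 caption), one can tabulate the resolvent spectrum $K_{ff} = K_{fb}(1 + bK_{fb})^{-1}$ for a kernel made of two gaussians offset by $\pm d$ and locate its energy peak:

```python
import numpy as np

# Illustrative parameters (after the values in the Figure 2 caption):
# cluster offset d, kernel spread sigma_k, inhibition strength b.
d, sigma_k, b = 0.25, 0.0625, 0.5

f = np.linspace(0.0, 8.0, 4001)  # spatial frequency along the axis phi_0
# Spectrum of two gaussians offset by +/- d: a gaussian envelope times a
# cosine, normalized here so that K_fb(0) = 1.
K_fb = np.exp(-2.0 * (np.pi * f * sigma_k) ** 2) * np.cos(2.0 * np.pi * f * d)
K_ff = K_fb / (1.0 + b * K_fb)   # feedforward resolvent spectrum

f_peak = f[np.argmax(np.abs(K_ff))]
print(f"|K_ff| peaks at f = {f_peak:.2f}; 1/(2d) = {1.0 / (2.0 * d):.2f}")
```

Increasing $b$ sharpens the redistribution of spectral energy around $f \approx 1/2d$, mirroring the insets of Figure 1b.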
Since $K_{fb}$ is a low-pass filter, its powers in equation 1.2 lead to a decrease in bandwidth and so to an increasing width of the resolvent coupling function. Specifically, if we consider $k_{fb}(\mathbf{x}-\boldsymbol{\xi})$ as the sum of two laterally offset gaussians $g(x_1 \pm d,\, x_2;\, \sigma_k^2)$, the powers of its spectrum result, in the space domain, in a sum of shifted gaussians with alternating signs, which contribute to the coupling with decreasing strength:
$$k_{ff}(\mathbf{x}-\boldsymbol{\xi}) = \sum_{l=1}^{\infty} (-1)^{l-1}\, k_l(\mathbf{x}-\boldsymbol{\xi}), \tag{1.3}$$
Figure 1: (a) Schematic representation of the recurrent inhibitory kernel ($k_{fb}$), with $\theta = \pi/2$, $\varphi_0 = \theta + \pi/2$, $\Delta\varphi = \pi/8$. The thick vertical line in the center represents the orientation of the target cell, which receives inhibition from the cells in the shaded area. The lateral spread of interconnections can be characterized by the angular bandwidth $\alpha$ related to the half-peak magnitude extension of the kernel. (b) The normalized cross sections, along the direction $\varphi_0$, of $|K_{fb}(\mathbf{f})|$ (dashed line) and of $|K_{ff}(\mathbf{f})|$ for $b = 0.5$ (thick continuous line). Increasing the value of $b$ (see the insets), one can observe a major redistribution of the spectral energy around the frequency $f_1 = \pm 1/2d$.
where
$$k_l(x_1-\xi_1,\, x_2-\xi_2) = \frac{b^{l-1}}{l} \sum_{m=0}^{l} \binom{l}{m}\, g\!\left(x_1-\xi_1-(2m-l)d,\; x_2-\xi_2;\; l\sigma_k^2\right). \tag{1.4}$$
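A direct numerical evaluation of the truncated series 1.3 and 1.4 (again our own sketch, with the same illustrative parameters as above) makes the alternation of excitatory and inhibitory lobes visible along a one-dimensional section through the coupling field:

```python
import numpy as np
from math import comb

def g(x, mean, var):
    # Unnormalized gaussian bump standing in for the coupling term g(.)
    return np.exp(-(x - mean) ** 2 / (2.0 * var))

d, sigma_k, b = 0.25, 0.0625, 0.5        # illustrative parameters, as above
x = np.linspace(-1.0, 1.0, 801)          # 1-D section along the offset axis

k_ff = np.zeros_like(x)
for l in range(1, 40):                   # truncation of the series 1.3
    k_l = (b ** (l - 1) / l) * sum(
        comb(l, m) * g(x, (2 * m - l) * d, l * sigma_k ** 2)
        for m in range(l + 1))
    k_ff += (-1) ** (l - 1) * k_l

# Sample the coupling's sign every quarter-offset (step d): the interleaving
# of +1 and -1 reflects the alternating-sign structure of the series.
print(np.sign(k_ff[::100]).astype(int))
```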
Expressions 1.3 and 1.4 clearly show the regular distribution of the additionally induced couplings and their potential impact on the spatial structure of simple-cell receptive fields. It is worth noting that the geometrical organization of inhibitory couplings plays a key role: whereas recurrent inhibitory couplings among clustered populations of neurons induce a regular spatial alternation of excitatory and inhibitory interactions, this behavior does not show up when inhibitory connections are uniformly distributed over the cortical plane.

The functional effects of clustered recurrent inhibition are even more interesting when the structure of the interconnection schemes depends on the orientation selectivity of the target cell. In the following, we focus on inhibitory interactions occurring along the axis orthogonal to the receptive field orientation (Gilbert & Wiesel, 1989; Kisvárday et al., 1994) (see Figs. 2a–d). Accordingly, we consider $k_{fb}$ as
$$k_{fb}(\mathbf{x};\boldsymbol{\xi}) = g_{\varphi_0}(\mathbf{x};\boldsymbol{\xi}) + g_{\varphi_0+\Delta\varphi}(\mathbf{x};\boldsymbol{\xi}) + g_{\varphi_0-\Delta\varphi}(\mathbf{x};\boldsymbol{\xi}), \tag{1.5}$$
where $g_\varphi$ represents a pair of translated ($\pm d$) two-dimensional gaussian functions (with spatial spread $\sigma_k$), rotated through an angle $\varphi(\mathbf{x})$ about the location of the target cell. Specifically, we choose $\varphi_0(\mathbf{x}) = \theta(\mathbf{x}) + \pi/2$ and $\Delta\varphi$ in the range $[0,\, \sigma_k (2 \ln 2)^{1/2}/d]$, the upper limit giving a maximally flat function $k_{fb}$. Figures 2e–h show different resultant interaction fields for various orientation maps. Observe that smooth periodic variations in the orientation preferences are decisive in sustaining the propagation mechanism and thereby priming the periodicity of the induced couplings (see Fig. 2e). Random arrangements of orientations, in contrast, as well as high gradients in the orientation map where inhibition is present, result in a loss of coherence and hence in the attenuation or even the suppression of the contributions of recurrence (see Figs. 2f–h). For all these maps, we observe that although the recurrent kernel $k_{fb}$ has a uniform positive sign, the resultant interaction fields $k_{ff}$ show areas where the couplings are negative in sign. The characteristics of these resultant interaction fields (i.e., the size, the shape, and the sign of interaction) have a complex dependence on the pattern of recurrent connections.

2 Effects on the Cortical Receptive Field Profiles

Since the striate cortex is visuotopically organized, one might expect a relationship between the morphology of lateral connections and the receptive field topography (Mallot, von Seelen, & Giannakopoulos, 1990; Schwarz & Bolz, 1991).
Figure 2: The white blobs in the top quartet show the feedback inhibitory kernels ($k_{fb}$) referred to the central cell of four different distributions of orientation-selective cells. The orientation fields in (a–c) correspond to portions of a single orientation map (derived after Niebur and Wörgötter's model in Niebur & Wörgötter, 1994) and represent different typical situations, such as an iso-orientation domain (a), a pinwheel discontinuity (b), and a fracture discontinuity (c). The map in (d) represents a random arrangement of orientations. The middle quartet (e–h) shows the contour plots of the resulting feedforward kernels after recursion, still referred to the central cell, for the corresponding orientation fields. The plots point out the topology of the inhibitory-excitatory effects of the total afferent drive on the activity of the central cell: gray lines represent positive values of $k_{ff}$ (i.e., inhibitory contributions), whereas black lines represent negative values of $k_{ff}$ (i.e., excitatory contributions). The bottom quartet (i–l) shows the corresponding receptive fields, $h(\cdot, \mathbf{x}')$, obtained by a numerical evaluation of equation 2.3. The parameters used in all the simulations are $\sigma_k = 0.0625$, $\Delta\varphi = \pi/12$, $d = 0.25$, $\sigma_h = 0.1$, and $p = 3$. The inhibition strength $b$ is 0.5.
To model the effect of recurrent inhibition on linear simple-cell receptive fields, we assume the superposition of contributions from the lateral geniculate nucleus (LGN) and from intracortical inhibitory interactions. Specifically, functional projections of LGN afferent neurons provide a weak orientation bias to cortical cells, and the final profiles of the receptive fields are molded by the inhibitory processes inside the cortex. From a computational point of view, assuming the linearity of simple cells (De Angelis, Ohzawa, & Freeman, 1993), the initial excitation $e_0(\mathbf{x})$ is given by
$$e_0(\mathbf{x}) = \int h_0(\mathbf{x};\mathbf{x}')\, i(\mathbf{x}')\, d\mathbf{x}', \tag{2.1}$$
where $i(\mathbf{x})$ is the input signal and $h_0(\mathbf{x};\mathbf{x}')$ is the initial receptive field profile due only to thalamic feedforward connections. The initial receptive field profile is modeled as an oriented gaussian function: $h_0(x_1, x_2) = (1/2\pi p\sigma_h^2)\, \exp\!\left[-\left(x_1^2/p^2 + x_2^2\right)/2\sigma_h^2\right]$, where $p$ refers to the aspect ratio of the field and $\sigma_h$ specifies the receptive field size. Taking intracortical contributions into account, the resulting excitation can be expressed as
$$e(\mathbf{x}) = \int h(\mathbf{x};\mathbf{x}')\, i(\mathbf{x}')\, d\mathbf{x}'. \tag{2.2}$$
From equations 2.1 and 2.2, we can derive the expression for the resulting receptive field after inhibition:
$$h(\mathbf{x};\mathbf{x}') = h_0(\mathbf{x};\mathbf{x}') - b \int k_{ff}(\mathbf{x};\boldsymbol{\xi};b)\, h_0(\boldsymbol{\xi};\mathbf{x}')\, d\boldsymbol{\xi}. \tag{2.3}$$
By numerical evaluation of equation 2.3, we obtained the resultant receptive field profiles for the four orientation fields considered (see Figs. 2i–l). We note that horizontal inhibition may well be the substrate for a variety of influences observed between the receptive field center and its surround, including the refining of the orientation and spatial frequency tuning of simple cells and the emergence of Gabor-like receptive field profiles (Sabatini, 1996). Equation 2.3 indicates how the receptive fields, interpreted as retinotopic distributions of the activity on the cortical plane, reflect the whole spatial pattern of couplings and not only the direct neuroanatomical connectivity (von Seelen, Mallot, & Giannakopoulos, 1987). Therefore, asymmetry and anisotropy of lateral intracortical couplings manifest themselves also in their visuotopic representations, influencing the receptive fields of the cells exposed to the interconnection patterns (Martin & Whitteridge, 1984; Gilbert & Wiesel, 1989).

References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Chernjavsky, A., & Moody, J. (1990). Spontaneous development of modularity in simple cortical models. Neural Computation, 2, 334–354.
De Angelis, G. C., Ohzawa, I., & Freeman, R. D. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. J. Neurophysiol., 69, 1118–1135.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and cortico-cortical connections in cat visual cortex. J. Neuroscience, 9, 2432–2442.
Kisvárday, Z. F., Kim, D.-S., Eysel, U. T., & Bonhoeffer, T. (1994). Relationship between lateral inhibitory connections and the topography of the orientation map in cat visual cortex. Eur. J. Neurosci., 6, 1619–1632.
Kritzer, M. F., Cowey, A., & Somogyi, P. (1992). Patterns of inter- and intralaminar GABAergic connections distinguish striate (V1) and extrastriate (V2, V4) visual cortices and their functionally specialized subdivisions in the rhesus monkey. J. Neuroscience, 12(11), 4545–4564.
Krone, G., Mallot, H., Palm, G., & Schüz, A. (1986). Spatiotemporal receptive fields: A dynamical model derived from cortical architectonics. Proc. R. Soc. London Biol., 226, 421–444.
Mallot, H. A., von Seelen, W., & Giannakopoulos, F. (1990). Neural mapping and space variant image processing. Neural Networks, 3, 245–263.
Martin, K. A. C., & Whitteridge, D. (1984). The relationship of receptive field properties to the dendritic shape of neurones in the cat striate cortex. J. Physiol., 356, 291–302.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Computation, 6, 602–614.
Sabatini, S. P. (1996). Recurrent inhibition and clustered connectivity as a basis for Gabor-like receptive fields in the visual cortex. Biol. Cybern., 74(3), 189–202.
Somogyi, P. (1989). Synaptic organization of GABAergic neurons and GABA$_A$ receptors in the lateral geniculate nucleus and visual cortex. In D. Man-Kit Lam and C. D. Gilbert (Eds.), Neural Mechanisms of Visual Perception (pp. 35–62). Houston: Portfolio.
von Seelen, W., Mallot, H. A., & Giannakopoulos, F. (1987). Characteristics of neuronal systems in the visual cortex. Biol. Cybern., 56, 37–49.
Winter, D. F. (1990). Integral equations. In C. E. Pearson (Ed.), Handbook of Applied Mathematics (pp. 512–570). New York: Van Nostrand Reinhold.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary vision pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neuroscience, 11, 1959–1979.
Received January 4, 1996; accepted August 2, 1996.
Communicated by Geoffrey Goodhill
Possible Roles of Spontaneous Waves and Dendritic Growth for Retinal Receptive Field Development

Pierre-Yves Burgi
Norberto M. Grzywacz
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115 USA
Several models of cortical development postulate that a Hebbian process fed by spontaneous activity amplifies orientation biases occurring randomly in early wiring, to form orientation selectivity. These models are not applicable to the development of retinal orientation selectivity, since they neglect the polarization of the retina’s poorly branched early dendritic trees and the wavelike organization of the retina’s early noise. There is now evidence that dendritic polarization and spontaneous waves are key in the development of retinal receptive fields. When models of cortical development are modified to take these factors into account, one obtains a model of retinal development in which early dendritic polarization is the seed of orientation selectivity, while the spatial extent of spontaneous waves controls the spatial profile of receptive fields and their tendency to be isotropic.
1 Introduction Retinal cells display some functionally mature receptive fields as soon as light responses can be recorded, even prior to birth or eye opening (Masland, 1977; Rusoff & Dubin, 1977; Dacheux & Miller, 1981a, 1981b; Tootle, 1993; Sernagor & Grzywacz, 1995a). In some species, early ganglion cells are even selective to the orientation of image features and the direction of their motion (Masland, 1977; Sernagor & Grzywacz, 1995a). These selectivities may in some sense be genetically specified. However, as it was well expressed by Jacobson (1991), “How essentially the same gene products form a brain [structure] at one position and a spinal cord [another structure] at another place is not given directly by the genome, but eventuates from a network of regulatory epigenetic processes.” When applied to the retina, this reasoning leads to an interest in epigenetic processes that contribute to the retina’s receptive field self-organization. Three observations may be key to the understanding of the development of retinal orientation selectivity. First, in the retina, complex receptive field properties such as orientation selectivity seem to depend on a single synaptic Neural Computation 9, 533–553 (1997)
Second, developing retinas are characterized by early polarization of dendritic trees (Maslim, Webster, & Stone, 1986; Ramoa, Campbell, & Shatz, 1988; Dunlop, 1990; Vanselow, Dütting, & Thanos, 1990), which seems to impart anisotropies to immature receptive fields (Sernagor & Grzywacz, 1995a). Third, multielectrode and optical recording studies in developing retinas of cats and ferrets have revealed that waves of neural activity propagate in random directions across the ganglion cell surface and the IPL (Meister, Wong, Baylor, & Shatz, 1991; Wong, Meister, & Shatz, 1993), correlating the activity of neighbor amacrine and ganglion cells (Wong, Chernjavsky, Smith, & Shatz, 1995). Such a correlation in retinal neighbor cells has also been observed in other species (in rat: Maffei & Galli-Resta, 1990; in turtle: Sernagor & Grzywacz, 1993).

It has been shown that self-organizing multilayer networks with random uncorrelated noise in the first layer, retinotopic connections (with random errors) from layer to layer, and Hebb-like rules for synaptic maturation can lead to the emergence of orientationally selective cells with properties similar to those in the cortex (von der Malsburg, 1973; Linsker, 1986; Yuille, Kammen, & Cohen, 1989; Miller, 1994). We now report the results of substituting retinal-like connections and noise for those used to simulate cortical development. We demonstrate that when Linsker's (1986) model of self-organization of receptive fields is modified to incorporate spontaneous waves (see Fig. 1A) and early dendritic polarization (see Fig. 1B), the spatial extent of the former controls the spatial profile of receptive fields and their tendency to be isotropic, while the latter introduces a bias that helps the development of orientation selectivity. (The material described in this article has appeared in abstract form as Burgi & Grzywacz, 1994a.)
2 Modification of Linsker's Model to Include Waves and Polarized Dendrites

Linsker's model (1986) consists of a multilayer network with feedforward connections. Following MacKay and Miller's (1990) analysis of Linsker's model, we consider two layers (besides, possibly, an input layer). The density of synapses from cells in layer A (for amacrine, in our later discussion) to a given cell in the next ascending layer G (for ganglion) assumes a gaussian distribution as a function of the position of the cells in layer A. Variations in synaptic strength during development are described by a quadratic Hebb-type rule. For a given set of activities in layer A, such as determined by spontaneous uncorrelated noise or waves, this rule expresses the variations in synaptic weight $\omega_i$ of synapse $i$ as a function of its activity $F^A_i$ and the overall activity $F^G$ of the postsynaptic cell. Synaptic maturation is described
Figure 1: Two models of the developing inner plexiform layer (IPL). The ganglion cell dendritic tree, represented as a hatched circle, receives synaptic connections from (A) "dendrite-less" amacrine cells (small circles) or (B) amacrine dendrites (straight lines ending on black squares, which represent the synapses). The position of the synapses is random and follows a gaussian distribution concentric with the ganglion cell's dendritic tree. The dendrodendritic amacrine processes are assumed to originate at the amacrine cell's soma and point toward the ganglion cell's soma. Waves of activity, one of which is shown propagating from top left to bottom right, are modeled by an infinitely extended straight wave front (perpendicular to the direction of propagation) with a gaussian profile of depolarization along that direction. The waves propagate through the IPL and correlate the activity of neighbor cells. This correlation depends on the distance separating two cells, the waves' direction of propagation, and the orientation of the dendritic processes.
by the following dynamical equation (Linsker, 1986):
$$\frac{d}{dt}\,\omega_i = c_1 + \sum_j \left(Q^A_{ij} + c_2\right) D_j\, \omega_j, \tag{2.1}$$
where $Q^A_{ij} = \overline{(F^A_i - \overline{F^A})(F^A_j - \overline{F^A})}$ is the covariance matrix (the overbars denote averages, and $\overline{F^A} = \overline{F^A_i} = \overline{F^A_j}$), $D_j$ is the synaptic density, $c_1 = k - F^G_0(\overline{F^A} - F^A_0)$, and $c_2 = \overline{F^A}(\overline{F^A} - F^A_0)$, with $F^G_0$, $F^A_0$, and $k$ being constants of the Hebbian rule. In this and the next section, we consider the case $c_1 = c_2 = 0$ to compare the outcomes of equation 2.1 for different covariance matrices. In section 4 we analyze the behavior of the system in the $c_1$–$c_2$ plane.

For the original Linsker model, the noise in the layer preceding layer A is uncorrelated and random, and the arbor of connections between these layers is gaussian. Therefore, the covariance function $Q_u$ in layer A has a gaussian form (Linsker, 1986). We define $\sigma_u$ to denote its spatial standard deviation.

For comparison with the Linsker model, we calculated what the covariance function would be if the noise assumed the form of spontaneous waves in layer A instead of uncorrelated noise (see the appendix for details). We recently proposed a biophysical model for the generation and propagation of spontaneous waves of activity in the developing retina (Burgi & Grzywacz, 1994b, 1994c), based on pharmacological results (Sernagor & Grzywacz, 1993). As an approximation to the waves in that model, the waves were modeled here by an infinitely extended straight wave front with a gaussian profile of depolarization along the direction of propagation. In this case, by assuming the synaptic response at each instant to be proportional to the wave-induced depolarization on the amacrine cell, one obtains in the continuum limit (for details, see the appendix)
$$Q_\omega(\vec{r}, \vec{r}\,') = e^{-\Delta^2/8\sigma_\omega^2}\; I_0\!\left(\Delta^2/8\sigma_\omega^2\right), \tag{2.2}$$
where $\vec{r}$ and $\vec{r}\,'$ are two vectors originating at the target cell in layer G (for us, a ganglion cell) and ending at two cells in layer A (for us, amacrine cells), $\Delta = |\vec{r} - \vec{r}\,'|$, $\sigma_\omega$ denotes the spatial standard deviation of the waves, and $I_0(x)$ is the zero-order Bessel function of an imaginary argument.

Finally, we obtained the covariance function for the case of waves exciting the ganglion cell via polarized amacrine dendrites. The dendrite was modeled as a straight process, with a synapse at one end of this process contacting the ganglion cell at a random position in this cell's dendritic tree (see Fig. 1B). For simplicity, an arrow beginning at the middle of the amacrine dendrite and ending at its synapse always pointed toward the center of the ganglion cell's tree. The synaptic response at each instant was taken to be proportional to the integral of the wave-induced depolarization along the length of the dendrite.
The covariance function was computed as the mean covariance due to waves moving in all possible directions. This function is (see the appendix for details)
$$Q_d(\vec{r}, \vec{r}\,') = \int_{|\vec{r}|}^{|\vec{r}|+L} \int_{|\vec{r}\,'|}^{|\vec{r}\,'|+L} \exp\!\left(-\frac{\Delta^2_{x'x''}}{8\sigma_\omega^2}\right) I_0\!\left(\frac{\Delta^2_{x'x''}}{8\sigma_\omega^2}\right) dx'\, dx'', \tag{2.3}$$
where $\vec{r}$ and $\vec{r}\,'$ are vectors originating at the ganglion cell soma and ending at amacrine synapses, $\Delta_{x'x''} = |x'\vec{u}_r - x''\vec{u}_{r'}|$, with $\vec{u}_r$ and $\vec{u}_{r'}$ being unit vectors in the directions of $\vec{r}$ and $\vec{r}\,'$, respectively, and $L$ is the length of the amacrine dendrite.

For the three covariance functions, we assumed the synaptic density function to the ganglion cell to be a gaussian distribution of standard deviation $\sigma_s$, that is,
$$D(\vec{r}\,') = e^{-|\vec{r}\,'|^2/2\sigma_s^2}. \tag{2.4}$$
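The following sketch (ours, not the authors' code) builds discrete versions of these three ingredients on the 17 × 17 lattice mentioned in section 3: the wave covariance $Q_\omega$ of equation 2.2, a coarse quadrature of the dendritic covariance $Q_d$ of equation 2.3, and the synaptic density $D$ of equation 2.4. Parameter values follow the caption of Figure 2.

```python
import numpy as np
from scipy.special import i0e  # i0e(z) = exp(-z) * I0(z), numerically stable

sigma_w, sigma_s, L = 20.0, 50.0, 60.0   # micrometers, as in Figure 2

# 17 x 17 lattice of amacrine positions centered on the ganglion cell soma.
side = np.linspace(-2.0 * sigma_s, 2.0 * sigma_s, 17)
X, Y = np.meshgrid(side, side)
r = np.column_stack([X.ravel(), Y.ravel()])            # (289, 2)

# Equation 2.2: Q_w = exp(-z) I0(z) with z = |r - r'|^2 / (8 sigma_w^2).
diff = r[:, None, :] - r[None, :, :]
Q_w = i0e(np.sum(diff ** 2, axis=-1) / (8.0 * sigma_w ** 2))

# Equation 2.4: gaussian synaptic density of the ganglion cell's arbor.
norms = np.linalg.norm(r, axis=1)
D = np.exp(-norms ** 2 / (2.0 * sigma_s ** 2))

# Equation 2.3: dendritic covariance by a coarse 8 x 8 quadrature along the
# two dendrites (the soma node, with zero unit vector, is treated
# degenerately here; this is a crude sketch, not a production integrator).
units = np.divide(r, norms[:, None], out=np.zeros_like(r),
                  where=norms[:, None] > 0)
s = np.linspace(0.0, 1.0, 8)

def q_d(i, j):
    pts_i = (norms[i] + s * L)[:, None, None] * units[i]   # (8, 1, 2)
    pts_j = (norms[j] + s * L)[None, :, None] * units[j]   # (1, 8, 2)
    z = np.sum((pts_i - pts_j) ** 2, axis=-1) / (8.0 * sigma_w ** 2)
    return i0e(z).mean()                # mean over nodes ~ integral / L^2

# Assembling the full matrix is slow but simple.
Q_d = np.array([[q_d(i, j) for j in range(len(r))] for i in range(len(r))])
```

Up to the constants that the authors drop, these matrices are the inputs to the eigenvector analysis of the next section.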
The distribution in equation 2.4 reflects the dendritic arbor of the ganglion cell.

3 Eigenvector Analysis

To determine what kinds of receptive fields emerge from these models, one must solve equation 2.1. For this purpose, the covariance functions were first discretized on square lattices (size 17 × 17). Next, to obtain all possible outcomes of equation 2.1, we calculated the eigenvalues and eigenvectors of the matrix $(Q^A_{ij} + c_2)D_j$ (the general solution is a linear combination of the
fundamental solution set $\{e^{\lambda_1 t}\vec{u}_1, e^{\lambda_2 t}\vec{u}_2, \ldots, e^{\lambda_n t}\vec{u}_n\}$, where $\lambda_i$ is the eigenvalue corresponding to the eigenvector $\vec{u}_i$ and the coefficients of the linear combination are determined by the initial conditions, plus a term corresponding to a particular solution of the nonhomogeneous equation). Inspection of equations 2.2–2.4, as well as Linsker's covariance function, reveals that the parameter space involves only the spatial spread of the arbor to the amacrine layer in Linsker's model ($\sigma_u$), the waves' width ($\sigma_\omega$), the dendritic length ($L$), and the synaptic density spread ($\sigma_s$). Because these parameters have the dimensions of space, it is possible to study the general behavior of the solutions of these equations with only three dimensionless parameters, by scaling $\sigma_u$, $\sigma_\omega$, and $L$ by $\sigma_s$, the constant common to the three models. To allow a direct comparison between Linsker's model and our two-layer models, we assumed $\sigma_u = \sigma_\omega = \sigma$. Hence, the only free parameters left to explore were $\sigma/\sigma_s$ and $L/\sigma_s$.

To determine the eigenvalues and eigenvectors that solve equation 2.1, we applied the transformation $t_j = D_j^{1/2}\omega_j$, which symmetrizes the
matrix $Q^A_{ij}D_j$ (MacKay & Miller, 1990). The symmetrized matrix was then transformed into a tridiagonal matrix using Householder reduction, and the eigenvalues and eigenvectors were obtained using the QL algorithm (Wilkinson & Reinsh, 1971).

If one assumes (see also MacKay & Miller, 1990) that the synaptic weights must stop growing at some point, then the final receptive fields will tend to be dominated by the eigenvectors with the highest eigenvalues. This domination would not be absolute; variations among cells will occur because of initial random variations in wiring. Initial wiring and termination of development are complex processes, which may depend, for example, on transient gene expression of particular proteins or on immature morphologies of dendritic trees. Consequently, to avoid obscuring the main points of the article, we decided to omit the detailed mathematical and computational analyses of the outcome given various initial conditions and termination rules. Accordingly, our results, expressed in terms of eigenvalues and eigenvectors, should be interpreted only as tendencies for symmetry breaking of the different covariance functions, and not as final incidences of various cell types.

The principal eigenfunctions in all three models (for $c_1 = c_2 = 0$) were found to be circularly symmetric, and symmetry breaking could occur only if the next eigenfunction in line dominated by random chance (see Fig. 2). To measure the affinity of the network for symmetry breaking, we considered the ratio between the largest and second-largest eigenvalues. This ratio was high for $Q_\omega$ (in comparison to Linsker's original model; see Fig. 2A). The ratio became even higher as we widened the waves. The physical reason for this tendency is that wide waves tend to correlate the activity of many amacrine cells, thus averaging out local statistical orientational biases that might exist by chance in the retinal wiring. The ratio between the two largest eigenvalues was smaller when dendrites were present than when they were not. Moreover, the symmetrical center-surround eigenvector that was in the second position for $Q_u$ and $Q_\omega$ was relegated to the fourth position for $Q_d$, behind two anisotropic eigenfunctions (see Fig. 2C). Taken together, these two changes indicate that dendritic polarization biases the system toward symmetry breaking, facilitating the formation of orientationally selective cells.

Although the three sets of eigenfunctions corresponding to $Q_u$, $Q_\omega$, and $Q_d$ resemble each other visually (see Fig. 2), they are not identical. For Linsker's case, the eigenfunctions of $Q_u$ have been derived analytically by MacKay and Miller (1990). Although we could not do so for $Q_\omega$ and $Q_d$ due to the Bessel function, we could verify that the analytic expressions of their eigenfunctions differ from those of $Q_u$. It is possible to see these differences reflected in the eigenfunctions. For instance, the main eigenfunctions are flatter for waves (the larger central white area) than for uncorrelated noise. Similarly, changes between positive and negative regions of the other eigenfunctions are faster for uncorrelated noise than for waves. The wave eigenfunctions are flatter because of the "infinite" extent of the waves' fronts, which coordinates cells over large distances.
Figure 2: Principal eigenfunctions of the covariance functions. The three cases illustrated in this figure correspond to the original Linsker model with uncorrelated noise (A), to a modification of that model, substituting spontaneous waves for uncorrelated noise (B), and to a modification of that model using both waves and polarized dendritic inputs (C). In each column, the eigenfunctions have the same eigenvalue, the column with the largest eigenvalue being at the left. The ratio between every two successive eigenvalues is indicated by the number between every two columns. The gray-scale bar is proportional to synaptic strength. Each eigenfunction has been normalized to use the full gray scale. Parameters were $\sigma_\omega = 20\ \mu$m (Burgi & Grzywacz, 1994c), $\sigma_u = 20\ \mu$m, $\sigma_s = 50\ \mu$m, and $L = 60\ \mu$m (Kolb, 1982). While the left-most eigenfunctions represent concentric receptive fields that will be favored by the model, the next receptive fields in line are orientationally selective. Other simulations with $\sigma_\omega$, $\sigma_s$, and $L$ ranges of 12–100 µm, 10–80 µm, and 25–200 µm, respectively, were performed with similar results. In comparison to development with waves only, development with dendritic polarization helps symmetry breaking. Continued following page.
How variations in $\sigma_\omega/\sigma_s$ and amacrine dendritic length affect the tendency for symmetric eigenfunctions (estimated by $\lambda_1/\lambda_2$) is shown in Figure 3. Large values of $\sigma_\omega/\sigma_s$, corresponding to either wide waves or narrow synaptic density distributions, were found to favor concentric cells (see Fig. 3A).
Figure 2: Continued.
This result is consistent with wide waves (wide with respect to $\sigma_s$) tending to synchronize the activity of large groups of cells whatever the waves' directions of propagation. A similar interpretation can be given for uncorrelated random noise, where a wide arbor of connections between layer A and the layer preceding it smoothes out local statistical biases for orientation selectivity present in the wiring, leading to more symmetry (see Fig. 3A). The effect of dendritic length on symmetry breaking is shown in Figure 3B. The value of $\lambda_1/\lambda_2$ decreases as dendritic length increases, with the largest variations for lengths smaller than or comparable to the spatial spread of the waves. Therefore, larger dendritic lengths help symmetry breaking. For very small dendritic lengths, $\lambda_1/\lambda_2$ becomes equivalent to the case without dendrites.
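For readers who want to reproduce the $\lambda_1/\lambda_2$ diagnostic, here is a minimal sketch (our addition) of the symmetrization described in section 3; numpy's symmetric eigensolver plays the role of the Householder/QL routines cited from Wilkinson and Reinsh (1971), and the matrices can be taken from the earlier sketch:

```python
import numpy as np

def eig_ratio(Q, D):
    """Ratio of the two largest eigenvalues of Q D (the c1 = c2 = 0 case).

    The substitution t_j = sqrt(D_j) * w_j turns Q D into the symmetric
    matrix D^(1/2) Q D^(1/2), which has the same eigenvalues.
    """
    S = np.sqrt(D)
    M = S[:, None] * Q * S[None, :]
    evals = np.linalg.eigvalsh(M)      # returned in ascending order
    return evals[-1] / evals[-2]

# Example with the matrices from the previous sketch:
# print(eig_ratio(Q_w, D), eig_ratio(Q_d, D))
```

A large ratio means the circularly symmetric principal eigenfunction dominates robustly; a ratio near 1 means random initial conditions can tip development toward the anisotropic runner-up, that is, toward symmetry breaking.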
Figure 3: Effects of parameter variations on symmetry breaking. (A) The tendency for isotropy, expressed as $\lambda_1/\lambda_2$, is plotted against the spatial spread ($\sigma_u$) of the arbor at the input to the amacrine cells (for the original Linsker model) or the spatial spread ($\sigma_\omega$) of the waves along the direction of their propagation (for the models substituting waves for uncorrelated noise). These plots are for either the presence or the absence of polarized dendritic inputs ($L/\sigma_s = 1.2$) to the ganglion cell. (B) The ratio $\lambda_1/\lambda_2$ is plotted against dendritic length for $\sigma_\omega/\sigma_s = 0.4$. The results show that symmetry breaking becomes more likely as the size of the amacrine dendrites grows and as the spreads of the waves and of the arbor at the input to the amacrine cells diminish.
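The trend in Figure 3A for the wave covariance can be reproduced end to end with a few lines (our sketch; it simply combines the two previous ones into a single self-contained scan):

```python
import numpy as np
from scipy.special import i0e

def ratio_for(sigma_w, sigma_s=50.0):
    # Wave covariance (eq. 2.2) and density (eq. 2.4) on a 17 x 17 lattice,
    # then the largest-to-second eigenvalue ratio of D^(1/2) Q D^(1/2).
    side = np.linspace(-2.0 * sigma_s, 2.0 * sigma_s, 17)
    X, Y = np.meshgrid(side, side)
    r = np.column_stack([X.ravel(), Y.ravel()])
    diff = r[:, None, :] - r[None, :, :]
    Q = i0e(np.sum(diff ** 2, axis=-1) / (8.0 * sigma_w ** 2))
    D = np.exp(-np.sum(r ** 2, axis=1) / (2.0 * sigma_s ** 2))
    S = np.sqrt(D)
    evals = np.linalg.eigvalsh(S[:, None] * Q * S[None, :])
    return evals[-1] / evals[-2]

for sw in (10.0, 20.0, 40.0, 80.0):    # wave widths in micrometers
    print(sw, round(ratio_for(sw), 2))
```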
Finally, how variations in the waves' width and amacrine dendritic length affect the shape of the eigenfunctions is illustrated in Figure 4. For waves of spatial spread four times wider than those used in Figure 2B, we found the same set of eigenfunctions, but more spatially extended (roughly 2.5 times more extended; see Fig. 4A). Consequently, the spatial extent of the waves may be important not only in controlling the tendency to form concentric receptive fields but also in contributing to the fields' spatial extent. (In support, there is now evidence that spontaneous waves are key to the determination of receptive field size during retinal development; Sernagor & Grzywacz, 1994, 1995a, 1995b, 1996.) In turn, although increases in receptive field size were small when dendritic length was doubled (see Fig. 4B), the order of importance of the eigenfunctions changed. The center-surround eigenfunction was removed from the set of the first six eigenfunctions (see Fig. 2C), being replaced by anisotropic eigenfunctions with three angular nodes, reminiscent of immature receptive fields with multi-axis anisotropy as observed in embryonic turtle retina (Sernagor & Grzywacz, 1995a).

4 Analysis of Behavior in the c1 − c2 Plane

The parameters $c_1$ and $c_2$ depend on the choice of the Hebbian thresholds ($F^G_0$ and $F^A_0$) and the average activity in layer A ($\overline{F^A}$). (In our models, this average activity is fixed by the waves' properties, that is, their spatial extent, speed, and frequency of occurrence.) While parameter $c_2$ affects the eigenvalues of the matrix $(Q^A_{ij} + c_2)D_j$, parameter $c_1$ moves the fixed point of the dynamics with respect to the origin. Because $Q^A_{ij} = \overline{(F^A_i - \overline{F^A})(F^A_j - \overline{F^A})}$, the entries of this matrix depend on both its spatial properties (determined by $i - j$) and the overall level of activity (determined by $\overline{F^A}$). But although the effects of the spatial properties of $Q^A_{ij}$ are the central theme of this article, the square of the overall level of activity, which is proportional to the entries of $Q^A_{ij}$, has the same dimensions as $c_2$. Therefore, it is possible to study the general behavior of the eigenvalues of $(Q^A_{ij} + c_2)D_j$ with a single dimensionless parameter by scaling $c_2$ by the maximum of $Q^A_{ij}$. Because this scaling is equivalent to normalizing the maximum of $Q^A_{ij}$ ($Q^A_{ii}$) to 1, in effect we will be considering the correlation matrices instead of the covariance ones.

The parameter $c_1$ does not affect the eigenvalues of $(Q^A_{ij} + c_2)D_j$. Rather, it affects the coefficients of the linear combination of $\{e^{\lambda_1 t}\vec{u}_1, e^{\lambda_2 t}\vec{u}_2, \ldots, e^{\lambda_n t}\vec{u}_n\}$ in the solution of equation 2.1. The effect on these coefficients is such that if the magnitude of $c_1$ is large, then it gives a significant head start to the all-excitatory (all-inhibitory) profiles. Feng, Pan, and Roychowdhury (1995) analyzed how this head start would affect the outcome of the evolution of equation 2.1 if the synaptic weights were constrained to lie between fixed minimum and maximum values.
Figure 4: Effects of parameter variations on eigenfunctions. Parameters are the same as in Figure 2, except that in (A) the waves' width is 80 µm and in (B) the amacrine dendritic length is 120 µm. Comparison of A with Figure 2B shows that the width of the receptive field grows with the width of the waves. Comparison of B with Figure 2C shows that the width of the receptive field grows with the length of the amacrine dendrites.
These authors reached the conclusion that the stable fixed-point solutions to the equation would lie in four domains where receptive-field profiles are (1) all excitatory, (2) all inhibitory, (3) all excitatory or all inhibitory, or (4) of various shapes, including anisotropic ones. The first two domains occur when $c_1$ is large and the latter two when it is close to zero. It is possible that large values of $c_1$ are relevant for retinas that have a majority of concentric ganglion cells, like the primate and cat retinas. (In section 5, we will address how surround inhibition may be added to all-excitatory profiles.) However, if $c_1$ is large, there is nothing more to discuss; receptive fields will be isotropic. Consequently, in the rest of this section, we will focus on the cases $c_2 > 0$ and $c_2 < 0$ for $c_1 = 0$.

The Case $c_2 > 0$. As stated by MacKay and Miller (1990), the eigenvectors affected by $c_2$ are those with nonzero DC components, that is, the all-excitatory (all-inhibitory) and the center-surround connection patterns (the strongest and weakest profiles in Fig. 2C, respectively). For $c_2$ large and positive, MacKay and Miller (1990) have observed that $Q^A_{ij} + c_2$ can be treated as
a first-order perturbation to the matrix $c_2 J$, where $J$ is the matrix with $J_{ij} = 1$. This
matrix has only one nonzero-DC eigenvector with positive eigenvalue: the vector $\vec{n}$, $n_i = 1$, which is symmetric. Consequently, the expectation is that as $c_2$ increases, so does the tendency to develop symmetric receptive fields. A mathematical proof (MacKay & Miller, 1990) and computer simulations (see Fig. 5) confirm this expectation for the three cases we are considering. When $c_2$ increases from zero to a large and positive value, we find that the eigenvalues (tendency) of both nonzero-DC, symmetrical profiles rise in comparison to those of the non-DC, orientationally selective ones. In addition, the simulations show that the tendency is for symmetrical profiles to rise faster for the uncorrelated-noise case than for the spontaneous-waves cases. Mathematically, this result follows from the Bessel function in equations 2.2 and 2.3 being a slowly decreasing function, which thus perturbs $c_2 J$ more than a pure gaussian function would. Physically, this result is probably a consequence of the anisotropic shape of the waves imparting more orientation selectivity onto the cells. We make this conjecture because as one widens the waves and makes them more isotropic, the receptive fields themselves become more isotropic (see Figs. 3A and 4A).

The Case $c_2 < 0$. For $c_2$ large and negative, when one performs the perturbation analysis, one finds that $c_2 J$ has anisotropic eigenvectors with eigenvalues equal to zero and that its only nonzero-DC, symmetric eigenvector, $\vec{n}$, has a negative eigenvalue, which is proportional to $c_2$. As a result, the eigenvalue of the all-excitatory (all-inhibitory) eigenvector becomes negative, making this receptive field essentially irrelevant (MacKay & Miller, 1990). Computer simulations also reveal that the eigenvalue of the center-surround eigenvector increases (and that of all other eigenvectors with nonzero-DC components) as $c_2$ decreases from zero to a large and negative value.
Figure 5: Effects of the parameter $c_2$ on eigenfunctions. All-excitatory (all-inhibitory) profiles are schematically represented by circles; center-surround profiles by black annuli; one-axis anisotropy (second column of Fig. 4B) by half-white–half-black disks; and two-axis anisotropy (third column of Fig. 4B) by disks bisected by lines. The ordering of the profiles is like that in Figures 2 and 4, except that the receptive fields here are arranged vertically, with the most important ones at the top. The parameters $\sigma_u$, $\sigma_\omega$, $\sigma_s$, and $L$ are the same as in Figure 2. For large $|c_2|$, uncorrelated noise favors symmetrical receptive fields in comparison to spontaneous waves. In addition, this figure illustrates that the bias toward symmetry breaking resulting from dendritic polarization is independent of the choice of $c_2$.
This result was first observed by MacKay and Miller (1990), who pointed out in their Theorem 2 that as $c_2$ varies, the eigenvalues of the nonzero-DC eigenvectors "increase continuously and monotonically with ($c_2$) between asymptotic limits such that the upper limit of one eigenvalue is the lower limit of the eigenvalue above." Thus, as $c_2 \to -\infty$, the eigenvalue of the all-excitatory (all-inhibitory) profile falls toward the upper limit of the center-surround profile; similarly, the eigenvalue of the center-surround profile also falls. However, as the original all-excitatory (all-inhibitory) profile falls, it is transformed continuously into a center-surround profile, making it appear as though the eigenvalue of the center-surround eigenvector increases (MacKay & Miller, 1990).

Qualitatively, the fall of the all-excitatory (all-inhibitory) profile and the apparent rise of the center-surround profile as $c_2$ becomes negative occur for all three covariance matrices (see Fig. 5). But for $Q_\omega$ and $Q_d$, the rise of the center-surround eigenvector is slower than for $Q_u$. As before, the mathematical explanation for this difference stems from the wider spatial extent of $Q_\omega$ and $Q_d$ resulting from their Bessel-function component. Again, physically, the preference of $Q_\omega$ and $Q_d$ for anisotropic profiles may be due to the anisotropic shape of the waves.

5 Discussion

5.1 Possible Roles for Spontaneous Waves of Activity. Previous models for self-organization of orientation selectivity in cortex assumed uncorrelated random noise on the first layer of the network (von der Malsburg, 1973; Linsker, 1986; Yuille et al., 1989; Stetter, Lang, & Müller, 1993). This assumption is not applicable to the developing retina because there is evidence for waves of spontaneous discharges propagating across the retinal surface and, in the process, correlating the activity of neighboring cells (in cat and ferret: Meister et al., 1991; Wong et al., 1993, 1995; in rat: Maffei & Galli-Resta, 1990; in turtle: Sernagor & Grzywacz, 1993, 1995a). This correlation is observed during the period of retinal synaptogenesis (Maslim & Stone, 1986; Horsburgh & Sefton, 1987; De Juan, Grzywacz, Guardiola, & Sernagor, 1996) and of dendritic growth and remodeling (Wong, Herrmann, & Shatz, 1991; De Juan, Grzywacz, Guardiola, & Sernagor, 1995). Consequently, one must ponder the role that these spontaneous correlating waves of activity may have in the formation of retinal receptive fields.

We modified one of the cortical models (Linsker's) to include waves and compared the tendencies for formation of particular types of receptive fields in the new and old models through an analysis of their covariance matrices. In this analysis, the likely types of receptive fields were described by these matrices' eigenvectors with highest eigenvalues, and the tendency of each type of receptive field to develop was estimated by the eigenvalues. The main eigenfunctions of the Linsker and wave-modified covariance matrices were found to have similar profiles. These profiles were purely
excitatory concentric, orientationally selective, and excitatory-center–inhibitory-surround concentric. Wide waves tend to favor concentric receptive fields more than narrow waves do (see Fig. 3A). The physical reason for the isotropy tendency of wide waves in the model is that their wide extent tends to correlate the activity of many amacrine cells, thus averaging out local statistical orientational biases that might exist by chance in the retinal wiring. Whether this averaging really favors isotropy may be debatable, since averaging also weakens the tendency for the center-surround profile (see Fig. 4A). In Figure 3, we used the all-excitatory profile to quantify isotropy, because in the retina, surround inhibition appears early in development, even before mature concentric or orientationally selective receptive fields emerge (Sernagor & Grzywacz, 1995a). This suggests that perhaps the inhibitory surround of center-surround profiles might be determined genetically, independent of spontaneous activity. If this suggestion is incorrect and the emergence of the surround depends on spontaneous activity, then $c_2$ must be negative to eliminate the all-excitatory profile (see Fig. 5). In this case, the fact that the early retinal noise comes in the form of waves would favor the formation of orientation selectivity. As discussed in section 4, this bias toward anisotropic profiles may be due to the anisotropic shape of the waves.

Another possible role for waves is contributing to the size of receptive fields. Waves tend to yield wider receptive fields than uncorrelated noise because of the large lateral extent of the wave fronts. In addition, wider waves tend to give rise to wider receptive fields (see Fig. 4A). This coupling between wave width and receptive field size could have implications for explaining receptive field size differences across species.

Recent data are strongly supportive of the role of waves for both the formation of concentric receptive fields and the control of their size. Normally, in turtle, the wavelike activity lasts until about postnatal day 21 (P21) and then disappears (Sernagor & Grzywacz, 1995a). Coincident with the disappearance of waves, the receptive field sizes mature (Sernagor & Grzywacz, 1995a). However, if one dark-rears the turtles, the wavelike activity is stronger and longer lasting (>P40), and the receptive fields become larger and more concentric (Sernagor & Grzywacz, 1994). Furthermore, the density of growth cones (and thus of dendritic growth) increases under dark rearing (De Juan et al., 1995). In turn, if one blocks the waves chronically by implanting in the retina curare-soaked Elvax (Elvax is an ethylene vinyl-acetate copolymer) (Sernagor & Grzywacz, 1995b, 1996), the receptive fields stop growing and become less concentric. Although there are other explanations for the changes in receptive field size and concentricity with dark rearing and curare implantation, these changes are generally consistent with our model.

5.2 Possible Roles for Early Dendritic Polarization. To mimic the polarization of poorly branched dendritic trees of immature cells (Maslim et al., 1986; Ramoa et al., 1988; Dunlop, 1990; Vanselow et al., 1990), we modified the original Linsker model to include (waves and) amacrine dendritic
processes pointing toward the ganglion cell's soma and making contact with its dendritic tree. We then applied the covariance-matrix analysis to this case. The ratio between the two largest eigenvalues of the resulting covariance matrix was found to be smaller than for the model with only waves (see Fig. 3A), and this effect was more pronounced for longer dendrites (see Fig. 3B). Moreover, for long dendrites, the eigenfunction order changed, as the center-surround concentric eigenfunctions were relegated to a less dominant position than anisotropic ones (see Figs. 2C and 4B). These results indicate that dendritic polarization biases the system toward orientation selectivity. Such a bias was also suggested by Sernagor and Grzywacz (1995b), who found physiological correlates of dendritic polarization in immature retinas.

Appendix

This appendix shows the main steps of the derivation of equations 2.2 and 2.3. We want to determine how the activity of two amacrine cells is correlated by spontaneous waves propagating through the amacrine layer. Waves were modeled by an infinitely extended straight wave front and a gaussian profile (of standard deviation $\sigma_\omega$) along the direction of propagation. The synaptic response at each instant was taken to be proportional to the value of the wave on the amacrine cell, that is, $F^A(\vec{r}) \sim \exp\!\left(-(\vec{r}\cdot\vec{u}_\omega + x)^2/2\sigma_\omega^2\right)$, where $\vec{r}$ is the vector originating at the ganglion cell soma and ending at the amacrine cell, $\vec{u}_\omega$ is the unit vector in the direction of the wave's propagation, and $x$ is the shortest distance between the ganglion cell soma and the wave's crest (see Fig. 6). The correlation between two amacrine cells can thus be expressed as $Q(\vec{r},\vec{r}\,') = \overline{F^A(\vec{r})\,F^A(\vec{r}\,')}$, where the overbar denotes an average over a large number of waves and the prime refers to another amacrine cell. The correlation was computed as the mean correlation due to waves moving in all possible directions and spatial locations,
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty}\int_{-\pi}^{\pi} F^A(\vec{r})\, F^A(\vec{r}\,')\, d\theta\, dx, \tag{A.1}$$
where we are omitting the normalization factor and other constants from the analysis, as they only scale the eigenvectors. Developing equation A.1 and taking into account the waves' symmetry, we get
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty}\int_{-\pi}^{\pi} \exp\!\left(-\frac{(|\vec{r}|\cos(\alpha-\theta)+x)^2}{2\sigma_\omega^2}\right) \exp\!\left(-\frac{(|\vec{r}\,'|\cos(\alpha'-\theta)+x)^2}{2\sigma_\omega^2}\right) d\theta\, dx, \tag{A.2}$$
where $\alpha$ and $\alpha'$ are the vector orientations of $\vec{r}$ and $\vec{r}\,'$, respectively, and $\theta$ is the wave's direction of propagation. After arranging the integrand as an exponential of a perfect square, the integral in $x$ is straightforward and gives $\sigma_\omega\sqrt{\pi/2}$.
Figure 6: Schematic representation of the variables used to determine the wave's covariance functions. Two vectors $\vec{r}$ and $\vec{r}\,'$ (dashed arrows) originate at the ganglion cell soma (shown at the origin) and end at two amacrine synapses (represented by filled circles). The orientations of these vectors are represented by the unit vectors $\vec{u}_r$ and $\vec{u}_{r'}$. A wave propagates with a direction represented by the unit vector $\vec{u}_\omega$ and has its peak amplitude (represented by a bold, straight line) at a distance $x$ from the ganglion cell. The amacrine dendrites are shown as solid lines of length $L$.
As constants are omitted from the analysis, equation A.2 becomes
$$Q(\vec{r},\vec{r}\,') \sim \int_{-\pi}^{\pi} \exp\!\left(-\frac{\left(|\vec{r}|\cos(\alpha-\theta) - |\vec{r}\,'|\cos(\alpha'-\theta)\right)^2}{4\sigma_\omega^2}\right) d\theta. \tag{A.3}$$
Using trigonometric manipulations, it can be shown that
$$\left(|\vec{r}|\cos(\alpha-\theta) - |\vec{r}\,'|\cos(\alpha'-\theta)\right)^2 = |\vec{r}-\vec{r}\,'|^2 \cos^2(\theta-\gamma), \tag{A.4}$$
where $\gamma = \tan^{-1}\!\left[(|\vec{r}|\sin\alpha - |\vec{r}\,'|\sin\alpha')/(|\vec{r}|\cos\alpha - |\vec{r}\,'|\cos\alpha')\right]$. Using this definition and the new variable $\phi = \theta - \gamma$ (the integration limits do not change, as $\gamma$ is an arbitrary angle), equation A.3 can be integrated with respect to $\phi$ (Gradshteyn & Ryzhik, 1980) to yield
$$Q(\vec{r},\vec{r}\,') \sim \exp\!\left(-\frac{|\vec{r}-\vec{r}\,'|^2}{8\sigma_\omega^2}\right) I_0\!\left(\frac{|\vec{r}-\vec{r}\,'|^2}{8\sigma_\omega^2}\right), \tag{A.5}$$
where $I_0(x)$ is the zero-order Bessel function of an imaginary argument.
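This closed form is easy to check numerically (a sketch of our own; the test positions are arbitrary). Averaging the product of wave responses over directions and offsets, as in equation A.1, should reproduce equation A.5 up to a constant factor that is independent of $\vec{r}$ and $\vec{r}\,'$:

```python
import numpy as np
from scipy.special import i0

sigma_w = 1.0
r1 = np.array([0.6, 0.2])    # arbitrary test positions (hypothetical values)
r2 = np.array([-0.3, 0.5])

# Direct evaluation of equation A.1 on a fine grid of directions and offsets.
theta = np.linspace(-np.pi, np.pi, 721)
x = np.linspace(-8.0 * sigma_w, 8.0 * sigma_w, 801)
T, Xg = np.meshgrid(theta, x)
u = np.stack([np.cos(T), np.sin(T)], axis=-1)   # wave propagation directions
F1 = np.exp(-(u @ r1 + Xg) ** 2 / (2.0 * sigma_w ** 2))
F2 = np.exp(-(u @ r2 + Xg) ** 2 / (2.0 * sigma_w ** 2))
direct = np.sum(F1 * F2)

# Closed form of equation A.5, with the omitted constants set to 1.
z = np.sum((r1 - r2) ** 2) / (8.0 * sigma_w ** 2)
closed = np.exp(-z) * i0(z)

# The ratio equals the omitted constant and should not depend on r1, r2.
print(direct / closed)
```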
For the calculation of the covariance function for waves hitting polarized dendrites, the dendrite was modeled as a straight process with a dendrodendritic synapse contacting the ganglion cell at a random position in this cell's gaussian dendritic tree. An arrow beginning at the middle of the dendrite and ending at the synapse always pointed toward the middle of the ganglion cell tree. The synaptic response at each instant was taken to be proportional to the integral of the values of the wave along the length of the dendrite, that is,
$$F^A(\vec{r}) \sim \int_{|\vec{r}|}^{|\vec{r}|+L} \exp\!\left(-\frac{(\vec{u}_r\cdot\vec{u}_\omega\, x' + x)^2}{2\sigma_\omega^2}\right) dx', \tag{A.6}$$
where $\vec{u}_r$ is a unit vector in the direction of $\vec{r}$ and $L$ is the dendritic length (see Fig. 6). As before, the correlation between the activity of two amacrine cells, $F^A(\vec{r})$ and $F^A(\vec{r}\,')$, was computed as the mean correlation due to waves moving in all possible directions:
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty} dx \int_{-\pi}^{\pi} d\theta \int_{|\vec{r}|}^{|\vec{r}|+L} \exp\!\left(-\frac{(\cos(\alpha-\theta)\,x' + x)^2}{2\sigma_\omega^2}\right) dx' \times \int_{|\vec{r}\,'|}^{|\vec{r}\,'|+L} \exp\!\left(-\frac{(\cos(\alpha'-\theta)\,x'' + x)^2}{2\sigma_\omega^2}\right) dx''. \tag{A.7}$$
Q(Er, Er ) ∼
Z
|Er|+L Z |Er0 |+L
|Er|
|Er0 |
Ã
120 00 exp − x x2 8σω
!
à I0
12x0 x00 8σω2
! dx0 dx00
(A.8)
where $\Delta_{x'x''} = |x'\vec{u}_r - x''\vec{u}_{r'}|$.

Acknowledgments

We are grateful to Kenneth D. Miller and Anthony M. Norcia for their fruitful comments on an earlier version of this article. This work was supported by a grant from the Swiss National Fund for Scientific Research (8220-37180) to P.-Y.B., by grants from the National Eye Institute (EY-08921) and the U.S. Office of Naval Research (N00014-91-J-1280), by the William A. Kettlewell chair to N.M.G., and by a core grant from the National Eye Institute to Smith-Kettlewell (EY-06883).
References

Burgi, P.-Y., & Grzywacz, N. M. (1994a). Does prenatal development of ganglion-cell receptive fields depend on spontaneous activity and Hebbian processes? Invest. Ophthalmol. Vis. Sci., 35, 2126.
Burgi, P.-Y., & Grzywacz, N. M. (1994b). Model based on extracellular potassium for spontaneous synchronous activity in developing retinas. Neural Computation, 6, 981–1002.
Burgi, P.-Y., & Grzywacz, N. M. (1994c). Model for the pharmacological basis of spontaneous synchronous activity in developing retinas. J. Neurosci., 14, 7426–7439.
Dacheux, R. F., & Miller, R. F. (1981a). An intracellular electrophysiological study of the ontogeny of functional synapses in the rabbit retina. I. Receptors, horizontal and bipolar cells. J. Comp. Neurol., 198, 307–326.
Dacheux, R. F., & Miller, R. F. (1981b). An intracellular electrophysiological study of the ontogeny of functional synapses in the rabbit retina. II. Amacrine cells. J. Comp. Neurol., 198, 327–334.
De Juan, J., Grzywacz, N. M., Guardiola, J. V., & Sernagor, E. (1995). Anatomical and physiological dark-rearing induced changes in turtle retinal development. Invest. Ophthalmol. Vis. Sci., 36, 60.
De Juan, J., Grzywacz, N. M., Guardiola, J. V., & Sernagor, E. (1996). Coincidence of synaptogenesis and emergence of spontaneous and light-evoked activity in embryonic turtle retina. Invest. Ophthalmol. Vis. Sci., 37, 634.
Dowling, J. E. (1987). The retina. Cambridge, MA: Harvard University Press.
Dunlop, S. A. (1990). Early development of retinal ganglion cell dendrites in the marsupial Setonix brachyurus, quokka. J. Comp. Neurol., 293, 425–447.
Feng, J., Pan, H., & Roychowdhury, V. P. (1995). Linsker-type Hebbian learning: A quantitative analysis on the parameter space (Tech. Rep. No. TR-EE 95-12). Purdue University.
Gradshteyn, I. S., & Ryzhik, I. M. (1980). Table of integrals, series, and products. Orlando, FL: Academic Press.
Horsburgh, G. M., & Sefton, A. J. (1987). Cellular degeneration and synaptogenesis in the developing retina of the rat. J. Comp. Neurol., 263, 553–566.
Jacobson, M. (1991). Developmental neurobiology. New York: Plenum Press.
Kolb, H. (1982). The morphology of the bipolar cells, amacrine cells and ganglion cells in the retina of the turtle Pseudemys scripta elegans. Philos. Trans. R. Soc. B, 298, 355–393.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, 83, 8390–8394.
MacKay, D. J. C., & Miller, K. D. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, 257–297.
Maffei, L., & Galli-Resta, L. (1990). Correlation in the discharges of neighboring rat retinal ganglion cells during prenatal life. Proc. Natl. Acad. Sci. USA, 87, 2861–2864.
Masland, R. H. (1977). Maturation of function in the developing rabbit retina. J. Comp. Neurol., 175, 275–286.
Maslim, J., & Stone, J. (1986). Synaptogenesis in the retina of the cat. Brain Research, 373, 35–48.
Maslim, J., Webster, M., & Stone, J. (1986). Stages in the structural differentiation of retinal ganglion cells. J. Comp. Neurol., 373, 35–48.
Meister, M., Wong, R. O. L., Baylor, D. A., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441.
Ramoa, A. S., Campbell, G., & Shatz, C. J. (1988). Dendritic growth and remodelling of cat retinal ganglion cells during fetal and postnatal development. J. Neurosci., 8, 4239–4261.
Rusoff, A. C., & Dubin, M. W. (1977). Development of receptive-field properties of retinal ganglion cells in kittens. J. Neurophysiol., 40, 1188–1198.
Sernagor, E., & Grzywacz, N. M. (1993). Cellular mechanisms underlying spontaneous correlated activity in the turtle embryonic retina. Invest. Ophthalmol. Vis. Sci., 34, 1156.
Sernagor, E., & Grzywacz, N. M. (1994). Role of early spontaneous activity and visual experience in shaping complex receptive field properties of ganglion cells in the developing retina. Neurosci. Abst., 20, 1470.
Sernagor, E., & Grzywacz, N. M. (1995a). Emergence of complex receptive field properties of ganglion cells in the developing turtle retina. J. Neurophysiol., 73, 1355–1364.
Sernagor, E., & Grzywacz, N. M. (1995b). Shaping of receptive field properties in developing retinal ganglion cells in the absence of early cholinergic spontaneous activity. Neurosci. Abst., 21, 1503.
Sernagor, E., & Grzywacz, N. M. (1996). Influence of spontaneous activity and visual experience on developing retinal fields. Current Biology, 6, 1503–1508.
Stetter, M., Lang, E. W., & Müller, A. (1993). Emergence of orientation selective simple cells simulated in deterministic and stochastic neural networks. Biol. Cybern., 68, 465–476.
Tootle, J. S. (1993). Early postnatal development of visual function in ganglion cells of the cat retina. J. Neurophysiol., 69, 1645–1660.
Vanselow, J., Dütting, D., & Thanos, S. (1990). Target dependence of chick retinal ganglion cells during embryogenesis: Cell survival and dendritic development. J. Comp. Neurol., 295, 235–247.
von der Malsburg, C. (1973). Self organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Wilkinson, J. H., & Reinsh, C. (1971). Linear algebra. New York: Springer-Verlag.
Wong, R. O. L., Herrmann, K., & Shatz, C. J. (1991). Remodeling of retinal ganglion cell dendrites in the absence of action potential activity. J. Neurobiol., 22, 685–697.
Wong, R. O. L., Meister, M., & Shatz, C. J. (1993). Transient period of correlated bursting activity during development of the mammalian retina. Neuron, 11, 923–938.
Spontaneous Waves and Dendritic Growth
553
Wong, R. O. L., Chernjavsky, A., Smith, S. J., & Shatz, C. J. (1995). Early functional neural networks in the developing retina. Nature, 374, 716–718. Yuille, A. L., Kammen, D. M., & Cohen, D. (1989). Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern., 61, 183–194.
Received November 14, 1995; accepted July 9, 1996.
Communicated by Klaus Pawelzik
Singularities in Primate Orientation Maps

K. Obermayer
Fachbereich Informatik, TU Berlin, 10587 Berlin, Germany

G. G. Blasdel
Department of Neurobiology, Harvard Medical School, Boston, MA 02215, USA
We report the results of an analysis of orientation maps in primate striate cortex with a focus on singularities and their distribution. Data were obtained from squirrel monkeys and macaque monkeys of different ages. We find that approximately 80% of singularities that are nearest neighbors have the opposite sign and that the spatial distribution of singularities differs significantly from a random distribution of points. We do not find evidence for consistent geometric patterns that singularities may form across the cortex. Except for a different overall alignment of orientation bands and different periods of repetition, maps obtained from different animals and at different ages are found to be similar with respect to the measures used. Orientation maps are then compared with two different pattern models that are currently discussed in the literature: bandpass-filtered white noise, which accounts very well for the overall map structure, and the field analogy model, which specifies the orientation map by the location of singularities and their properties. The bandpass-filtered noise approach to orientation patterns correctly predicts the sign correlations between singularities and accounts for the deviations of the spatial distribution of singularities away from a random dot pattern. The field analogy model can account for the structure of certain local patches of the orientation map but not for the whole map. Neither of the models is completely satisfactory, and the structure of the orientation map remains to be fully explained.

1 Introduction

Based on Hubel and Wiesel's original findings (Hubel & Wiesel 1974), many experiments were performed to determine the two-dimensional lateral distribution of orientation selectivity in the upper layers of the primate striate cortex (Bartfeld & Grinvald 1992; Blasdel 1992a, 1992b; Blasdel & Salama 1986; Blasdel et al. 1995; Obermayer & Blasdel 1993). The results gave rise to several pattern models (Cowan & Friedman 1991; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Swindale 1982; for
further references see Erwin et al. 1995) designed for their interpretation. Before imaging methods were available, pattern models were used to infer a two-dimensional spatial pattern from one-dimensional or sparse two-dimensional data recorded with microelectrodes. Today two-dimensional patterns can be measured directly, but pattern models are still valuable tools. Pattern models parameterize the observed data and are important for quantitative comparisons of all sorts, for example, between patterns observed in different species, between patterns observed during development, and between data and models. In this article we report an analysis of singularities in primate orientation maps (Blasdel 1992a, 1992b; Obermayer & Blasdel 1993; see Bonhoeffer & Grinvald 1991 for similar structures in the cat cortex) that were measured by optical methods. Data were obtained from squirrel monkeys and macaque monkeys of different ages. We chose these two species because macaque monkeys have ocular dominance columns while squirrel monkeys do not, and we chose macaques of different ages in order to search for changes during development. Our experimental results are then compared with pattern models: (1) bandpass-filtered white noise (Cowan & Friedman 1991; Erwin et al. 1995; Freund 1994; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Shvartsman & Freund 1994; Swindale 1982), which assumes that orientation maps can be fully described by their two-point correlation function, and (2) the field analogy model (FAM) (Wolf et al. 1994a, 1994b), which claims that orientation maps are fully determined by their singularities and that the orientation map in between is optimally smooth.

The article is divided into five sections. In section 2 we describe the data, their mathematical representation, and the methods for data analysis. In the third section we introduce both pattern models. The fourth section contains the results of our data analysis and the comparison between data and models, and the fifth section concludes with a brief discussion.
2 Data: Singularities and Methods of Analysis

2.1 Data. We analyzed 12 orientation maps obtained with optical imaging using standard methods (Blasdel 1992a, 1992b). The data (see Table 1) include six maps from adult macaques (varieties nemestrina and fascicularis), five maps from baby macaques of different ages, and one map from an adult squirrel monkey. Pictures of the spatial patterns of orientation selectivity and details of the experimental procedures can be found in the publications cited in Table 1. For the purpose of this article, each "raw" data set corresponds to a two-dimensional pixel image with 2 bytes per pixel, where the first byte codes for the orientation preference and the second byte for the orientation tuning strength. Raw data were converted to a complex orientation map, which was used for subsequent analyses. Every pixel corresponds to 8.6 µm for high-power and 15.6 µm for low-power images, respectively.
Table 1: Orientation Maps Used in This Study

Species               Name     Age     Size    Density of singularities (mm^-2)   λ_OR
                               (wks)   (mm²)   pos. sign        neg. sign         (µm)
Macaca nemestrina     NM1 a    adult   14.5    3.9              3.8               768
                      NM2 a    adult   14.5    4.7              4.7               645
                      NM3 a    adult   48.0    3.9              3.8               621
                      NM4 a    adult   48.0    3.6              3.2               719
                      BN1 b    14.0    14.5    3.9              3.8               700
                      BN2 b    9.0     48.0    3.7              3.7               658
                      BN3 b    7.5     14.5    4.5              4.5               615
                      BN4 b    5.5     48.0    3.7              3.7               714
Macaca fascicularis   FS1 a,d  adult   48.0    4.4              4.4               676
                      FS2 a,d  adult   48.0    4.2              4.0               647
                      FS3 b    3.5     14.5    3.9              3.9               660
Saimiri sciureus      SM1 c    adult   10.3    5.3              5.8               512

Note: The average wavelength λ_OR of the orientation map was averaged over all directions in the cortex, neglecting the anisotropy of the orientation pattern. The density of singularities does not necessarily correlate with the average spatial frequency, because the density also depends on the coherence properties (see Obermayer et al. 1992) of the maps.
a Data from Obermayer & Blasdel (1993).
b Data from Blasdel et al. (1995).
c Unpublished data.
d Data from two adjacent regions of the same animal.
2.2 Mathematical Description of Orientation Maps. The spatial distribution of orientation selectivity can be mathematically described by a complex orientation field O(r) (Swindale 1982),
$$O(\vec r) = q(\vec r)\,\exp\bigl(2 i\,\phi(\vec r)\bigr), \tag{2.1}$$
where i is the imaginary unit. The phase φ(r) denotes the preferred orientation at the cortical location r = (x, y)^T, and the amplitude q(r) is a measure of orientation tuning. The factor of two must be introduced because orientation preferences are π-periodic while the phase φ(r) has period 2π. If we adopt this mathematical description, singularities in orientation maps correspond to the zeroes of the function O(r). Close to a singularity, the orientation field O(r) can be linearized. The singularity can then be described by the location r_j = (x_j, y_j) of the zero crossing and by the four components of the normal vectors of the real and the imaginary tangential planes. For an easier interpretation of the pattern of iso-orientation lines close to a singularity, it is convenient to combine the four components of the normal vectors into four other quantities: amplitude a_j ∈ R_0^+, anisotropy α_j ∈ R, angle ρ_j ∈ [0, π], and skewness σ_j ∈ [0, π] (Freund 1994). The orientation field O_j around a singularity j is
then given by
$$O_j = X_j + i\,\alpha_j Y_j, \tag{2.2}$$
where
$$\begin{aligned} X_j &= a_j\bigl\{\ \ (x - x_j)\cos(\rho_j) + (y - y_j)\sin(\rho_j)\bigr\},\\ Y_j &= a_j\bigl\{-(x - x_j)\sin(\rho_j + \sigma_j) + (y - y_j)\cos(\rho_j + \sigma_j)\bigr\} \end{aligned} \tag{2.3}$$
denote the cartesian components of the orientation vector. The sign v_j of a singularity j is given by
$$v_j = \operatorname{sign}(\alpha_j). \tag{2.4}$$
It is positive if orientation preferences increase in a clockwise direction around the singularity and negative otherwise.

2.3 Measurement of Locations, Signs, and Angles. In order to determine the positions of singularities, changes of orientation preference are added clockwise along small circles of radius 50 µm around every point in the image. If the sum Γ along a closed loop is close to ±180 degrees, then a singularity is enclosed by the circle, and the pixel is marked by +1 or −1, where ±1 is the sign v_j of the singularity. Otherwise the pixel is marked by 0.¹ Note that Γ assumes multiples of 180 degrees within the numerical precision of the calculation. Disks of radius 50 µm are then fitted to the regions marked +1 and −1, and the location of the center of the disk having maximum overlap with the corresponding region is taken as the spatial location of the singularity. For the calculation of orientation angles, an arbitrary direction is chosen as the cortical x-axis, and the orientation value of the pixel located at a distance of 5 pixels along the x-axis from the center of the singularity is determined.

¹ If Γ is close to ±360 degrees, then a singularity is enclosed around which orientation preferences cycle twice. Singularities with Γ close to ±360 degrees, however, never occur in the experimental data.
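To make the winding-number test above concrete, here is a minimal sketch (our own illustration, not the authors' code; the function names, the tolerance, and the 6-pixel loop radius, which corresponds to roughly 50 µm at 8.6 µm per pixel, are our assumptions). The disk-fitting step that refines the final locations is omitted.

```python
import numpy as np

def wrap(dphi):
    # map an orientation difference into [-90, 90) degrees
    # (orientation preference is 180-degree periodic)
    return (dphi + 90.0) % 180.0 - 90.0

def mark_singularities(phi, radius=6, n_samples=32, tol=10.0):
    """phi: 2-D array of orientation preferences in degrees.
    Returns an int array with +1/-1 where the clockwise loop sum Gamma
    is close to +180/-180 degrees, and 0 elsewhere."""
    h, w = phi.shape
    marks = np.zeros((h, w), dtype=int)
    # clockwise sample points on a circle of the given pixel radius
    t = -np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    dy = np.round(radius * np.sin(t)).astype(int)
    dx = np.round(radius * np.cos(t)).astype(int)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            loop = phi[y + dy, x + dx]
            # Gamma: sum of wrapped orientation changes around the loop
            gamma = wrap(np.diff(np.r_[loop, loop[0]])).sum()
            if abs(abs(gamma) - 180.0) < tol:
                marks[y, x] = 1 if gamma > 0 else -1
    return marks
```

Disks of radius 50 µm would then be fitted to the connected ±1 regions to obtain the final singularity locations, as described above.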
2.4 Analysis of Local Geometric Patterns of Singularities. One cannot expect to find singularities arranged on a regular grid in a crystal-like fashion. Even if local geometric patterns exist, they are quite likely subject to changes in position, orientation, and size due to inherent biological variability or to artifacts of the measurement procedures. Methods to detect local features should be robust against these changes. In the following, we consider a method that is invariant under translation, rotation, and scaling.

The local pattern S_j(r) around a singularity j can be described by a sum of δ-functions,
$$S_j(\vec r) = \sum_{l \in D_j} \delta(\vec r - \vec r_l), \tag{2.5}$$
where D_j denotes a circular region around singularity j. The radius d of D_j was chosen to be 275 µm (32 pixels) for high-power images and 500 µm (32 pixels) for low-power images, respectively. In order to achieve a representation of the local pattern (equation 2.5) that is invariant under translation, rotation, and scaling, we apply the following transformations:

1. Fourier transformation and neglect of phase:
$$\hat S_j(\vec k) = \bigl|\mathcal{F}\bigl(S_j(\vec r)\bigr)\bigr|. \tag{2.6}$$
Ŝ_j(k) is now invariant under translation.

2. Transformation to polar coordinates (κ, φ), with k_y = κ cos φ and k_x = κ sin φ:
$$\tilde S_j(\kappa, \varphi) = \hat S_j\bigl(\{\kappa\cos\varphi,\, \kappa\sin\varphi\}^T\bigr). \tag{2.7}$$

3. Fourier transformation and neglect of phase:
$$T_j(\vec t\,) = \bigl|\mathcal{F}\bigl(\tilde S_j(\kappa, \varphi)\bigr)\bigr|. \tag{2.8}$$
T_j(t) is now invariant under rotation and scaling as well.

Patterns are averaged over all singularities that are located more than a distance d from the border of the orientation map.
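The pipeline of equations 2.6 through 2.8 can be sketched as follows (our own illustration: the nearest-neighbor interpolation and the logarithmically spaced κ samples, which turn scaling into a shift along the κ axis, are our simplifications).

```python
import numpy as np

def invariant_spectrum(pattern, n_kappa=32, n_phi=64):
    """pattern: 2-D array with ones at singularity locations (a pixel
    version of the delta sum of equation 2.5). Returns T, invariant
    under translation, rotation, and scaling of the pattern."""
    # step 1 (equation 2.6): amplitude spectrum -> translation invariance
    amp = np.abs(np.fft.fftshift(np.fft.fft2(pattern)))
    cy, cx = amp.shape[0] // 2, amp.shape[1] // 2
    # step 2 (equation 2.7): resample onto polar coordinates; rotation
    # becomes a shift along phi, scaling a shift along log-spaced kappa
    kappa = np.exp(np.linspace(0.0, np.log(min(cy, cx) - 1), n_kappa))
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    ky = np.round(cy + np.outer(kappa, np.sin(phi))).astype(int)
    kx = np.round(cx + np.outer(kappa, np.cos(phi))).astype(int)
    polar = amp[ky, kx]
    # step 3 (equation 2.8): second amplitude spectrum -> the remaining
    # shifts affect only the phase, so T is rotation and scale invariant
    return np.abs(np.fft.fft2(polar))
```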
3 Models

3.1 The Field Analogy Model. Historically, the field analogy model was motivated by the similarity between contour plots of orientation maps and the lines of force between electric point charges in the plane (Wolf et al. 1994a). This similarity rests on four assumptions:

1. Iso-orientation lines do not form spirals.
2. Iso-orientation lines begin and end at singularities.
3. The complete sequence of orientations is represented only once around a singularity.
4. Singularities are isotropic and do not show skewness (σ_j = α_j = 0).
Let us describe the electric field, which is generated by a distribution of point charges in the plane, via its complex potential U,
$$U = \Phi + i\,\Psi. \tag{3.1}$$
The lines of equal Φ correspond to the iso-potential lines of the charge distribution. The lines of equal Ψ correspond to the lines of force. For a set of point charges j with charge v_j at locations (x_j, y_j), the potential U is given by
$$U = -\sum_j v_j \operatorname{Ln}(z - z_j), \tag{3.2}$$
where z = x + iy, z_j = x_j + i y_j, and Ln(z) = ln|z| + i arg(z). The lines of force are then given by the lines of equal Ψ,
$$\Psi = \arg \prod_j (z - z_j)^{-v_j}. \tag{3.3}$$
For unit charges v_j = ±1 we obtain
$$\Psi(x, y) = \arg \prod_j \bigl\{(x - x_j) - i\,v_j\,(y - y_j)\bigr\} = \arg \prod_j O_j. \tag{3.4}$$
Equation 3.4 defines the FAM. It assumes that the orientation preference map is fully determined by the locations and signs of its singularities. Equation 3.4 can also be derived from a biologically more relevant smoothness criterion (Wolf et al. 1994b). It has been proposed that orientation maps are characterized by the conflicting aims of continuity and diversity (Erwin et al. 1995) and that orientation preferences are mapped across the cortex as smoothly as possible under certain constraints (Obermayer et al. 1990). Let us therefore define an appropriate cost function U,
$$U = \frac{1}{2} \int_V d\vec r\; \bigl|\nabla \Psi(\vec r)\bigr|^2, \tag{3.5}$$
via the magnitude of the gradient of orientation preferences. Because U is proportional to the electrostatic energy for a given distribution of point charges, it is minimized by the orientation map of equation 3.4. The FAM thus claims that orientation preference maps are optimally smooth in the sense of equation 3.5 under the constraint that their singularities are isotropic and given by {(x_j, y_j, v_j)}. In order to test the FAM approach, locations and signs of singularities are determined from the experimental data, and the orientation map is
reconstructed using equation 3.4. The error R between the original map Θ(x, y) and its reconstruction Ψ(x, y) is calculated by
$$R^2 = \frac{1}{N} \sum_{x,y} \Bigl\{ g\bigl(\Psi(x,y) - \Theta(x,y) + \Theta_{\mathrm{ref}}\bigr) \Bigr\}^2, \tag{3.6}$$
where (x, y) are pixel locations, N is the number of pixels in the sum, Ψ is given by equation 3.4, Θ denotes the original orientation preference, and Θ_ref is chosen such that R is minimal. The function g is given by
$$g(\varphi) = \begin{cases} \varphi + \pi & \text{if } \varphi < -\pi/2,\\ \varphi & \text{if } -\pi/2 \le \varphi < \pi/2,\\ \varphi - \pi & \text{if } \varphi \ge \pi/2. \end{cases} \tag{3.7}$$
Due to border effects, the sum in equation 3.6 runs only over pixels that are at least λ_OR away from the border.
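A minimal sketch (our own illustration, not the authors' code; the function names are ours) of the reconstruction of equation 3.4 and the error measure of equations 3.6 and 3.7. Singularity locations (x_j, y_j) and signs v_j are assumed to have been extracted beforehand, for example with the winding-number test of section 2.3.

```python
import numpy as np

def fam_reconstruct(shape, xs, ys, vs):
    """Psi(x, y) = arg prod_j {(x - x_j) - i v_j (y - y_j)}, equation 3.4,
    accumulated as a sum of angles to avoid overflow of the product."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    psi = np.zeros(shape)
    for xj, yj, vj in zip(xs, ys, vs):
        psi += np.angle((x - xj) - 1j * vj * (y - yj))
    return np.angle(np.exp(1j * psi))  # wrap back into (-pi, pi]

def g(phi):
    """Wrap an angle difference into [-pi/2, pi/2), cf. equation 3.7."""
    return (phi + np.pi / 2.0) % np.pi - np.pi / 2.0

def error_R(psi, theta, n_ref=181):
    """R of equation 3.6, minimized over the global offset Theta_ref;
    chance level is pi/sqrt(12) = 0.907."""
    refs = np.linspace(-np.pi / 2.0, np.pi / 2.0, n_ref)
    return min(np.sqrt(np.mean(g(psi - theta + r) ** 2)) for r in refs)
```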
3.2 Bandpass-Filtered Noise. Bandpass-filtered noise (BFN) models are a class of pattern models that assume that orientation maps can be fully described by their two-point correlation functions. BFN was first suggested by Swindale (1982) and was discussed in more detail later (Cowan & Friedman 1991; Erwin et al. 1995; Freund 1994; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Shvartsman & Freund 1994), in particular by Schwartz and colleagues. Let us denote the power spectrum of the orientation map O by Ô, and let Ô be given by (Obermayer & Blasdel 1993; Obermayer et al. 1992)²
$$\hat O(\vec k) \sim \exp\left(-2\,\frac{(|\vec k| - k_0)^2}{\sigma_k^2}\right). \tag{3.8}$$

² In most species we find slightly elliptical power spectra (Obermayer & Blasdel 1993). For simplicity we consider isotropic spectra as given by equation 3.8, but the formalism can easily be extended to elliptic spectra (Schwartz 1994).
BFN claims that the essential properties of the orientation maps are captured by patterns
$$O(\vec r) = \mathcal{F}^{-1}\left(\sqrt{\hat O(\vec k)} \cdot \exp\bigl(i\,\chi(\vec k)\bigr)\right), \tag{3.9}$$
where the phases χ(k) are random variables drawn from the interval [0, 2π] with equal probability. The only model parameters that have to be adjusted (in the isotropic case) are the wave number k₀ and the width σ_k.

Models similar to equation 3.9 have also been studied in the physics literature as models of the optical fields that are generated when coherent light is scattered by a random medium (Freund 1994; Freund & Shvartsman 1994; Shvartsman & Freund 1994). Although BFN is primarily thought of as a convenient parametrization of orientation maps, it can be linked to processes underlying development. Swindale (1982) was among the first to point out the relation of equation 3.8 to Mexican hat-shaped cortical interactions and generated quite realistic orientation maps using BFN-type models. Later Schwartz and colleagues explored those models in detail (Rojer & Schwartz 1990; Schwartz 1994) and demonstrated how the variety of stripe and bloblike patterns, as well as the network of singularities, can emerge under simple and plausible assumptions. Linsker (1986), Miller et al. (1989), Miller (1994), and Obermayer et al. (1992) investigated models based on Hebbian learning. They showed that BFN is a robust phenomenon in models of Hebbian learning and, given that those models capture reality, also a robust phenomenon in neural development. BFN emerges from the "linear part" of the developmental dynamics, which predicts the growth of a set of independent Fourier-like eigenmodes. The shape of the BFN filter is then related to the spectrum of eigenvalues, which in turn is related to the shape of the lateral connectivity within the cortex. Consequently, several features of BFN maps are similarly predicted by models based on Hebbian learning. Note that BFN maps are not in general "optimally smooth" in the sense of equation 3.5; hence, optimal smoothness is not expected for Hebbian models either.

For our numerical simulations we choose k₀ = 1 in equation 3.8 without loss of generality and obtain σ_k ∈ [0.2, 0.26] from our experimental data (Obermayer & Blasdel 1993). Numerical simulations were performed with σ_k = 0.25 on a discrete lattice of units, where k₀ = 1 corresponds to a period of the orientation pattern of 12.8 pixels.
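The construction is straightforward to reproduce. The following sketch (our own illustration) draws random phases, imposes the annular power spectrum of equation 3.8 with the parameter values just quoted, and inverse-transforms according to equation 3.9.

```python
import numpy as np

def bfn_map(size=256, period=12.8, rel_width=0.25, seed=0):
    """Generate a complex BFN orientation field O (equation 3.9)."""
    rng = np.random.default_rng(seed)
    k0 = 1.0 / period            # k0 = 1 in the text <-> period of 12.8 pixels
    sigma_k = rel_width * k0     # sigma_k = 0.25 in units of k0
    ky = np.fft.fftfreq(size)[:, None]
    kx = np.fft.fftfreq(size)[None, :]
    k = np.hypot(kx, ky)
    power = np.exp(-2.0 * (k - k0) ** 2 / sigma_k ** 2)   # equation 3.8
    chi = rng.uniform(0.0, 2.0 * np.pi, (size, size))     # random phases
    O = np.fft.ifft2(np.sqrt(power) * np.exp(1j * chi))   # equation 3.9
    phi = 0.5 * np.angle(O)   # preferred orientation, since O = q exp(2i phi)
    q = np.abs(O)             # orientation tuning strength
    return O, phi, q
```

Singularities of the resulting map are the zeroes of O and can be located with the winding-number test of section 2.3.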
4 Results

4.1 Phase Portraits of Primate Singularities. Figure 1 shows four examples of singularities found in our data from the primate cortex. Although many singularities are isotropic, as shown in Figure 1a, visual inspection reveals singularities with a considerable degree of anisotropy or skewness (see Figures 1b and 1c). This is particularly true for the squirrel monkey map, where iso-orientation bands align and give rise to nonisotropic singularities.

4.2 Comparison of FAM Reconstructions with Data. Figure 2 shows the iso-orientation lines of the macaque map NM1 and the squirrel monkey map SM1 in comparison with the iso-orientation lines of the corresponding FAM reconstructions (right). The reconstruction of the macaque map shows a pattern of iso-orientation contours that is in qualitative agreement with the experimental data in the sense that iso-orientation lines begin and end at
Figure 1: Contour plot representation of the orientation map around selected singularities, 400 µm × 400 µm in size, taken from NM1. Black lines indicate iso-orientation contours in intervals of 12.5 degrees. Note that not all singularities are pinwheel-like, as is the singularity shown at the top left (a). Iso-orientation contours are often compressed along one axis and expanded along the other, as shown in (b) and (c). This effect is a result of finite anisotropy or skewness. Panel (d) shows a singularity with two axes of compression.
corresponding singularities in both maps. Details of the contour map are different (arrows), though, and lead to errors in the prediction of actual orientation preferences assigned to map locations. The reconstruction of the squirrel monkey map exhibits a contour pattern different from its original in the sense that iso-orientation lines do not begin and end at corresponding singularities (arrows). This is not surprising, because the squirrel monkey map deviates most from the assumption of isotropic singularities. Table 2 summarizes the average reconstruction errors R (cf. equation 3.6). Reconstruction errors are fairly small for small areas but approach chance
Figure 2: Contour plot representations of orientation maps for NM1 (3.3 mm × 4.4 mm) and SM1 (2.9 mm × 3.6 mm) in comparison with the corresponding FAM reconstructions. Iso-orientation contours denote intervals of 12.5 degrees for the NM1 map and 15.0 degrees for the SM1 map. The arrows indicate corresponding regions in the reconstructions (right) and their originals (left).
level for larger areas.³ This shows that the FAM captures some of the local structure of the orientation maps. In regions where anisotropic singularities are abundant and fractures occur, and in maps that, like SM1, are strongly anisotropic, the FAM approach breaks down and errors approach chance level. The FAM reconstructions of patterns generated by BFN lead to large errors (see Table 2). The main reason is the large number of fractures in the BFN maps, which violate the FAM assumption of optimal smoothness: iso-orientation lines in BFN maps seem to "attract" each other and form narrow regions of rapid change in orientation preference (see Erwin et al. 1995, Figure 6b), while the FAM predicts that iso-orientation lines repel each other, as lines of force do.

³ Errors are always below chance level because of the minimization procedure described in section 3.1.
Table 2: Average Reconstruction Error R (equation 3.6) per Pixel for the FAM Maps, for an Area of Reconstruction of 4.7 mm² (left column) and 24.1 mm² (right column) Located in the Center of the Data Sets

Name   R, 4.7 mm² (degrees)   R, 24.1 mm² (degrees)
NM1    0.69                   n.a.
NM2    0.53                   n.a.
NM3    0.60                   0.79
NM4    0.78                   0.84
BN1    0.68                   n.a.
BN2    0.58                   0.76
BN3    0.68                   n.a.
BN4    0.50                   0.73
FS1    0.81                   0.85
FS2    0.68                   0.73
FS3    0.64                   n.a.
SM1    0.84 (for 2.3 mm²)     n.a.
BFN    0.88                   n.a.

Note: The value for BFN was obtained from a data set of 460 × 460 pixels containing approximately 1000 singularities. Chance level is R = π/√12 = 0.907.
The behavior of iso-orientation lines in experimental maps lies somewhere between these extremes, and neither FAM nor BFN fully captures the local map structure.

4.3 Distribution of Singularities, Sign Correlations, and Orientation Angles. Figure 3 shows the locations of singularities for NM2 and SM1 in comparison with data generated by BFN. Visual inspection does not reveal any consistent local pattern of singularities but shows that nearest neighbors usually have the opposite sign. This remains true for the anisotropic map SM1, though the pattern is compressed along the left oblique axis and expanded along the right oblique axis. Table 3 shows the sign correlations between nearest to fifth-nearest neighbors for all data sets in comparison with BFN. Approximately 80% of nearest neighbors and 60% to 70% of second-nearest neighbors are of the opposite sign. For singularities farther apart, same and different signs occur approximately equally often. Note that there are no significant changes with age.
Table 3: Sign Correlations of Close-by Singularities

Name   Pairs          First (%)      Second (%)     Third (%)      Fourth (%)     Fifth (%)
NM1      46   same    4   9  13     26  17  43     15  33  48     24  26  50     24  22  46
              diff   43  43  87     22  35  57     33  20  52     24  26  50     24  30  54
NM2      51   same    4  10  14     24  18  41     27  31  59     29  22  51     22  18  39
              diff   45  41  86     25  33  59     22  20  41     20  29  49     27  33  61
NM3     199   same   12   9  21     21  17  38     24  19  43     26  25  50     24  25  49
              diff   40  39  79     32  31  62     28  29  57     27  23  50     28  23  51
NM4     252   same    8   8  15     19  19  38     27  24  51     27  27  53     26  21  46
              diff   44  41  85     32  30  62     25  25  49     25  22  47     25  28  54
BN1      45   same    7   2   9     27  13  40     27  18  44     27  27  53     27  29  56
              diff   49  42  91     29  31  60     29  27  56     29  18  47     29  16  44
BN2     222   same    6   6  13     17  27  44     26  22  48     26  22  48     24  26  50
              diff   43  45  87     32  24  56     23  29  52     23  29  52     25  25  50
BN3      54   same   11  15  26     24  22  46     26  11  37     26  24  50     28  20  48
              diff   41  33  74     28  26  54     26  37  63     26  24  50     24  28  52
BN4     238   same   11   7  18     17  15  33     28  22  51     27  27  53     26  24  50
              diff   40  42  82     34  33  67     23  27  49     25  22  47     25  25  50
FS1     242   same    9   8  17     17  19  36     24  29  53     28  24  52     31  22  53
              diff   41  42  83     33  31  64     26  21  47     22  26  48     19  28  48
FS2     285   same    8   8  16     19  18  37     25  26  51     23  19  42     24  29  53
              diff   43  41  84     32  31  63     26  23  49     28  30  58     27  20  47
FS3      48   same   10   6  17     29  10  40     19  15  33     33  23  56     27  27  54
              diff   46  38  83     27  33  60     38  29  67     23  21  44     29  17  46
SM1      57   same   14   9  23     14  21  35     21  26  47     26  23  49     16  21  37
              diff   35  42  77     35  30  65     28  25  53     23  28  51     33  30  63
BFN    3907   same    9  10  19     18  16  34     21  21  42     25  24  49     27  27  54
              diff   41  40  81     33  33  66     29  29  58     25  25  51     23  23  46

Note: For nearest to fifth-nearest neighbors, rows marked "same" give the percentages of ++ and −− pairs and their sum Σ; rows marked "diff" give the percentages of +− and −+ pairs and their sum Σ. Percentages do not always add up exactly because of rounding errors; singularities close to the image borders have not been used.
Figure 3: Locations of singularities for the data sets NM2, SM1, and BFN. The locations of positive and negative singularities are marked by + and −. The arrows indicate the direction toward area 18. Scale bars are 1 mm.
The strong correlations in sign are consistent with the predictions of BFN. For fifth-nearest neighbors, BFN predicts a slight overrepresentation of pairs of the same sign, but the data are too noisy to confirm this overrepresentation.

In order to characterize the distribution of orientation angles, angles were pooled into six bins of 30 degrees and compared by a χ² test to the hypothesis that all orientation angles occur equally often. All sets of data except BN2 are consistent with this hypothesis for α = 0.05.⁴ There are no significant differences between macaque and squirrel monkeys or between animals of different ages. The predictions of BFN also pass the χ² test and are in agreement with the experimental data.

⁴ BN2 is consistent with this hypothesis at a level of significance α = 0.01.

4.4 Distribution of Distances Between Nearest Neighbors. Figure 4 shows histograms of the distances between singularities that are nearest neighbors, independent of sign.
Figure 4: Histograms of distances between singularities that are nearest neighbors, independent of sign. Distances were normalized to the square root of the inverse density of singularities and divided into bins of size one-eighth. The percentage of distances that fall into the corresponding bins is plotted against the value corresponding to the center of the bin. Data points are connected by lines. We used 242 pairs for FS1, 238 pairs for BN4, and 3907 pairs for BFN. The percentage of pairs for RAN was calculated by integrating equation 4.1 over each bin. BFN parameters were taken from section 3.2.
Let us first compare the distributions of distances with the distribution that arises from independent, randomly distributed points in the plane (RAN). Let ρ be the number density of points, and let Δ = 1/√ρ be the square root of the average area per point. The probability density P(d) of distances d between nearest neighbors is then given by
$$P(d) = \frac{2\pi d}{\Delta^2}\,\exp\left\{-\pi\left(\frac{d}{\Delta}\right)^2\right\}. \tag{4.1}$$
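Equation 4.1 is easy to check numerically. The following sketch (our own illustration; the point count and the periodic boundaries, used to avoid edge bias, are our choices) samples random points and compares the mean nearest-neighbor distance with the analytic value, which in units of Δ is 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, box = 1500, 1.0
pts = rng.uniform(0.0, box, (n, 2))
delta = 1.0 / np.sqrt(n / box ** 2)      # Delta = 1/sqrt(rho)

# pairwise distances with periodic boundaries (avoids edge bias)
d = np.abs(pts[:, None, :] - pts[None, :, :])
d = np.minimum(d, box - d)
dist = np.hypot(d[..., 0], d[..., 1])
np.fill_diagonal(dist, np.inf)

nn = dist.min(axis=1) / delta            # NN distances in units of Delta
print(nn.mean())   # approximately 0.5, the mean of equation 4.1
```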
Figure 4 shows that the peaks of the "experimental" distributions of FS1 and BN4 are shifted to the right of the RAN distribution by approximately 0.15Δ. Similar histograms have been constructed for all data sets and have been compared to the hypothesis RAN by a χ² test. The hypothesis RAN has to be rejected for all sets of data (α = 0.05), and the experimental distributions were shifted toward larger distances in all cases. There were no obvious trends between different species or between animals of different ages. A similar result was obtained for the distribution of distances between nearest neighbors of the same sign (data not shown): the hypothesis RAN has to be rejected for the same reason, except for NM1 and BN3. Let us next compare the distribution of distances with the predictions of BFN. BFN predicts a distribution shifted by approximately 0.15Δ to the right of the experimental distribution; that is, singularities are farther apart
on average. A χ² test rejects the hypothesis BFN for all data sets (α = 0.05) except BN1. For the case of singularities of the same sign (data not shown), BFN predicts a distribution that is closer to the experimental findings in the sense that the peak location is shifted only slightly. A χ² test still rejects the hypothesis BFN for most data sets (α = 0.05), except for NM2, BN1, BN3, and SM1. Again, there were no obvious trends between different species or between animals of different ages. In summary, we are left with the puzzling fact that while singularities of opposite sign attract each other, singularities on the whole weakly repel each other compared to a random distribution. The fact that the distribution of distances does not quite fit the BFN predictions should not be taken too seriously, however, because methodological artifacts tend to shift the distribution toward RAN (see section 5.1).

4.5 Local Patterns of Singularities. Figures 5 and 6 show an evaluation of local patterns of singularities using the procedure described in section 2.4. The average spectra T(t) are shown as a function of t = (t_s, t_r), where the rotation axis t_r is plotted to the right and the scale axis t_s is plotted to the top. The oscillation with period 2 along the rotation axis reflects the fact that the patterns S_j are real and their Fourier transforms Ŝ_j are conjugate complex. Any consistent local arrangement of singularities should give rise to additional structure in the spectra. The spectra of the data sets FS1, BN4, SM1, and BFN are compared with structures that arise when singularities are arranged locally in patterns with four- and sixfold symmetry (bottom row of Figure 5). Visual inspection shows additional structure in the case of the artificial patterns with six- and fourfold symmetry, which is lacking in the experimental data and in the predictions of BFN. In order to demonstrate the effect of rotational symmetry more clearly, the values of T(t) are summed along the scaling axis t_s in Figure 6. Only every second value is shown in order to suppress the oscillation of period 2. Although there are strong oscillations in the case of the "square" and "triangle" patterns, no evidence for symmetries is found for BN4 or for the BFN model. Similar results were obtained for all data sets of Table 1. A similar analysis along the scaling axis did not provide evidence for symmetries either. Thus, no evidence for consistent local patterns of singularities was found in the data, which is consistent with the visual impression one gains from Figure 3. The same analysis for local patterns of same-sign singularities led to the same result (data not shown).

5 Discussion

5.1 Measurement Errors and Possible Artifacts. The reliability of optical imaging data has been addressed in several publications (e.g., Blasdel
Figure 5: Doubly transformed amplitude spectra T(t) of local patterns of singularities. Each pixel corresponds to one mode, and its gray value indicates its amplitude. The origins are in the center of the squares. The patterns are averages of 200 local regions of radius 500 µm (FS1, BN4), 100 local regions of radius 390 µm (SM1), 500 small square and hexagonal gratings in different positions, orientations, and sizes, and 500 local regions of radius 20 pixels for BFN. Each region typically contained 5 to 10 singularities. The spectra are robust with respect to the number of singularities and the size of the local regions chosen for analysis.
1992a, 1992b; Blasdel & Salama 1986; Bonhoeffer & Grinvald 1996), to which we refer the reader for details. Multiple optical recordings from the same area of the cortex, for example, lead to similar spatial maps of neuronal responses (Bonhoeffer & Grinvald 1996). The hypothesis that cells with different orientation preferences are actually arranged randomly across the cortex and that smooth maps are a filtering artifact must be rejected on the basis of electrode penetration studies (Blasdel & Salama 1986) and would also be inconsistent with earlier data (e.g., Hubel & Wiesel 1974). Another piece of evidence against artifacts is the different wavelengths of the orientation maps from different animals, which indicate biological variability rather than methodological artifacts. The sign of a singularity is a nonlocal quantity, which is robust against noise and can be measured with high reliability from the data. This is no longer true for locations, because singularities reside in areas of the map where the optical signal is weak and measurements are affected by noise.
Figure 6: Top row: Doubly transformed amplitude spectra T(t) of local patterns of singularities. The corresponding spectra of Figure 5 were summed along the scaling axis, and the summed amplitudes were plotted against the rotation axis. Only every second value is shown. Strong oscillations are present in the case of four- and sixfold rotational symmetry ("square" and "triangle"). Center: Templates for the square and triangular patterns. The actual patterns were generated by random translation, rotation, and scaling of the templates.
Under the assumption that positional errors are biased for neither sign nor direction in the cortex, errors in location measurements tend to randomize the sign statistics and the distribution of distances. Thus, possible positional errors cannot invalidate correlations that deviate from a random distribution. The spatial averaging inherent in optical recordings, however, may in principle introduce artifacts, because it has the effect that singularities of opposite sign attract each other while singularities of the same sign repel each other. In order to explain the correlations of Table 3, however, averaging artifacts would have to be of a size that would lead to completely deformed patterns. This would be in contradiction to the good match between optical recordings and microelectrode data. Hence, we conclude that the measured correlations in sign and location are significant. Positional errors may explain why the distribution of distances between nearest neighbors deviates from the BFN predictions. If gaussian positional noise with a standard deviation of only 0.25Δ (which corresponds to 100 µm
in real data) is added to the locations of the singularities of the BFN map, the peaks of the nearest- and next-nearest-neighbor distributions shift to lower values and coincide with the experimental data. Consequently, one should not regard the differences between BFN and data as significant, given the precision of the location measurement.

Errors in location measurements tend to randomize local patterns but cannot fully explain the failure to find local geometric patterns of singularities. As long as positional errors are well below the average distance between singularities, diagrams similar to Figure 6 would still show oscillations, albeit with smaller amplitude. For larger positional errors, the signature of local patterns would be lost, but errors of this size would again be in contradiction with the good match between optical recording and electrode penetration studies.

5.2 Comparison Between Models and Data, and Biological Significance. The analysis of imaging data has shown that the spatial structure of orientation preference maps lies somewhere between the predictions of the BFN model and those of the FAM. The FAM predicts a pattern that is too smooth and is based on the incorrect assumption that all singularities are isotropic. Consequently, it is a good model only in regions where neither fractures nor anisotropic singularities occur; those regions are typically below 1 mm in size. Long-range order, which has been found in maps from cat areas 17 and 18 (Wolf et al. 1994b), could not be detected in our primate data. BFN maps do not obey the optimal smoothness criterion, and FAM reconstructions of BFN maps fail. From the fact that reconstruction errors are larger for BFN than for experimental maps, it can be concluded that BFN models fail to describe the orientation preference map fully at the local level. BFN models predict a higher number of fractures than are found in the data. BFN maps, however, are consistent with the observed pattern of singularities.

How do sign correlations emerge? Singularities are located at the intersections of the zero crossings of the real and imaginary parts of the orientation map O. Singularities that are neighbors along a particular zero crossing of either the real or the imaginary part of O must have opposite signs. The sign correlations thus emerge because nearest neighbors usually lie on the same zero crossing. As a result of the sign statistics, BFN models also exhibit long-range correlations: given the zeros of the imaginary and the real parts of O, the sign of any single singularity automatically determines the signs of all other singularities (Freund & Shvartsman 1994).

At the moment we can only speculate about the biological significance of the sign statistics and of the distribution of distances between nearest neighbors. There are at least two explanations. Sign correlations and the deviations from a random distribution are predicted by BFN and are therefore also a feature of simple Hebbian-type learning. But another explanation is also possible. Orientation maps in macaques show a fairly stable wavelength
λ_OR at a time when the visual cortex is still growing (Blasdel et al. 1995). A natural explanation of this phenomenon could be that during growth, singularities are born in opposite-sign pairs that subsequently drift away from each other. The fact that singularities are born in pairs could explain the sign correlations, and the fact that singularities drift apart could explain the apparent repulsion.

Note that no trends have been observed between animals of different ages, which is in accordance with previous findings (Blasdel et al. 1995; Wiesel & Hubel 1974). More interesting, no trends have been observed between macaques, which show ocular dominance bands, and squirrel monkeys, which do not. This is surprising for two reasons. First, the two-dimensional power spectra of orientation maps from squirrel monkeys are strongly anisotropic, while power spectra of macaque data are not. Second, singularities in macaques show correlations with ocular dominance bands (Blasdel 1992b; Obermayer & Blasdel 1993), and one would have expected differences in their distribution across the cortex when compared with maps from squirrel monkeys, in which this constraint is not present. That no differences are present could result from the fact that correlations between singularities and ocular dominance bands are too weak to be detected by the measures used in this study.

Acknowledgments

We are grateful to two anonymous reviewers for their helpful comments. This research was funded in part by the Human Frontier Science Program Organization (grant RG94-98) and the Deutsche Forschungsgemeinschaft (grant OB 102/2-1). We thank V. Schwanenhorst for implementing some of the C programs for local pattern analysis.

References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.
Blasdel, G. G. (1992a). Differential imaging of ocular dominance and orientation selectivity in monkey striate cortex. J. Neurosci., 12, 3117–3138.
Blasdel, G. G. (1992b). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12, 3139–3161.
Blasdel, G. G., & Salama, G. (1986). Voltage sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Blasdel, G. G., Obermayer, K., & Kiorpes, L. (1995). Organization of ocular dominance and orientation columns in the striate cortex of neonatal macaque monkeys. Vis. Neurosci., 12, 589–603.
Bonhoeffer, T., & Grinvald, A. (1991). Orientation columns in cat are organized in pinwheel-like patterns. Nature, 353, 429–431.
Bonhoeffer, T., & Grinvald, A. (1996). Optical imaging based on intrinsic signals: The methodology. In A. W. Toga & J. C. Mazziotta (Eds.), Brain mapping: The methods (pp. 55–97). San Diego: Academic Press.
Cowan, J., & Friedman, A. (1991). Simple spin models for the development of ocular dominance columns and iso-orientation patches. In R. Lippman, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 26–31). San Mateo, CA: Morgan Kaufmann.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comp., 7, 425–468.
Freund, I. (1994). Optical vortices in Gaussian random fields: Statistical probability densities. J. Opt. Soc. Am. A, 11, 1644–1652.
Freund, I., & Shvartsman, N. (1994). Wave-field phase singularities: The sign principle. Phys. Rev. A, 50, 5164–5172.
Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. USA, 83, 8779–8783.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangements of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Comp., 6, 602–614.
Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45, 7568–7589.
Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. USA, 87, 8345–8349.
Rojer, A. S., & Schwartz, E. L. (1990). Cat and monkey cortical columnar patterns modeled by bandpass-filtered 2D white noise. Biol. Cybern., 62, 381–391.
Schwartz, E. L. (1994). Computational studies of the spatial architecture of primate visual cortex. In A. Peters & K. S. Rockland (Eds.), Cerebral cortex (Vol. 10). New York: Plenum Press.
Shvartsman, N., & Freund, I. (1994). Vortices in random wave fields: Nearest neighbor anticorrelations. Phys. Rev. Lett., 72, 1008–1011.
Swindale, N. V. (1982). A model for the formation of orientation columns. Proc. R. Soc. Lond. B, 215, 211–230.
Wiesel, T. N., & Hubel, D. H. (1974). Ordered arrangement of orientation columns in monkeys lacking visual experience. J. Comp. Neurol., 158, 307–318.
Wolf, F., Pawelzik, K., & Geisel, T. (1994a). Emergence of long-range order in maps of orientation preference. In M. Marinaro & P. G. Morasso (Eds.), Artificial neural networks I (pp. 38–41). New York: Elsevier Science Publishers.
Wolf, F., Pawelzik, K., Geisel, T., Kim, D. S., & Bonhoeffer, T. (1994b). Optimal smoothness of orientation preference maps. In F. H. Eeckman (Ed.), Computation in neurons and neural systems (pp. 97–101). Newton, MA: Kluwer Academic Publishers.
Received July 25, 1995; accepted June 24, 1996.
Communicated by Ken Miller
Topographic Receptive Fields and Patterned Lateral Interaction in a Self-Organizing Model of the Primary Visual Cortex

Joseph Sirosh
HNC Software Inc., 5930 Cornerstone Court West, San Diego, CA 92121 USA

Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712 USA
This article presents a self-organizing neural network model for the simultaneous and cooperative development of topographic receptive fields and lateral interactions in cortical maps. Both afferent and lateral connections adapt by the same Hebbian mechanism in a purely local and unsupervised learning process. Afferent input weights of each neuron self-organize into hill-shaped profiles, receptive fields organize topographically across the network, and unique lateral interaction profiles develop for each neuron. The model demonstrates how patterned lateral connections develop based on correlated activity and explains why lateral connection patterns closely follow receptive field properties such as ocular dominance.

1 Introduction

The response properties of neurons in many sensory cortical areas are ordered topographically; that is, nearby neurons respond to nearby areas of the receptor surface. Such topographic maps form by the self-organization of afferent connections to the cortex, driven by external input (Hubel & Wiesel 1965; Miller et al. 1989; Stryker et al. 1988; von der Malsburg 1973). Several neural network models have demonstrated how global topographic order can emerge from local cooperative and competitive lateral interactions within the cortex (Amari 1980; Kohonen 1982, 1993; Miikkulainen 1991; Willshaw & von der Malsburg 1976). These models are based on predetermined lateral interactions and focus on explaining how the afferent synaptic weights are organized. A number of recent neurobiological experiments indicate that lateral connections self-organize like the afferent connections:

1. The lateral connectivity is not uniform or genetically predetermined
but forms during early development based on external input (Katz & Callaway 1992; Löwel & Singer 1992).

2. In the primary visual cortex, lateral connections are initially widespread and then develop into clustered patches. The clustering period overlaps substantially with the period during which ocular dominance and orientation columns form (Katz & Callaway 1992; Dalva & Katz 1994; Burkhalter et al. 1993).

3. Lateral connections primarily connect areas with similar response properties, such as columns with the same orientation or (in the strabismic case) the same eye preference (Gilbert 1992; Löwel & Singer 1992).

4. The lateral connections are far more numerous than the afferents and are believed to have a substantial influence on cortical activity (Gilbert et al. 1990).

To account fully for cortical self-organization, a cortical map model must demonstrate that both afferent and lateral connections can organize simultaneously, from the same external input, and in a mutually supportive manner. We have previously shown how Kohonen's self-organizing feature maps (Kohonen 1982) can be generalized to include self-organizing lateral connections (the laterally interconnected synergetically self-organizing map, LISSOM; Sirosh & Miikkulainen 1993, 1994). LISSOM is a low-dimensional abstraction of the cortical self-organizing process and models a small region of the cortex where all neurons receive the same input vector. In contrast, this article shows how realistic, high-dimensional receptive fields develop as part of the self-organization and in essence scales up the LISSOM approach to large areas of the cortex, where different parts of the cortical network receive inputs from different parts of the receptor surface. This receptive field LISSOM model shows how (1) topographically ordered receptive fields develop from simple retinal images, (2) lateral connections self-organize cooperatively and simultaneously with the afferents, (3) long-range lateral connections store correlations in activity across the topographic map, and (4) the resulting lateral connection patterns closely follow receptive field properties such as ocular dominance.

2 The Receptive Field LISSOM Model

The cortical network is modeled as a sheet of neurons interconnected by short-range excitatory lateral connections and long-range inhibitory lateral connections (see Figure 1). Neurons receive input from a receptive surface, or "retina," through the afferent connections. These connections come from overlapping patches on the retina called anatomical receptive fields (RFs). The patches are distributed with a given degree of randomness. The N × N network is projected onto the retina of R × R receptors, and each neuron is assigned a receptive field center (c₁, c₂) randomly within a radius ρ · R
Figure 1: Architecture of the self-organizing network. The lateral excitatory and lateral inhibitory connections of a single neuron in the network are shown, together with its afferent connections. The afferents form a local anatomical receptive field on the retina.
(ρ ∈ [0, 1]) of the neuron's projection. Through the afferent connections, the neuron receives input from receptors in a square area around the center with side s. Depending on its location, the number of afferents to a neuron can vary from (1/2)s × (1/2)s (at the corners) to s × s (at the center). The external and lateral weights are organized through an unsupervised learning process. At each training step, the neurons start out with zero activity. The initial response η_ij of neuron (i, j) is based on the scalar product
$$\eta_{ij} = \sigma\left(\sum_{r_1, r_2} \xi_{r_1 r_2}\, \mu_{ij, r_1 r_2}\right), \tag{2.1}$$
where ξ_{r₁,r₂} is the activation of a retinal receptor (r₁, r₂) within the receptive field of the neuron, µ_{ij,r₁r₂} is the corresponding afferent weight, and σ is a piecewise linear approximation of the familiar sigmoid activation function,
$$\sigma(x) = \begin{cases} 0 & x \le \delta,\\ (x - \delta)/(\beta - \delta) & \delta < x < \beta,\\ 1 & x \ge \beta, \end{cases} \tag{2.2}$$
where δ and β are the lower and upper thresholds. The response evolves over time through lateral interactions. At each time step, the neuron combines
retinal activation with lateral excitation and inhibition,
$$\eta_{ij}(t) = \sigma\left(\sum_{r_1, r_2} \xi_{r_1 r_2}\, \mu_{ij, r_1 r_2} + \gamma_e \sum_{k,l} E_{ij,kl}\, \eta_{kl}(t - \delta t) - \gamma_i \sum_{k,l} I_{ij,kl}\, \eta_{kl}(t - \delta t)\right), \tag{2.3}$$
where E_{ij,kl} is the excitatory lateral connection weight on the connection from neuron (k, l) to neuron (i, j), I_{ij,kl} is the inhibitory connection weight, and η_{kl}(t − δt) is the activity of neuron (k, l) during the previous time step. All connection weights are positive. The input activity stays constant while the neural activity settles. The scaling factors γ_e and γ_i determine the strength of the lateral excitatory and inhibitory interactions. The activity pattern starts out diffuse and spread over a substantial part of the map, but within a few iterations of equation 2.3 it converges into a stable, focused patch of activity, or activity bubble. After the activity has settled, the connection weights of each neuron are modified. Both afferent and lateral connection weights adapt according to the same mechanism, the Hebb rule, normalized so that the sum of the weights is constant:
$$w_{ij,mn}(t+1) = \frac{w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn}}{\sum_{mn}\bigl[w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn}\bigr]}, \tag{2.4}$$
where η_ij stands for the activity of neuron (i, j) in the settled activity bubble, w_{ij,mn} is the afferent or lateral connection weight (µ_{ij,r₁r₂}, E_{ij,kl}, or I_{ij,kl}), α is the learning rate for each type of connection (α_A for afferent weights, α_E for excitatory, and α_I for inhibitory), and X_mn is the presynaptic activity (ξ_{r₁,r₂} for afferent, η_kl for lateral). Afferent inputs, lateral excitatory inputs, and lateral inhibitory inputs are normalized separately. The larger the product of the pre- and postsynaptic activities η_ij X_mn, the larger the weight change. Therefore, both excitatory and inhibitory connections strengthen by correlated activity; normalization then redistributes the changes so that the sum of each weight type for each neuron remains constant.
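The following sketch (our own simplification: dense weight matrices instead of the local anatomical receptive fields, 10 settling iterations, and parameter values taken from the simulation described in Figure 2 below) implements one training step, that is, equations 2.1 through 2.4.

```python
import numpy as np

def sigma(x, delta=0.1, beta=0.65):
    """Piecewise linear sigmoid of equation 2.2."""
    return np.clip((x - delta) / (beta - delta), 0.0, 1.0)

def train_step(xi, A, E, I, gamma_e=0.9, gamma_i=0.9, alpha=0.002,
               settle_steps=10):
    """One LISSOM step. xi: flattened retinal activation; A, E, I:
    afferent, lateral excitatory, and lateral inhibitory weight
    matrices, one row per neuron (dense here for brevity)."""
    afferent = A @ xi
    eta = sigma(afferent)                         # initial response, eq. 2.1
    for _ in range(settle_steps):                 # settling, eq. 2.3
        eta = sigma(afferent + gamma_e * (E @ eta) - gamma_i * (I @ eta))
    # normalized Hebbian update, each weight type separately (eq. 2.4)
    for W, pre in ((A, xi), (E, eta), (I, eta)):
        W += alpha * np.outer(eta, pre)
        W /= W.sum(axis=1, keepdims=True)
    return eta
```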
3 Self-Organization

Although the self-organizing mechanism outlined above is robust, the width of the anatomical receptive fields and how ordered they are strongly affect the outcome of the process. In a series of simulations, the conditions under which self-organization could take place were studied, using networks with various receptive field widths and varying degrees of initial order. In each case, all synaptic weights were initially random: a uniformly distributed random value between zero and one was assigned to all weights, and the total weight of each connection type of each neuron was normalized to 1.0.¹ Gaussian spots of "light" on the retina were used as input. At each presentation, the activation ξ_{r₁,r₂} of the receptor (r₁, r₂) was given by
$$\xi_{r_1 r_2} = \sum_{i=1}^{n} \exp\left(-\frac{(r_1 - x_i)^2 + (r_2 - y_i)^2}{a^2}\right), \tag{3.1}$$
where n is the number of spots, a² specifies the width of the gaussian, and the spot centers (x_i, y_i), 0 ≤ x_i, y_i < R, were chosen randomly. When the networks were trained with single light spots (n = 1), similar afferent and lateral connection structures developed in all cases. With more realistic input consisting of multiple light spots (n > 1), the networks with wide anatomical receptive fields and relatively small topographic scatter self-organized just as robustly. Figures 2 through 4 illustrate the self-organization of such a network. The initial rough pattern of afferent weights of each neuron evolved into a hill-shaped profile (see Figure 2). The afferent weight profiles of different neurons peaked over different parts of the retina, and their centers of gravity (calculated in retinal coordinates) formed a topographic map (see Figure 3). Networks with large topographic scatter compared to the RF size, however, failed to develop global order with multispot inputs. It is interesting to analyze why. With single light spots, the afferent weights of an active neuron always change toward a single, local input pattern. Eventually these weights become concentrated around a local area in the RF such that the global distribution of the centers best approximates the input distribution. However, when multiple spots occur in the receptive field at the same time, the weights change toward several different locations. If the scatter is large and the neighboring receptive fields do not overlap substantially, the inputs in the nonintersecting areas change the weights the most, and the receptive fields remain crossed. If the scatter is small and the overlap high, the input activity in the intersecting region becomes strong enough to drive the weights toward topographic order. The model therefore suggests that for the self-organization of afferent connections in the cortex to be purely activity driven, the anatomical receptive fields must be large compared to the amount of initial topographic scatter among their centers.²

¹ Various schemes for initializing afferent connections have been studied by other researchers. These schemes include tagging connections with chemical markers (Willshaw & von der Malsburg 1979) and establishing an initial topographic bias (Willshaw & von der Malsburg 1976; Goodhill 1993). Our focus is on receptive field width versus initial order, because this factor has turned out to be most crucial in determining the success of the self-organizing process in the model presented here.

² Even with large scatter, it is still possible to establish topographic order with additional mechanisms such as chemical markers (Willshaw & von der Malsburg 1979) or special initialization schemes (Willshaw & von der Malsburg 1976). Also, computationally there is no restriction on how large the receptive fields can be. They can cover the whole retina, and global order will develop because such receptive fields overlap a lot (i.e., completely). Of course, anatomically such full connectivity would not be realistic.
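A short sketch (our own illustration; the function name is ours) of the multispot input of equation 3.1, with R = 21 receptors, n = 3 spots, and a = 2.0 as in the simulation of Figure 2:

```python
import numpy as np

def retina_input(R=21, n=3, a=2.0, rng=None):
    """Sum of n gaussian light spots of width a at random centers."""
    rng = rng or np.random.default_rng()
    r1, r2 = np.mgrid[0:R, 0:R].astype(float)
    xi = np.zeros((R, R))
    for _ in range(n):
        x_i, y_i = rng.uniform(0.0, R, 2)   # spot center, 0 <= x_i, y_i < R
        xi += np.exp(-((r1 - x_i) ** 2 + (r2 - y_i) ** 2) / a ** 2)
    return xi
```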
(a)
Figure 2: Self-organization of the afferent input weights. The afferent weights of five neurons (located at the center and at the four corners of the network) are superimposed on the retinal surface in this figure. The retina had 21 × 21 receptors, and the receptive field radius was chosen to be 8. Therefore, neurons could have anywhere from 8 × 8 to 17 × 17 afferents depending on their distance from the network boundary. (a) Initial random weights. The anatomical RF centers were slightly scattered around their topographically ordered positions (uniformly, within a distance of 0.5 in retinal coordinates), and the weights were initialized randomly (as discussed in the text). There are four concentrated areas of weights slightly displaced from the corners and one larger one in the middle. At the corners, the profiles are taller because the normalization divides the total afferent weight among a smaller number of connections. Continued.
The lateral connections evolve together with the afferents. By the normalized Hebbian rule (see equation 2.4), the lateral connection weights of each neuron are distributed according to how well the neuron’s activity correlates with the activities of the other neurons. As the afferent receptive fields organize into a uniform map (see Figure 3), these correlations fall ally there is no restriction on how large the receptive fields can be. They can cover the whole retina, and global order will develop because such receptive fields overlap a lot (i.e., completely). Of course, anatomically such full connectivity would not be realistic.
Figure 2: Continued. (b) Final organized receptive fields. As the self-organization progresses, the weights organize into smooth hill-shaped profiles. In this simulation, each input consisted of three randomly located gaussian spots with a = 2.0. The lateral interaction strengths were γe = γi = 0.9, the learning rates αA = αE = αI = 0.002, and the upper and lower thresholds of the sigmoid 0.65 and 0.1. The map was formed in 10,000 training presentations.
The lateral excitatory and inhibitory connections acquire the gaussian shape, and the combined lateral excitation and inhibition becomes an approximate difference of gaussians (or a "Mexican hat"; see Figure 4).

4 Ocular Dominance and Lateral Connection Patterns

The activity patterns and correlations in the retina are not as simple or random as in the above experiments. Also, the visual cortex has a complex organization of orientation and ocular dominance columns, and this organization is reflected in the patterns of lateral connections. As a first step to studying how afferent and lateral connections evolve in the visual cortex, the development of ocular dominance was simulated. A second retina was added to the model, and afferent connections were set up exactly as for the first retina (with local receptive fields and topographically ordered RF centers). Multiple gaussian spots were presented in each eye as input.
Figure 3: Self-organization of afferent receptive fields into a topographic map. The center of gravity of the afferent weight vector of each neuron in the 42 × 42 network is projected onto the receptive surface (represented by the square). Each center of gravity point is connected to those of the four immediately neighboring neurons by a line. The resulting grid depicts the topographical organization of the map. Initially, the anatomical RF centers were only slightly scattered topographically, but because the afferent weights were initially random, the centers of gravity are scattered much more (a). As the self-organization progresses, the map unfolds and the weight vectors spread out to form a regular topographic map of the receptive surface, as shown in (b). Because the anatomical receptive fields are large in this example compared to the initial topographic scatter, the self-organization is robust with both single-spot and multispot inputs.
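A minimal sketch of how the map in Figure 3 can be computed from the weights: each neuron's receptive field center is the center of gravity of its afferent weights in retinal coordinates. The array layout (`weights[i, j]` holding the afferent weights of neuron (i, j), with `coords[i, j]` the matching retinal positions) is a hypothetical convention, not the paper's data structure.

```python
import numpy as np

def center_of_gravity_map(weights, coords):
    # weights[i, j]: afferent weight vector of neuron (i, j), shape (n_aff,)
    # coords[i, j]:  retinal (x, y) position of each afferent, shape (n_aff, 2)
    n1, n2 = weights.shape[:2]
    centers = np.zeros((n1, n2, 2))
    for i in range(n1):
        for j in range(n2):
            w = weights[i, j]
            centers[i, j] = (w[:, None] * coords[i, j]).sum(axis=0) / w.sum()
    return centers  # connect neighboring centers with lines to draw the grid
```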
Because of cooperation and competition between inputs from the two eyes, groups of neurons developed strong afferent connections to one eye or the other, resulting in patterns of ocular dominance in the network (cf. von der Malsburg 1990; Miller et al. 1989). The self-organization of the network was studied with varying between-eye correlations. At each input presentation, one spot is randomly placed at (x_i, y_i) in the left retina, and a second spot within a radius of c × R of (x_i, y_i) in the right retina. The parameter c ∈ [0, 1] specifies the spatial correlations between spots in the two retinas and can be adjusted to simulate different degrees of correlations between images in the two eyes. Multispot images can be generated by repeating the above step; the simulations that follow used two-spot images in each eye. Two different simulations are illustrated. In the first, there were no between-eye correlations in the input (c = 1), thus simulating strabismus, where prominent ocular dominance patterns are known to develop. In the second case, there were positive between-eye correlations (c < 1), simulating the case of normal binocular vision.
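The placement scheme can be sketched as follows. Drawing the right-eye spot uniformly from the disk of radius c × R and clipping it to the retina are assumptions about sampling details the text leaves open.

```python
import numpy as np

def two_eye_spot_centers(R=24, c=0.4, rng=None):
    # One spot center is drawn uniformly for the left retina; the right-eye
    # spot is placed uniformly within a disk of radius c * R around it and
    # clipped to the retina. c = 1: uncorrelated eyes (strabismus); small c:
    # closely matched images in the two eyes.
    if rng is None:
        rng = np.random.default_rng(0)
    left = rng.uniform(0.0, R, size=2)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    radius = c * R * np.sqrt(rng.uniform())  # uniform density over the disk
    right = left + radius * np.array([np.cos(angle), np.sin(angle)])
    return left, np.clip(right, 0.0, R)
```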
Figure 4: Self-organization of the lateral interaction. The lateral interaction profile for a neuron at position (20, 20) in the 42 × 42 network is plotted. The excitation and inhibition weights are initially randomly distributed within radii 3 and 18. The combined interaction, the sum of the excitatory and inhibitory weights, illustrates the total effect of the lateral connections. The sums of excitation and inhibition were chosen to be equal, but because there are fewer excitatory connections, the interaction has the shape of a rough plateau with a central peak (a). Continued.
Figure 5 illustrates the response properties and lateral connectivity in the strabismic case. The connection patterns closely follow ocular dominance organization. As neurons become better tuned to one eye or the other, activity correlations between regions tuned to the same eye become stronger, and correlations between opposite eye areas become weaker. As a result, monocular neurons develop strong lateral connections to regions with the same eye preference and weak connections to regions of opposite eye preference. Most neurons are monocular, but a few binocular neurons occur at the boundaries of ocular dominance regions. Because they are equally tuned to the two eyes, the binocular neurons have activity correlations with both ocular dominance regions, and their lateral connection weights are distributed more or less equally between them.
Figure 4: Continued. During self-organization, smooth patterns of excitatory and inhibitory weights evolve, resulting in a smooth Mexican hat–shaped lateral interaction profile (b).
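The combined interaction described above can be illustrated with a short sketch: a narrow gaussian of excitation minus a broad gaussian of inhibition, each component normalized to equal total weight as in the model, so the narrower excitation dominates at the center. The widths used are illustrative values, not parameters from the paper.

```python
import numpy as np

def mexican_hat(radius=20, sigma_e=1.5, sigma_i=6.0, gamma_e=0.9, gamma_i=0.9):
    # Difference of gaussians: because excitation is narrower but carries the
    # same total weight, the profile is positive at the center and negative
    # at mid-range (a "Mexican hat").
    d = np.arange(-radius, radius + 1, dtype=float)
    exc = np.exp(-d ** 2 / (2 * sigma_e ** 2))
    inh = np.exp(-d ** 2 / (2 * sigma_i ** 2))
    return gamma_e * exc / exc.sum() - gamma_i * inh / inh.sum()

profile = mexican_hat()
assert profile[len(profile) // 2] > 0 > profile.min()  # peak center, negative surround
```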
Figure 6 shows the ocular dominance and lateral connection patterns for the normal case. The ocular dominance stripes are narrower, and there are more ocular dominance columns in the network. Most neurons are neither purely monocular nor purely binocular, and few cells have extreme values of ocular dominance. Accordingly, the lateral connectivity in the network is only partially determined by ocular dominance. However, the lateral connections of the few strongly monocular neurons follow the ocular dominance patterns as in the strabismic case. In both cases, the spacing between the lateral connection clusters matches the stripe width. The patterns of lateral connections and ocular dominance outlined above closely match observations in the primary visual cortex. Löwel and Singer (1992) observed that when between-eye correlations were abolished in kittens by surgically induced strabismus, long-range lateral connections primarily linked areas of the same ocular dominance. However, binocular neurons, located between ocular dominance columns, retained connections to both eye regions. The ocular dominance stripes in the strabismics were broad and sharply defined (Löwel 1994).
Figure 5: Ocular dominance and patterned long-range lateral connections in the strabismic case. Each neuron is colored with a gray-scale value (black → white) that represents continuously changing eye preference from exclusive left, through binocular, to exclusive right. Most neurons are monocular, so white and black predominate. Small white dots indicate the strongest lateral input connections to the neuron marked with a big white dot. Only the long-range inhibitory connections are shown. The excitatory connections link each neuron only to itself and its eight nearest neighbors. (a) The lateral connections of a left monocular neuron predominantly link areas of the same ocular dominance. (b) The lateral connections of a binocular neuron come from both eye regions. In this simulation, the parameters were: network size, N = 64; retinal size, R = 24; afferent field size, s = 9; δ = 0.1; β = 0.65; spot width, a = 5.0; excitation radius, d = 1; inhibition radius = 31; scaling factors, γe = 0.5, γi = 0.9; learning rates, αA = αE = αI = 0.002; number of training iterations = 35,000. The anatomical RF centers were slightly scattered around their topographically ordered positions (radius of scatter = 0.5, as in Figure 2), and all connections were initialized to random weights.
In contrast, ocular dominance stripes in normal animals were narrow and less sharply defined, and lateral connection patterns overall were not significantly influenced by ocular dominance. The receptive field model reproduces these experimental results and also predicts that the lateral connections of strongly monocular neurons would follow ocular dominance even in the normal case. The model therefore confirms that patterned lateral connections develop based on correlated neuronal activity and demonstrates that they can self-organize cooperatively with ocular dominance columns.
Figure 6: Ocular dominance and patterned long-range lateral connections in the normal case. The simulation was otherwise identical to that of Figure 5, except that between-eye correlations were greater than zero (c = 0.4). The stripes are narrower, and most neurons have intermediate values of ocular dominance (colored gray). Their lateral connections are only partially influenced by ocular dominance, as in (a). However, as shown in (b), the lateral connections of the most strongly monocular neurons reflect the ocular dominance organization as in the strabismic case. If the between-eye correlations are increased to one (perfectly matched spots in the two retinas), ocular dominance properties will disappear, and simple topographic maps and Mexican hat lateral interaction will result, as in Figures 2 through 4.
5 Computational Predictions of the Model

RF-LISSOM is a computational model of known physiological phenomena, and it can be used to identify computational constraints that must be met for these phenomena to occur. Below, the adaptation and extent of lateral excitation and inhibition, and correlation versus lateral excitation, are discussed as possible such constraints. Other computational details are included in the appendix.

5.1 Adapting Lateral Excitation and Inhibition. For proper topographic order to develop in our model, the initial lateral excitatory radius must be large enough that the network produces a single localized spot of activity for each input spot. Otherwise spurious correlations are introduced into the self-organizing process, and the map is likely to develop twists. A good general guideline is that the initial radius of excitation should be comparable to the range of activity correlations in the network.
However, the excitatory radius need not be fixed. As in the self-organizing map algorithm (Kohonen 1982), the radius may start out large and gradually decrease until it covers the nearest neighbors only. Decreasing the radius in this manner makes the receptive fields gradually narrower and results in finer topographic order. This is a very robust effect and has been observed in numerous experiments. Even at a fixed excitatory radius, the lateral weights adapt, with an effect similar to that of adapting the excitatory radius. If the input correlations are short range, the lateral excitatory profile will become more peaked, and the inhibition will become stronger in the vicinity of each neuron (see Figure 4). Therefore, the activity patterns in the network will become smaller and more focused, as happens when the excitatory radius decreases. As a result, the receptive fields will become narrower. For example, small gaussian input blobs will produce both short-range activity correlations in the network and smaller receptive fields. The range of lateral inhibition is not a crucial parameter as long as it is greater than approximately twice the excitatory radius. If the range is less, activity patterns often split up into several separate bubbles, and distorted maps form. The weak long-range connections may be repeatedly pruned away during self-organization, as in the LISSOM model (Sirosh & Miikkulainen 1994). By weight normalization, such pruning will increase the strength of the surviving connections, thereby increasing the inhibition between the correlated areas. As a result, the activity patterns will become better focused, and more specific maps will develop.

5.2 Long-range Inhibition versus Excitation. For the self-organization of receptive fields to occur, our model requires that long-range lateral interactions be inhibitory. This is an important prediction of the model. The inhibitory interactions prevent the neural activity from spreading and saturating. As the lateral connections adapt, it is necessary that the lateral inhibition be strengthened by correlated activity. If it were weakened instead, inhibition would concentrate on those units that are the least active, in effect self-organizing itself out of the system. The activity patterns would then spread out and saturate, and highly distorted maps would result. Anatomical studies in the visual cortex show that long-range horizontal connections have mostly excitatory synapses and link primarily to other excitatory neurons (Kisvarday & Eysel 1992; Hirsch & Gilbert 1991; Gilbert et al. 1990). The crucial question is whether afferent stimulation produces mainly excitatory or inhibitory lateral responses at long range. Partial experimental support for long-range inhibition comes from optical imaging studies, which indicate that substantial lateral inhibition exists even at distances of up to 6 mm in the primary visual cortex (Grinvald et al. 1994). This range coincides well with the extent of long-range horizontal connections. Other studies, such as that of Nelson and Frost (1978), have also reported long-range inhibitory interactions without clearly identifying the anatomical substrate.
Due to insufficient evidence, however, the question of whether the primary effect is inhibitory or excitatory remains controversial. Until this issue is resolved and the underlying cortical circuitry is delineated, it will be difficult to determine if and how lateral inhibition in the cortex strengthens by correlated activity, as predicted by the model.

5.3 Factors Affecting the Development of Ocular Dominance. Four main factors influence how ocular dominance columns form: (1) the network size N, (2) the retina size R, (3) the excitation radius d, and (4) the between-eye correlation parameter c. For given values of the network parameters N, R, and d, the degree of between-eye correlation determines whether ocular dominance columns form. In simulations with small values of c (closely matched images in both eyes), only topographic maps develop. As c is gradually increased from zero, a sharp transition from a simple topographic map to an ocular dominance map occurs at a threshold ct ∈ (0, 1), and stable ocular dominance patterns appear. For example, in the simulations illustrated in Figures 5 and 6, N = 64, R = 24, and d = 1, and the threshold at which ocular dominance columns first appear is approximately ct = 0.25. As the network parameters change, ct appears to change in a fairly simple manner. The threshold is approximately the same for all networks for which the ratio dR/N, or the excitatory radius in retinal units, is the same. As this ratio increases (as when the excitation radius increases or the network becomes smaller), the relative size of activity bubbles on the network becomes larger, increasing correlation and causing the transition to ocular dominance to occur at higher levels of c. This is analogous to the self-organizing maps (Kohonen 1982), where an input dimension (i.e., ocular dominance) becomes represented on the map only when the variance in that dimension (c) grows large enough compared to the neighborhood radius (dR/N). Whether the lateral connections adapt is not important for the development of ocular dominance. Well-defined stripes can form even if the lateral weights are fixed, although such stripes are broader than those developed with self-organizing lateral connections. Similarly, ocular dominance columns form under a variety of boundary conditions. With self-organized lateral connections, they form at least with sharp and periodic retinal and network boundaries and with smaller or same-size anatomical RFs at the boundary, although the patterns of the stripes may be qualitatively different in each case.

6 Future Work

Perceptual grouping rules such as continuity of contours, proximity, and smoothness appear as long-range activity correlations in the cortex. Von der Malsburg and Singer (1988) and Singer et al. (1990) have suggested that lateral connections store these correlations and perform perceptual grouping and segmentation by synchronizing cortical activity. The receptive field LISSOM model could be extended to test this hypothesis computationally.
So far, we have demonstrated how the inhibitory lateral connections of LISSOM learn to store long-range activity correlations. On the other hand, models such as that of von der Malsburg and Buhmann (1992) show that neuronal firing can be synchronized quite effectively by inhibitory lateral interactions. These two results could be brought together into a LISSOM architecture using neurons with realistic firing properties and used to study segmentation and binding phenomena. In future research, we intend to work on self-organization in networks with firing neurons, study the knowledge extracted by lateral connections from realistic input, and determine how this knowledge could be used in feature grouping and segmentation.

7 Conclusion

The receptive field LISSOM model demonstrates that cortical self-organization can be based on short-range excitation, long-range inhibition, and Hebbian weight adaptation. Starting from random-strength connections, neurons develop localized receptive fields and lateral interaction profiles cooperatively and simultaneously. Self-organization takes place with multiple retinal inputs and with multiple anatomical receptive fields of varying sizes, but requires that the receptive fields be large compared to the degree of scatter among their centers. With inputs coming from two retinas at the same time, ocular dominance stripes develop, and lateral connections primarily connect areas of the same ocular dominance. When input correlations between the eyes increase, the ocular dominance stripes become narrower, and the influence of ocular dominance on the lateral connection patterns decreases. The model therefore suggests that lateral connections have a dual role in the cortex: to support self-organization of receptive fields and to learn long-range feature correlations, which may serve as a basis for perceptual grouping and segmentation.

Appendix: Simulation Parameters

The main parameters that influence self-organization are the sigmoid activation function and the strengths of the recurrent lateral excitation and inhibition, γe and γi. The activation function (see equation 2.2) affects the self-organizing process in two ways. First, the lower threshold δ determines the minimum input required for a neuron to participate in weight modification. If the weighted sum is below δ, the output activity ηij = 0, and the weights will not be modified. Only input vectors close enough to the neuron's weight vector cause it to adapt; the higher the δ, the smaller the area of the input space stimulating the neuron. Second, for a given δ, the upper threshold β determines the gain of the neuron activation. If β is close to δ, the neuron is more nonlinear, and small differences in its input can produce large changes in its output.
When β is lower, the map amplifies smaller differences in the activity pattern. Therefore, the selectivity of receptive fields can be controlled by adjusting δ and β. Good maps are usually obtained for values of δ from 0.0 to 0.2 and β from 0.55 to 0.8. The actual parameters used in each simulation are shown in the figure captions. For a given sigmoid, if the lateral excitation is too strong (high value of γe), saturated activity spots form, and distorted maps result. If γe is too small or γi is too large, the activity spots do not become sufficiently focused, and topographic maps do not develop. When the sigmoid slope is in the range 1.0 to 2.0, the excitation radius is d = 4, and the inhibition radius is greater than 2d, the operating ranges are typically 0.8 ≤ γe ≤ 1.1 and 0.7 ≤ γi ≤ 1.4. If the excitation radius is decreased, individual weights will grow stronger because the total lateral excitation weight of each neuron is normalized to 1.0. As a result, the excitation may become too strong. To offset this problem, the value of γe will have to decrease with the excitation radius. At the excitation radius of d = 1, the operating range of γe was 0.4 ≤ γe ≤ 0.65.

Acknowledgments

This research was supported in part by the National Science Foundation under grant IRI-9309273. Computing time for the simulations was provided by the Pittsburgh Supercomputing Center under grants IRI930005P and TRA940029P.
References

Amari, S.-I. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Burkhalter, A., Bernardo, K. L., & Charles, V. (1993). Development of local circuits in human visual cortex. Journal of Neuroscience, 13, 1916–1931.
Dalva, M. B., & Katz, L. C. (1994). Rearrangements of synaptic connections in visual cortex revealed by laser photostimulation. Science, 265, 255–258.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9, 1–13.
Gilbert, C. D., Hirsch, J. A., & Wiesel, T. N. (1990). Lateral interactions in visual cortex. In Cold Spring Harbor Symposia on Quantitative Biology, 55, 663–677. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Goodhill, G. (1993). Topography and ocular dominance: A model exploring positive correlations. Biological Cybernetics, 69, 109–118.
Grinvald, A., Lieke, E. E., Frostig, R. D., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. Journal of Neuroscience, 14, 2545–2568.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. Journal of Neuroscience, 11, 1800–1809.
Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. Journal of Neurophysiology, 28, 229–289.
Katz, L. C., & Callaway, E. M. (1992). Development of local circuits in mammalian visual cortex. Annual Review of Neuroscience, 15, 31–56.
Kisvarday, Z. F., & Eysel, U. T. (1992). Cellular organization of reciprocal patchy networks in layer III of cat visual cortex (area 17). Neuroscience, 46, 275–286.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6, 895–905.
Löwel, S. (1994). Ocular dominance column development: Strabismus changes the spacing of adjacent columns in cat visual cortex. Journal of Neuroscience, 14(12), 7451–7468.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Miikkulainen, R. (1991). Self-organizing process based on lateral inhibition and synaptic resource redistribution. In Proceedings of the International Conference on Artificial Neural Networks (Espoo, Finland) (pp. 415–420). Amsterdam and New York: North-Holland.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Nelson, J., & Frost, B. (1978). Orientation-selective inhibition from beyond the classical visual receptive field. Brain Research, 139, 359–365.
Singer, W., Gray, C., Engel, A., König, P., Artola, A., & Bröcher, S. (1990). Formation of cortical cell assemblies. In Cold Spring Harbor Symposia on Quantitative Biology, 55, 939–952. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
Sirosh, J., & Miikkulainen, R. (1993). How lateral interaction develops in a self-organizing feature map. In Proceedings of the IEEE International Conference on Neural Networks (San Francisco, CA) (pp. 1360–1365). Piscataway, NJ: IEEE.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71(1), 66–78.
Stryker, M., Allman, J., Blakemore, C., Gruel, J., Kaas, J., Merzenich, M., Rakic, P., Singer, W., Stent, G., van der Loos, H., & Wiesel, T. (1988). Group report: Principles of cortical self-organization. In P. Rakic & W. Singer (Eds.), Neurobiology of neocortex (pp. 115–136). New York: Wiley.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100.
von der Malsburg, C. (1990). Network self-organization. In S. F. Zornetzer, J. L. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp. 421–432). New York: Academic Press.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242.
von der Malsburg, C., & Singer, W. (1988). Principles of cortical network organization. In P. Rakic & W. Singer (Eds.), Neurobiology of neocortex (pp. 69–99). New York: Wiley.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431–445.
Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society of London B, 287, 203–243.
Received January 10, 1994; accepted May 2, 1996.
Communicated by Ralph Linsker
The Formation of Topographic Maps That Maximize the Average Mutual Information of the Output Responses to Noiseless Input Signals

Marc M. Van Hulle
K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Leuven, Belgium
This article introduces an extremely simple and local learning rule for topographic map formation. The rule, called the maximum entropy learning rule (MER), maximizes the unconditional entropy of the map’s output for any type of input distribution. The aim of this article is to show that MER is a viable strategy for building topographic maps that maximize the average mutual information of the output responses to noiseless input signals when only input noise and noise-added input signals are available.
1 Introduction

Ralph Linsker was among the first to introduce information-theoretic principles into the field of self-organizing neural networks. He devised a global learning rule for topographic map formation by maximizing the average mutual information (Shannon information rate) between the output and the signal part of the input, which is corrupted by noise (Linsker 1989). He argued that mutual information maximization is an optimal way to extract statistically salient input features. Others have devised learning rules from computational principles that are different from yet related to the former: (1) principal component analysis (Földiák 1989; Rubner & Schulten 1990; Sanger 1989), and its nonlinear extension, which is modeled by Kohonen's self-organizing (feature) map (SOM) algorithm (Ritter et al. 1992), as a way to account for most of the input's variance but not for the noise, and (2) smoothing and predictive filtering (Atick & Redlich 1991), for which the limiting case approximates mutual information maximization. Linsker also introduced a two-stage, local learning rule for maximizing the average mutual information when the input distribution is multivariate gaussian (Linsker 1992). The aim of this article is to introduce a simple learning strategy for building topographic maps that maximize the average mutual information of the map outputs, in response to noiseless inputs, by using input noise and noise-added input signals only. The application envisaged is density estimation.
2 Maximum Entropy Learning Rule

The aim of topographic map formation is to develop, in an unsupervised way, a mapping from a higher d-dimensional space V of input signals onto an equal- or lower-dimensional discrete lattice A of N formal neurons. To each formal neuron i ∈ A corresponds a unique weight vector wi = [wi1, . . . , wid] defined in V-space. As a result of this mapping, the formal neurons quantize the input space V, with probability density function (p.d.f.) p(v), into a discrete number of partition cells or quantization regions. The definition of a quantization region is, in this case, different from the one used in a Voronoi (Dirichlet) tessellation. Assume that V and A have the same dimensionality d and that A comprises a periodic and regular d-dimensional lattice with a rectangular topology. Since the topology is rectangular, the lattice consists of a number of d-dimensional quadrilaterals (see Figure 1A). For example, in Figure 1B, the quadrilateral labeled He is defined by the neurons i, j, k, and m: the weights of neurons i, j, k, and m delimit a region in V-space. Hence, each quadrilateral represents a "granular" quantization region in V-space. We also consider the "imaginary" quadrilaterals facing the outer border of the lattice that represent the "overload" quantization regions (see Figure 1B). To each of the Q quadrilaterals of A, we associate a code membership function:

$$\mathbb{1}_{H_j}(v) = \begin{cases} \dfrac{2^d}{n_{H_j}} & \text{if } v \in H_j \\ 0 & \text{if } v \notin H_j, \end{cases} \qquad (2.1)$$
with $n_{H_j}$, $1 \le n_{H_j} \le 2^d$, the number of vertices of $H_j$. We assume that p(v) is stationary and ergodic and that the probability that v falls on one of the links equals zero: hence, a single input will activate two or more quadrilaterals only when these quadrilaterals overlap. (Note that several quadrilaterals may be active at the same time when the lattice is tangled.) Define Si as the set of $2^d$ quadrilaterals that have neuron i as a common vertex (e.g., for neuron j in Figure 1B, Sj comprises Hh, Hi, He, Hf). Neuron i is said to be active if at least one of the quadrilaterals of the set Si is active (i.e., with nonzero code membership function output). The d-dimensional maximum entropy learning rule (MER) is defined as

$$\Delta w_i = \eta \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v)\, \mathrm{Sgn}(v - w_i), \qquad \forall i \in A, \qquad (2.2)$$

with η the learning rate (a positive constant) and Sgn(·) the sign function acting componentwise. The set Si guarantees that only the vertices of the active quadrilaterals are updated. In appendix A, we show that the average of equation 2.2 converges since it performs stochastic gradient descent on a positive definite cost function E (i.e., a Lyapunov function).
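As a concrete illustration, here is a minimal sketch of MER in one dimension, where equation 2.2 reduces to the update of equation B.1 in appendix B: the active interval pulls its two delimiting weights toward the input. The initialization and the sampled distribution are illustrative assumptions.

```python
import numpy as np

def mer_1d(samples, N=17, eta=1e-4, w_init=None):
    # One-dimensional MER (equation 2.2). Overload intervals beyond w_1 and
    # w_N have a single vertex, so their membership value is 2^d / n_H = 2
    # (equation 2.1). Weights are assumed to stay ordered (kinks are
    # unstable; see Proposition 4 in appendix A).
    if w_init is None:
        w = np.sort(np.random.default_rng(0).uniform(-1.0, 1.0, N))
    else:
        w = np.asarray(w_init, dtype=float).copy()
    for v in samples:
        k = np.searchsorted(w, v)          # v lies in interval H_k, k = 0..N
        val = 2.0 if k in (0, N) else 1.0  # overload intervals count double
        if k > 0:
            w[k - 1] += eta * val          # v is to the right of w_{k-1}
        if k < N:
            w[k] -= eta * val              # v is to the left of w_k
    return w

rng = np.random.default_rng(1)
w = mer_1d(rng.normal(0.0, 0.5, 200_000))  # approaches equiprobable quantiles
```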
Figure 1: Definition of quantization region. (A) Portion of a two-dimensional lattice A with a rectangular topology, represented in V-space. To every line crossing (vertex) corresponds a neuron weight vector. The bold line shows the outer border of the lattice. (B) The same portion of the lattice but with some of the neurons and quadrilaterals labeled. The quadrilaterals inside the lattice, labeled He , H f , Hh , and Hi , are the granular quantization regions; those that are labeled Ha , Hb , Hc , Hd , and Hg are the overload regions, and they are constructed by extending the links connecting the neuron on the border with its immediate neighbors (dashed lines).
We know that the unconditional entropy is maximal when each of the N neurons is active with equal probability. In appendix A we show that for the one-dimensional case, MER is guaranteed to yield a unique and equiprobable activity distribution: P(1) = · · · = P(i) = · · · = P(N) = 1/N, with P(i) the probability that neuron i is active; for the general d-dimensional case we show that the output (weight) density is proportional to the input density for large N (i.e., an equiprobable quantization). Furthermore, also in appendix A, we show that a tangled lattice cannot be a stable configuration in the d-dimensional (as well as in the one-dimensional) case.

3 Mutual Information Maximization

From the previous section we know that MER will yield a topographic map that maximizes the unconditional entropy of its N binary outputs, given the input distribution p(v). The binary outputs result from quantizing the input signal v (i.e., the $\mathbb{1}_{H_j}$'s). Define F(y, x) as the mutual information between the network's binary output y and the network's desired binary output x in response to noiseless input signals. The mutual information can be written as

$$F(y, x) = I(y) - I(y \mid x), \qquad (3.1)$$
with I(y) the unconditional entropy of y and I(y | x) the conditional entropy of y, given x. By maximizing F, we maximize the correspondence between the y and x vectors. Assume that we have at our disposal two types of input signals: the input signal v, which is the sum of a clean input signal s and input noise n (additive noise), and the input noise signal n. Define further ps as the clean signal distribution, pn as the noise distribution, and psn as the distribution of v, that is, the input signal distribution in the presence of additive noise. Linsker (1992) suggested maximizing the mutual information between the signal portion of the noisy input (additive noise) and the output response to the noisy input, and introduced a learning strategy in which gradient ascent was performed on F directly, with a negative learning rate for I(y | x). Here we will propose a different, much simpler learning strategy for maximizing our definition of average mutual information. We can rewrite equation 3.1 as F(y, x) = I(x) − I(x | y), with I(x) the unconditional entropy of x, and with I(x | y) the average uncertainty about x when we know y (equivocation). Hence, a y vector does not contribute to I(x | y) when its activation region coincides with that of an x vector, or when it is a subset thereof. Since I(x | y) is expected to decrease as the distribution of y vectors approaches the desired distribution of x vectors, the mutual information F will be maximal when I(x | y) is minimal. (Note that max(F) = max(I(x)) = log N.) Hence, F is a function of the binary output vectors y only, and the problem reduces to one of determining the topographic map that maximizes the unconditional entropy of the distribution of output responses to ps. This will be achieved in the following way. We use MER to derive two maps: one that maximizes the unconditional entropy of the map's output corresponding to psn and another that maximizes it for pn. The purpose of this entropy maximization is to derive nonparametric models of both input distributions: since entropy maximization leads to an equiprobable quantization and since, by the latter, the weight distribution closely approximates the input signal distribution for large N, MER can be used for reconstructing the shape of psn and pn (e.g., using splines). (Note that by the separation of density estimation from density reconstruction, such an approach is different from other adaptive nonparametric density estimation approaches, such as adaptive kernel- and nearest-neighbor density estimation; see Silverman 1986.) Once psn and pn are modeled in this way, we can infer the clean signal distribution: since the noise is additive, psn equals the convolution of ps and pn. Hence, in principle, we can obtain the desired clean signal distribution ps by deconvolution. We can then reapply MER to determine the weight vectors that maximize the unconditional entropy of the distribution of output responses to ps and, thus, that maximize F. For the sake of exposition, we have used in the simulations a procedure that is even simpler than deconvolution. Assume that both input distributions are univariate gaussians and that N is odd. Hence, at convergence, the averaged weights wmid, with mid = (N − 1)/2 + 1, represent also the means of the corresponding distributions.
The weights for ps are inferred from those for psn and pn in the following way: $w_j^s = \sqrt{1 - \rho^2}\, w_j^{sn}$, with $\rho = (w_j^n - w_{mid}^n)/(w_j^{sn} - w_{mid}^{sn})$, ∀j, j ≠ mid. Evidently this leads to the correct solution when psn and pn are gaussians; for other unimodal, symmetric distributions with small higher-than-second-order moments, it is an approximation. Finally, we note that the estimation of psn and pn is a common procedure in model-based noise-reduction schemes. For example, in speech applications, one obtains estimates of (environmental) noise when no speech signal is present. Knowledge of the underlying input distributions greatly improves the quality of, for example, blind signal identification (Pajunen et al. 1996), speech enhancement (Xie & Van Compernolle 1994), and speech recognition (Rabiner & Juang 1993), and it is also useful for hybrid systems (based on mutual information maximization; see Rigoll 1994), adaptive waveform coding (Martinez & Yang 1995), and adaptive channel equalization (Adali et al. 1995).

4 Simulations

We run two networks, one for psn and another for pn, until convergence and determine the weights of ps, as explained in the previous section. We then compute numerically the neural activations for ps given these weights. Finally, we display the performance of our MER-based procedure. Since entropy is not a very sensitive metric, due to its logarithmic nature, we plot the mean squared error (MSE) between the present neural activations for ps and the desired equiprobable ones. For MER, we use the fast version described in appendix B; the learning rate is η = 0.0001. We will give two examples. First, we consider the standard case where the samples from ps and pn are randomly and independently drawn from zero-mean gaussians with standard deviations σs and σn, respectively. The result is shown in Figure 2A for σs = 0.5 and different values of σn, and for N = 17 and 33. We observe that the MER-based procedure yields a rather accurate estimate of ps up to σn = 0.5 (a signal-to-noise ratio, SNR, of 0 dB). Second, in order to show that our MER-based procedure also works for nongaussian distributions, we consider the case where ps is a zero-mean symmetric product distribution and pn a zero-mean uniform distribution. The symmetric product distribution is a strongly peaked p.d.f. that is generated by taking the product of two random numbers that are uniformly distributed within the ranges [0, 1/√2) or (−1/√2, 0]; both ranges are equally probable. The range of the product distribution is (−1/2, 1/2) and that of the uniform distribution is [−a, a). The product distribution falls off log-hyperbolically from the center. The result is shown in Figure 2B. We observe that the MER-based procedure yields a significant improvement at least up to a = 0.5 (SNR = −4.8 dB). In order to verify the effectiveness of our "gaussian" approximation, we have applied a deconvolution procedure in which MER is used for estimating psn and pn (the deconvolution procedure is detailed in appendix C).
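A sketch of the gaussian shortcut used in place of deconvolution, assuming N odd, zero-mean unimodal distributions, and converged (equiprobable) MER weights for the noisy signal and for the noise. The clamping of ρ² at 1 is a numerical safeguard added here, not part of the paper.

```python
import numpy as np

def infer_clean_weights(w_sn, w_n):
    # w_j^s = sqrt(1 - rho^2) * w_j^sn, with
    # rho = (w_j^n - w_mid^n) / (w_j^sn - w_mid^sn), for all j != mid.
    N = len(w_sn)
    mid = (N - 1) // 2        # 0-based index of the middle weight
    w_s = np.asarray(w_sn, dtype=float).copy()
    for j in range(N):
        if j == mid:
            continue
        rho = (w_n[j] - w_n[mid]) / (w_sn[j] - w_sn[mid])
        w_s[j] = np.sqrt(max(0.0, 1.0 - rho ** 2)) * w_sn[j]  # clamp: safeguard
    return w_s
```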
Figure 2: Mutual information maximization performance of the MER-based procedure. The performance is expressed as the MSE between the neural activations for the clean signal distribution and the desired, equiprobable ones. The thin and thick solid lines show the MSE result for N = 17 and 33, respectively. The thin and thick dashed lines show the corresponding MSE results in case only psn is maximized. This serves as a reference against which the performance of the MER-based procedure is to be compared. (A) Result in case of zero-mean gaussians for the clean and the noise signal with standard deviations σs and σn, respectively; σs = 0.5. (B) Result in case of the symmetric product distribution with range (−1/2, 1/2) for the clean signal and the uniform distribution with range [−a, a) for the noise signal; σs ≈ 0.166 and σn = a/√3. The fourth moment of the clean signal is ≈ 2.5 × 10⁻³. The thin and thick dot-dash lines denote the results obtained in case of deconvolution for N = 17 and 33, respectively.
However, we observe that deconvolution yields inferior results, at least for the range of a shown (dot-dash lines). The inferior performance is due to the presence of the sharp peak in ps, which makes the latter difficult to reconstruct. We also observe that the procedure breaks down below a = 0.2. Finally, we note that the I(y) values obtained in our simulations were typically within 2% of the theoretically attainable range, although the weights could be significantly different from the optimal ones. Hence, we feel that performing gradient ascent on I(y) directly is not recommended, owing to its logarithmic nature, especially when the (decrease in the) progress made in I(y) is used in the stopping criterion.

Appendix A: Properties of MER

Proposition 1.
For a discrete and statistically stationary input probability density p(v), equation 2.2 on average (for small enough η) performs (stochastic) gradient descent on the cost function
$$E = \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v^k)\, |v^k - w_i|, \qquad (A.1)$$

with $\{v^k\}$ the set of input signals, and $|v^k - w_i| = \sum_{j=1}^{d} |v_j^k - w_{ij}|$ (i.e., the absolute "per-letter" distortion).
Proof. Following the definition of gradient descent, the weights are updated so as to reduce the cost term introduced by the active quantization regions (defined by the weight vectors of the previous iteration step). Hence, since equation A.1 is continuous and piecewise differentiable, we obtain

$$\langle \Delta w_i \rangle_V = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{k=1}^{M} \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v^k)\, \mathrm{Sgn}(v^k - w_i), \qquad \forall i \in A, \qquad (A.2)$$

and the latter exactly corresponds to the right-hand side of equation 2.2 summed over all input patterns. Since E is always positive definite and its partial derivatives (equation A.2) are continuous, equation A.1 is a Lyapunov function.

For the case where p(v) is continuous, we modify the parallel update scheme of MER to a sequential one: (1) one quadrilateral out of the set of active quadrilaterals is chosen randomly, and (2) one vertex out of the vertices of the randomly chosen quadrilateral is updated using MER.

Proposition 2. For a continuous and statistically stationary input p.d.f. p(v) and a sequential update scheme, a Lyapunov function exists for averaged MER.

Proof. The derivatives of the average learning rule comprise a matrix $H^E = [H^E_{ij}]$, $H^E_{ij} = \partial \langle \Delta w_i \rangle_V / \partial w_j$, with components $H^E_{ij} = 0$ for $i \ne j$, and $H^E_{ii} \ne 0$. Since the matrix $H^E$ is symmetric, it is a Hessian. Hence, if and only if $H^E$ is a Hessian, a Lyapunov function exists.

Proposition 3. In the one-dimensional case, MER is guaranteed to yield a set of equiprobable neurons.

Proof. Consider a one-dimensional lattice with quantization intervals H1, . . . , HN+1 defined by the weights w1, . . . , wN. Due to the existence of a Lyapunov function, we know that MER will converge on average.
We have that, at convergence,

$$\langle \Delta w_i \rangle_V = 0 = \langle \mathbb{1}_{H_{i+1}} - \mathbb{1}_{H_i} \rangle_V = \frac{2\, P(H_{i+1})}{n_{H_{i+1}}} - \frac{2\, P(H_i)}{n_{H_i}}, \qquad i = 1, \ldots, N, \qquad (A.3)$$

with P(Hi) the probability that Hi is active (i.e., $P(\mathbb{1}_{H_i} \ne 0)$). Now since, by definition, neuron i is activated when either Hi or Hi+1 or both are activated, the probability that neuron i is activated becomes

$$P(i) = \frac{P(H_i)}{n_{H_i}} + \frac{P(H_{i+1})}{n_{H_{i+1}}}$$

(where we have normalized to keep $\sum_i P(i) = 1$). If we now substitute for the P(Hi)-equalities obtained from equation A.3, we obtain that P(1) = · · · = P(i) = · · · = P(N), and thus a set of equiprobable neurons, because $n_{H_i} = 2$ for $1 < i < N + 1$ and $n_{H_1} = n_{H_{N+1}} = 1$.

Corollary 1. In the one-dimensional case, the averaged neuron weights w1, . . . , wN at convergence are the medians of the intervals they delimit.

Proof.
Follows directly from the equiprobable quantization.
Define a kink at neuron i as the topological defect for which either wi > wi+1 and wi > wi−1, or wi < wi+1 and wi < wi−1.

Proposition 4. In the one-dimensional case, MER is guaranteed to converge on average to a one-dimensional lattice without kinks.

Proof by indirect demonstration. Assume the inverse: a one-dimensional lattice with at least one kink is a stable solution. Assume that there is a kink at neuron i; hence, the intervals Hi and Hi+1 overlap and are located at the same side of i. This signifies that neuron i will receive weight updates only in the direction of both overlapping intervals, and never in the opposite direction. By consequence, neuron i will keep on shifting until the overlap is removed. Hence, the kink at neuron i is not a stable configuration. Since kinks are the only topological defects of one-dimensional lattices, the latter two propositions also signify that MER converges to a unique set of equiprobable neurons: P(1) = · · · = P(i) = · · · = P(N) = 1/N.

Before we turn to the d-dimensional case, we need an additional definition. Since MER performs weight updates componentwise, we can divide the region within which neuron i can be activated into $2^d$ quadrants: for every input falling in the same quadrant, the magnitude and direction of the weight update vector are the same (cf. the sign functions in MER).
For the higher-than-one-dimensional case, one can easily verify that for each neuron i, the (averaged) weight vector wi at convergence represents the d-dimensional median of Si, on condition that the median is defined as the vector of the (scalar) medians in each of the d input dimensions separately. (Note that there exists no unique definition of the median in d dimensions.) The weights satisfying this condition are stable equilibrium points of our averaged learning rule. Furthermore, when all opposite quadrants of a neuron carry nonzero activation probabilities, these quadrants will be equiprobable.

Proposition 5. In the d-dimensional case, the output (weight) density λ(wi) at convergence is proportional to the input density p(v) when N grows large, given that the neurons' activation regions Si span nonzero volumes in V-space.

Proof. In the asymptotic situation where N is large, the discrete weight distribution can be expected to approximate a continuous density function λ having unit volume. Then λ ΔVol may be taken as the fraction of weight vectors (or neurons) located in an incremental volume element ΔVol. Thus, the volume of the quantization region Si associated with neuron i is given approximately by

$$\mathrm{Vol}(S_i) \approx \frac{1}{N\, \lambda(w_i)} \qquad (A.4)$$

for every bounded region Si (λ ≠ 0). (Note that the denominator is the number of points in the neighborhood of wi, so that the reciprocal is the volume per neuron.) For large N it is reasonable to assume that most of the regions Si will be bounded sets, and the contribution from the "overload" regions will be negligible. Furthermore, also for large N, we can make the approximation that p(v) ≈ p(wi), ∀v ∈ Si (given Vol(Si) ≠ 0). We then obtain, for the probability for neuron i to be active,

$$P(i) = \mathrm{Vol}(S_i)\, p(w_i) \approx \frac{p(w_i)}{N\, \lambda(w_i)}, \qquad (A.5)$$

after substitution of equation A.4. We can also rewrite the contribution made by neuron i to the cost function associated with MER as follows:

$$E_i = P(i) \int_{S_i} |v - w_i|\, dv = \frac{p(w_i)}{N\, \lambda(w_i)} \int_{S_i} |v - w_i|\, dv \qquad (A.6)$$

(or with a summation over Si instead of an integral in case of a discrete input distribution). We know that E is a Lyapunov function; hence, at convergence, ∂Ei/∂wi = 0, and wi will be the "median" of Si (our definition). We have also assumed that the volume Vol(Si) ≠ 0; hence, the integral term in equation A.6 will always be positive.
Taking the derivative of equation A.6 yields

$$\frac{\partial E_i}{\partial w_i} = 0 = \left[ \frac{\partial}{\partial w_i} \frac{p(w_i)}{N\, \lambda(w_i)} \right] \int_{S_i} |v - w_i|\, dv + \frac{p(w_i)}{N\, \lambda(w_i)}\, \frac{\partial}{\partial w_i} \int_{S_i} |v - w_i|\, dv. \qquad (A.7)$$
We know that the derivative of the integral equals zero, since wi is the "median" of Si, and that the integral is positive and nonzero. Hence, the derivative of the ratio p(wi)/λ(wi) must be equal to zero; in other words, the ratio is a constant. In conclusion, we have that λ(wi) ∝ p(v). Furthermore, since each quadrilateral is common to $2^d$ adjacent neurons, λ will be a smooth function.

Corollary 2. In the d-dimensional case, we have an equiprobable quantization for large N, given λ a smooth function.

Proof. Since for large N and λ smooth, the ratio p/λ reduces to a constant independent of the lattice index i, we have that each Si contributes equally in terms of activation. Hence, we have an equiprobable quantization.

Proposition 6.
In the d-dimensional case, a tangled lattice is not a stable solution.
Proof by indirect demonstration. Assume the inverse: a lattice with at least one topological defect is a stable solution. Without loss of generality, we can consider a two-dimensional lattice. In case of a topological defect, such as at neuron i, at least two quadrilaterals will overlap. Assume that (part of) the overlap is located in the upper-left quadrant $Q_{UL}$ and that an input falls in it: hence, quadrant $Q_{UL}$ is activated once. However, due to the twofold overlap, neuron i will be updated twice. (Note that a single weight update is the same for each quadrant.) As a result, the update probabilities do not match the activation probabilities of the assumed equiprobable quadrants, and the average learning rule cannot be stable. Hence, the locally tangled lattice cannot be a stable configuration.

Appendix B: Fast Version of One-Dimensional MER

In the one-dimensional case, MER becomes

$$\Delta w_i = \eta\, (\mathbb{1}_{H_{i+1}} - \mathbb{1}_{H_i}), \qquad i = 1, \ldots, N. \qquad (B.1)$$

The average squared change of weights per iteration (intensity of variation) at convergence is $2\eta^2/(N+1)$ for $i = 2, \ldots, N-1$, and $3\eta^2/(N+1)$ for $i = 1$ or $N$.
The rate of convergence for a p.d.f. with bounded support and for which the weights are initialized outside the p.d.f.'s range is on the order of O(η/N²). A faster version of MER is found by updating all weights at each time step:

$$\Delta w_i = \eta \left( \sum_{k=i+1}^{N+1} \frac{\mathbb{1}_{H_k}}{N+1-i} - \sum_{k=1}^{i} \frac{\mathbb{1}_{H_k}}{i} \right), \qquad i = 1, \ldots, N. \qquad (B.2)$$
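A sketch of one step of this faster rule. It assumes the sign convention of equation B.1 (activity in intervals to the right of a weight pushes it right) and, for simplicity, uses indicator values of 1 for the single active interval, omitting the boundary-interval weighting of equation 2.1.

```python
import numpy as np

def fast_mer_step(w, v, eta=1e-4):
    # Equation B.2: every weight moves at each step, pulled toward the side
    # of the active interval, with step size inversely proportional to the
    # number of intervals on that side. 0-based neuron i delimits intervals
    # H_i and H_{i+1}, so it has i + 1 intervals on its left and N - i on
    # its right.
    N = len(w)
    k = np.searchsorted(w, v)          # active interval H_k, k = 0..N
    for i in range(N):
        if k >= i + 1:                 # input to the right of w_i
            w[i] += eta / (N - i)
        else:                          # input to the left of w_i
            w[i] -= eta / (i + 1)
    return w
```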
By considering the homogeneous system of averaged equations B.2 at convergence, it can easily be shown that the faster version of MER also yields a set of equiprobable neurons. The intensity of variation is now $\eta^2/((N+1-i)\,i)$ for $i = 2, \ldots, N-1$, and $\eta^2 (2N+1)/((N+1)N)$ for $i = 1$ or $N$. The rate of convergence is on the order of O(η/N). Hence, for the same N, the faster version of MER converges N times faster and generates an N times smaller intensity of variation at convergence. This is the rule we have used in the simulations.

Appendix C: Deconvolution

The samples taken from pn and psn are quantized using MER. At the midpoints of the resulting intervals, the density estimate is located, and the shape of the underlying density curve is approximated using cubic splines. This approximation is then resampled into 8192 equidistant points in the range [−4, 4]. Deconvolution is performed via the complex cepstrum (Oppenheim & Schafer 1975). The process of deconvolution is quite sensitive to the accuracy with which the shapes of pn and psn are obtained. Therefore, we additionally use optimal filtering: we apply a bandpass filter to psn of which the (integer) low and high cutoff frequencies are chosen in such a way that the convolution of the estimated shape of ps with the shape of pn is as close as possible to the shape of psn (in the MSE sense). The adaptation of the cutoff frequencies is nonrecursive: it simply yields a table from which the best combination is chosen. Once the best estimated shape of ps is obtained, the equiprobable intervals are determined numerically by integration (extended trapezoidal rule). The correctness of these intervals is then verified using the true shape of ps: the difference from an equiprobable quantization is expressed in terms of an MSE value and plotted in Figure 2B.

Acknowledgments

I thank the anonymous reviewers for their valuable comments. I am a research associate of the National Fund for Scientific Research (Belgium) and am supported by research grants received from the National Fund (G.0185.96) and the European Commission (ECVnet EP8212).
References

Adali, T., Liu, X., Li, N., & Sönmez, M. K. (1995). A maximum partial likelihood framework for channel equalization by distribution learning. In Proc. IEEE NNSP95 (pp. 541–550).
Atick, J. J., & Redlich, A. N. (1991). Predicting ganglion and simple cell receptive field organizations. International Journal of Neural Systems, 1, 305–315.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks (Washington, 1989), 1, 401–405.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.
Martinez, D., & Yang, W. (1995). A robust backward adaptive quantizer. In Proc. IEEE NNSP95 (Cambridge, MA, 1995) (pp. 531–540).
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. Englewood Cliffs, NJ: Prentice Hall.
Pajunen, P., Hyvärinen, A., & Karhunen, J. (1996). Nonlinear blind source separation by self-organizing maps. In S.-I. Amari, L. Xu, L.-W. Chun, I. King, & K.-S. Leung (Eds.), Progress in neural information processing (pp. 1207–1210). Proceedings of the International Conference on Neural Information Processing.
Rabiner, L., & Juang, B. H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Rigoll, G. (1994). Maximum mutual information neural networks for hybrid connectionist-HMM speech recognition systems. IEEE Transactions on Speech and Audio Processing, 2, 175–184.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Rubner, J., & Schulten, K. (1990). Development of feature detectors for self-organization. Biological Cybernetics, 62, 193–199.
Sanger, T. (1989). An optimality principle for unsupervised learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 11–19). San Mateo, CA: Morgan Kaufmann.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Xie, F., & Van Compernolle, D. (1994). A family of MLP based nonlinear spectral estimators for noise reduction. In Proc. ICASSP'94, 2 (pp. 53–56).
Received August 23, 1995; accepted May 6, 1996.
Communicated by Alexandre Pouget
Self-Organization of Firing Activities in Monkey's Motor Cortex: Trajectory Computation from Spike Signals

Siming Lin
Jennie Si
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287 USA

A. B. Schwartz
Neurosciences Institute, San Diego, CA 92121 USA
The population vector method has been developed to combine the simultaneous direction-related activities of a population of motor cortical neurons to predict the trajectory of the arm movement. In this article, we consider a self-organizing model of a neural representation of the arm trajectory based on neuronal discharge rates. A self-organizing feature map (SOFM) is used to select the optimal set of weights in the model to determine the contribution of an individual neuron to an overall movement representation. The correspondence between movement directions and discharge patterns of the motor cortical neurons is established in the output map. The topology-preserving property of the SOFM is used to analyze the recorded data of a behaving monkey. The data used in this analysis were taken while the monkey was tracing spirals and doing center→out movements. The arm trajectory could be well predicted using such a statistical model based on the motor cortex neuronal firing information. The SOFM method is compared with the population vector method, which extracts information related to trajectory by assuming that each cell has a fixed preferred direction during the task. This implies that these cells are acting along lines labeled only for direction. However, extradirectional information is carried in these cell responses. The SOFM has the capability of extracting not only direction-related information but also other parameters that are consistently represented in the activity of the recorded population of cells.
1 Introduction The population vector method was developed by Georgopoulos and his collaborators to investigate the relations between the neuronal activity in the motor cortex of the monkey and the direction of arm movements in twoand three-dimensional spaces (Georgopoulos et al. 1982, 1983, 1984, 1988). Neural Computation 9, 607–621 (1997)
© 1997 Massachusetts Institute of Technology
In the study of reaching movement in a two-dimensional space, monkeys were trained to make point-to-point movements from a center start position to one of eight targets spaced equally around a circle (Schwartz 1992). Neurons recorded during this task were found to have mean discharge rates that were highest in a single "preferred" direction and tapered off gradually in directions farther away from the preferred direction, firing at their lowest rates for movement directions 180 degrees from the preferred direction. The relation between movement directions and discharge rates for individual neurons could be fit with a cosine tuning function (Georgopoulos et al. 1982),

D = b_0 + k \cos(θ − θ_0),  (1.1)
where b_0 is a regression coefficient corresponding to the mean activity in the task, k is a regression coefficient corresponding to the depth of the modulation across different movement directions, θ is the direction of the movement, θ_0 is the preferred direction, and D is the discharge rate. The tuning function of individual motor cortical neurons spans the entire directional domain. Each directional neuron encodes all directions of movement. Conversely, all neurons in the cortical population simultaneously encode a single movement (Schwartz 1993). If C_i is the unit preferred direction vector for the ith neuron, then the neuronal population vector P(t) is determined as the weighted sum of these vectors,

P(t) = \sum_i r_i(t) C_i,  (1.2)
where the weight r_i(t) is the activity (discharge rate minus the geometric mean) of the ith neuron at time bin t. The discharge rate can be measured during the monkey's arm movements, resulting in a series of weighted vectors for each neuron. When these vectors are summed across the population for each bin, the resulting series of population vectors P(t) represents well the direction of the movement as it changes during the task. Moreover, the magnitudes of these population vectors suggest the instantaneous displacement (speed) throughout the task. Thus, connecting these population vectors tip-to-tail, one may obtain a neural trajectory. It was shown that real trajectories of arm movement could be predicted by neural trajectories (Georgopoulos et al. 1988; Schwartz 1993, 1994). To use the population vector approach described, one needs to know the preferred direction of every neuron in the cortical population. Usually this is determined experimentally. Any bias in the preferred direction vectors could result in prediction error in the trajectory. Some other methods have been used for population decoding. The simulated annealing algorithm has been used to adjust the connection strengths of a feedback neural network so that it would generate a given trajectory
by a sequence of population vectors (Lukashin and Georgopoulos 1994). As in the population vector algorithm, the preferred direction is required in this algorithm. The optimal linear estimator (OLE) is a statistical method for population decoding (Salinas & Abbott 1994). The estimation formula of OLE is

V_{est}(t) = \sum_i r_i(t) D_i,  (1.3)

where V_est is the estimate of the movement direction and r_i is the measured firing rate (or normalized firing rate) of neuron i. Note that D_i is different from the preferred direction. Specifically,

D_i = \sum_j Q^{-1}_{ij} L_j,  (1.4)

with

L_j = \int V f_j(V) \, dV,  (1.5)

and Q is the correlation matrix of firing rates determined as

Q_{ij} = \int r_i r_j P(r | V) \, dr \, dV,  (1.6)

where V denotes the actual movement direction, P(r | V) denotes the probability of obtaining the firing response r given that the movement direction takes the value V, and f_i(V) is the average discharge rate of cell i when the movement direction takes the value V. Equivalently,

f_i(V) = \int r_i P(r | V) \, dr.  (1.7)

It is shown that the cosine tuning curve is the optimal case for linear decoding methods, and the OLE method can produce the smallest average error of the linear methods (Salinas & Abbott 1994). However, the OLE method may encounter difficulty when the number of cells used is large and when Q is ill conditioned. When noise is present, the recorded discharge rates are no longer perfect cosine tuning curves, and the assumption made for guaranteeing the performance of OLE is no longer true. One of the objectives of this study is to provide a statistical model that extracts directional patterns encoded in the discharge rates with the following advantages: (1) it does not require the preferred directions, (2) it is robust against noise, and (3) it is easy to implement in hardware. The population vector method suggests a close association between the discharge rates of cells in the motor cortex and the direction of arm movement.
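For concreteness, the population-vector computation of equation 1.2 and the tip-to-tail trajectory construction translate into a few lines of Python/NumPy. The function name and array layout below are illustrative assumptions, not part of the original study:

```python
import numpy as np

def population_vector_trajectory(rates, preferred_dirs, dt=1.0):
    # rates: (T, n) activities r_i(t), i.e., discharge rate minus each
    # cell's mean; preferred_dirs: (n, 2) unit vectors C_i.
    # P(t) = sum_i r_i(t) C_i   (equation 1.2)
    P = rates @ preferred_dirs                 # (T, 2) population vectors
    # Connecting the P(t) series tip-to-tail (a cumulative sum) gives
    # the neural trajectory discussed in the text.
    trajectory = np.cumsum(P * dt, axis=0)
    return P, trajectory
```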
In this article, we consider a self-organizing model of the neuronal discharge patterns based on neuronal discharge rates. The inputs to the model are averaged discharge rates from a group of cells in the monkey’s motor cortex. These discharge rates were taken while the animal was tracing spirals or doing center→out reaching. The output map of the model is a two-dimensional plane. One dimension of the polar coordinate codes the predicted direction of arm movement; the other contains nonspecific movement-associated parameters such as speed and position. The nodes in the output map are interconnected by a neighborhood function during training. SOFM was used to adapt the weights of the network to generate a mapping between discharge rates of motor cortical neurons and directions of arm movement. The main advantage of this approach is that preferred directions are not prerequisites for obtaining predicted directions. Consequently the possibility of error accumulation from nonuniformities in preferred directions is reduced. Moreover, the topology-preserving property of the SOFM enables us to calibrate the output map with the topology relation within input discharge rates, which is useful when interpreting the computation results. Here we focus on the statistical discrimination of behavioral modes while the monkey was tracing spirals. The computer simulations reveal that the behavioral modes are clearly differentiated, and the map simultaneously provides a clear topological relationship among these modes. The movement directions during the spiral tracing task are well predicted by the discharge rate patterns using the self-organizing model. The population vector method has already been used to show that directions and speed are well represented in the activity of these cells (Schwartz 1993, 1994). This method assumes the linear functional relationship between discharge rates and movement directions as shown in equation 1.2. It may not capture all the consistent relationships between cell activity and different parameters that could be represented in this activity. In contrast, the statistical nature of the SOFM technique will capture these relationships without a priori assumptions about what is represented or the form of the code.
2 Spike Signal and Feature Extraction

A rhesus monkey was trained to trace with its index finger on a touch-sensitive computer monitor. In the spiral tracing task, the monkey was trained to make a natural, smooth drawing movement within approximately a 1 cm band around the presented figure. As the figure appeared, the target circle of about 1 cm radius moved in small increments along the spiral figure, and the monkey moved its finger along the screen surface to the target. The spirals we study in this article are outside→in spirals. (A more detailed description of this process can be found in Schwartz 1992, 1994.)
Figure 1: Calculation of d_ij: the number of spike intervals overlapping with the jth time interval is determined to be 3. As shown, two-thirds of the first interval, all of the second interval, and one-fourth of the third interval lie in the jth time window. Therefore, d_ij = (2/3 + 1 + 1/4)/dt.
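The overlap rule illustrated in this caption translates directly into code. A minimal sketch (the helper name is hypothetical; Python/NumPy) computes d_ij for one cell and one window, and returns (2/3 + 1 + 1/4)/dt for the three intervals of Figure 1:

```python
import numpy as np

def discharge_rate(spike_times, t_j, dt):
    # d_ij for one cell and one window (t_j, t_j + dt): each interval
    # between neighboring spikes contributes to D_ij the fraction of
    # that interval lying inside the window, and d_ij = D_ij / dt.
    spikes = np.sort(np.asarray(spike_times, dtype=float))
    D = 0.0
    for lo, hi in zip(spikes[:-1], spikes[1:]):
        overlap = min(hi, t_j + dt) - max(lo, t_j)
        if overlap > 0:
            D += overlap / (hi - lo)   # 1 for a fully contained interval
    return D / dt
```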
While the well-trained monkey was performing the previously learned routine, the occurrence of each spike signal in an isolated cell in the motor cortex was recorded. The result of these recordings is a spike train at the following typical time instances: τ_1, τ_2, . . . , τ_L. This spike train can be ideally modeled by a series of δ functions at the appropriate times:

S_c(t) = \sum_{l=1}^{L} δ(t − τ_l).  (2.1)
Let Ω be the collection of the monkey's motor cortex cells whose activities contribute to the monkey's arm movement. Realistically, we can record only a set of sample cells S = {S_i ∈ Ω, i = 1, 2, . . . , n} to analyze the relation between discharge rate patterns and the monkey's arm movements. We assume that the cells included in the set S constitute a good representation of the entire population. The firing activity of each cell S_i defines a point process f_{S_i}(t). Let d_ij denote the average discharge rate for the ith cell in the jth interval (t_j, t_j + dt). Taking d_ij of all the cells {S_i ∈ Ω, i = 1, 2, . . . , n} in (t_j, t_j + dt), we obtain a vector X_j = [d_1j, d_2j, . . . , d_nj]^T, which will later be used as the feature input vector relating to the monkey's arm movement during the time interval (t_j, t_j + dt). The average discharge rate d_ij = D_ij/dt is calculated from the raw spike train as follows. For the jth time interval (t_j, t_j + dt), one first has to determine all the intervals between any two neighboring spikes that overlap with the jth time interval. Each of the spike intervals overlapping completely with (t_j, t_j + dt) contributes 1 to D_ij. Each of those overlapping only partially with (t_j, t_j + dt) contributes the fraction of its overlap to D_ij. An illustrative example is shown in Figure 1. Due to limitations in experimental realization, the firing activities of the n cells in S could not be recorded simultaneously. Consequently, the average
discharge rate vector X_j could not be calculated simultaneously at time t_j. However, special care was taken to ensure that the monkey repeated the same experiment n times to obtain the firing signal for all the n cells under almost the same experimental conditions. We thus assume that the experiments recording every single cell's spike signal were independent, identical events, so that we could use spike signals of the n cells as if they were recorded simultaneously. In our simulations, spike signals from 81 cells in the motor cortex were used. The average discharge rates were calculated every 20 ms. The entire time span of each trial may differ a little from one trial to another. Cubic interpolations were used to normalize the recorded data (average discharge rates and movement directions in degrees) to 100 time bins. At a given bin, normalized directions (in degrees) were averaged across the 81 recordings, giving rise to a finger trajectory sequence: {∠_1, ∠_2, . . . , ∠_100}. Correspondingly, at every time bin j, we have a normalized discharge rate vector X_j = [d_1j, d_2j, . . . , d_81j]^T, j = 1, 2, . . . , 100. Discharge rate vectors and moving directions were used to train and label the SOFM.

3 The Computation Model

The SOFM has been applied successfully to speech processing (Kohonen 1988), robotics (Ritter & Schulten 1986), vector quantization (Nasrabadi & Feng 1988; Yair & Zeger 1992), and biological modeling (Obermayer et al. 1990). The SOFM learns a mapping from the input data space onto a two-dimensional array of output nodes, where each node p carries a weight vector W_p. At each step, the weights are updated by

W_p(t + 1) = W_p(t) + α(t)[X(t) − W_p(t)],  ∀p ∈ N_c(t),  (3.1)
W_p(t + 1) = W_p(t),  ∀p ∉ N_c(t),  (3.2)

where t = 0, 1, 2, . . . is the discrete-time coordinate, α(t) is the learning rate factor, and N_c(t) defines a set of neighborhood nodes of the winner node c. The winner node c is defined as the node whose weight vector has the smallest Euclidean distance from the input X(t):

\|W_c(t) − X(t)\| ≤ \|W_p(t) − X(t)\|  ∀p.  (3.3)
In our current application, we have used a 20×20 two-dimensional lattice as the output layer and α(t) = 0.95(1 − t/N) with N = 200,000. We started with the initial radius of N_c(t) equal to the diameter of the output layer and let it shrink with time.
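A minimal training loop for this setup, implementing equations 3.1 through 3.3, might look as follows. The linear shrink schedule for the neighborhood radius is our assumption; the text states only that the radius starts at the diameter of the output layer and shrinks with time:

```python
import numpy as np

def train_sofm(X, grid=20, N=200_000, seed=0):
    # X: (NP, n) discharge rate vectors; 20x20 lattice and
    # alpha(t) = 0.95 * (1 - t/N) follow the text.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(size=(grid, grid, n))         # node weight vectors W_p
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)
    for t in range(N):
        x = X[rng.integers(len(X))]               # random training input X(t)
        # Winner c: smallest Euclidean distance to x (equation 3.3).
        d = np.linalg.norm(W - x, axis=-1)
        c = np.unravel_index(np.argmin(d), d.shape)
        # Neighborhood N_c(t): radius starts at the lattice diameter
        # and shrinks with time (linear schedule assumed here).
        radius = max(1.0, grid * (1.0 - t / N))
        in_nbhd = np.linalg.norm(coords - np.array(c), axis=-1) <= radius
        # Update nodes inside N_c(t), leave the rest (equations 3.1-3.2).
        W[in_nbhd] += 0.95 * (1.0 - t / N) * (x - W[in_nbhd])
    return W
```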
In our application, the SOFM acts as a decoder. By finding the winners, the SOFM maps each input onto a node in the output layer. Every node in the output layer decodes an association with inputs. The critical feature of this algorithm is topology preservation. We call a map topology preserving if (1) similar input vectors are mapped onto identical or neighboring output nodes, and (2) neighboring nodes have similar weight vectors. Property 1 ensures that small changes in the input vector cause correspondingly small changes in the location of the output winner. This makes the SOFM robust against distortions of the inputs. Property 2 ensures robustness of the inverse mapping, which is critical for the accuracy of the SOFM as a decoder.

4 Trajectory Computation from Motor Cortical Discharge Rates

The major objective of this study is to investigate the relationship between a monkey's arm movements and firing patterns in the motor cortex. The topology-preserving property of the SOFM is used to analyze the recorded data of a behaving monkey. To train the network, we select n cells (in this study, n = 81) in the motor cortex. The average discharge rates of these n cells constitute a discharge rate vector X_j = [d_1j, d_2j, . . . , d_nj]^T, where d_ij, i = 1, . . . , n, is the average discharge rate of the ith cell at time bin t_j. A set of vectors {X_k | k = 1, 2, . . . , NP} recorded from experiments forms the training patterns of the network, where NP is the number of training vectors. During training, a vector X ∈ {X_k | k = 1, 2, . . . , NP} is selected from the data set at random. The discharge rate vectors are not labeled or segmented in any way in the training phase. All the features present in the original discharge rate patterns contribute to the self-organization of the map. Once the training is done, the network has learned the classification and topology relations in the input data space. Such a "learned" network is then calibrated using discharge rate vectors whose classifications are known. For instance, if a node i in the output layer wins most often for discharge rate patterns from a certain movement direction, then the node is labeled with this direction. In this part of the simulation, we used the numbers 1–16 to label the nodes in the output layer of the network, representing 16 quantized directions equally distributed around a circle. For instance, a node with the label 4 codes the movement direction of 90 degrees. If a node in the output layer of the network never wins in the training phase, we label the node with a −1. In the testing phase, a −1 node can become a "winner" just like those marked with positive numbers; the movement direction is then read out from its nearest node with a positive direction number. Figure 2 gives the learning results using discharge rates recorded from the 81 cells in the motor cortex in one trial of the spiral task. From the output map, we can clearly identify three circle-shaped patterns representing the spiral trajectory. These results show that the monkey's arm movement directions are clearly encoded in firing patterns of the motor cortex. By the
Figure 2: The self-organized map for the discharge rates of the 81 neurons in the motor cortex in the spiral task. The size of the output map of the SOFM is 20×20. The nodes marked with positive numbers are cluster centers, representing the directions as labeled, and the nodes labeled −1 are the neighbors of the centers. The closer a node is to a center, the more likely it is to represent the coded direction of that center.
property of topology preservation, we can discern a clear pattern from the locations of the nodes in the map. Note that nodes with the same numbers (i.e., directions) on different spiral cycles are adjacent. The same holds for the actual trajectory, and this is what we mean by topology preservation, an attribute of the SOFM. Different cycles of the spiral are located in correspondingly different locations of the feature map, suggesting that parameters in addition to direction influence cortical discharge rates.

4.1 Using Data from Spiral Tasks to Train the SOFM. In this experiment, we used data recorded from five trials in spiral tracing tasks. The first trial was used for testing and the rest for training. After the training and labeling procedure, every node in the output map of the SOFM was associated with one specific direction. The direction of a winner was called the "neural direction" when discharge rate vectors were presented as the input of the
Figure 3: The neural directions and finger directions in 100 bin series in the inward spiral tracing task.
network bin by bin in the testing phase. Figure 3 gives neural directions against the corresponding monkey’s finger movement directions. After postfiltering the predicted neural directions with a moving average filter of 3-bin width, we combined the neural directions tip-to-tail bin by bin. The resulting trace (neural trajectory) is similar to the finger trajectory. Note that the neural directions are unit vectors. To use vectors as a basis for trajectory construction, the vector magnitude must correspond to movement speed. During drawing, there is a “two-thirds power law” (Schwartz 1994) relation between curvature and speed. Therefore, the speed remains almost constant during the spiral drawing. This is the reason we can use unit vectors in this construction and still get a shape that resembles the drawn figure. Figure 4 demonstrates the complete continuous monkey’s finger movement and the predicted neural trajectory. 4.2 Using Data from Spiral and Center→Out Tasks to Train the SOFM. In addition to the spiral data, the data recorded in the center→out tasks were added to the training data set. Figure 5 shows that the overall performance
Figure 4: Using data from spiral task for training. Left: The monkey’s finger movement trajectory calculated with constant speed. Right: The SOFM neural trajectory.
Figure 5: Neural trajectory: Using data from spiral and center→out tasks for training.
of the SOFM has been improved. This result suggests that the coding of arm movement direction is consistent across the two different tasks.

4.3 Average Testing Result Using the Leave-k-Out Method. The data recorded from five trials in the spiral tasks and the center→out tasks were used in this experiment. Each time, we chose one of the trials in the spiral tasks for testing and used the rest for training. We thus had five different sets of experiments for training and testing the SOFM. Note that the training and testing data were disjoint in each set of experiments. The neural trajectories from the
Figure 6: Neural trajectory: Average testing result using leave-k-out method.
five tests were averaged, and Figure 6 shows the averaged result. Due to the nonlinear nature of the SOFM network, a different training data set may result in a different outcome in the map. Therefore, the leave-k-out method provides a realistic view of the overall performance of the SOFM. Figure 7 shows the bin-by-bin difference between the neural directions and the finger directions in 100 bins. The dashed line in the figure represents the average error over the 100 bins.

4.4 Trajectory Computation by the Population Vector Algorithm. Figures 8 and 9 give the computation results obtained by using the population vector method on the same data as in the leave-k-out test. The average error in the first circle of the neural trajectory is about 10 degrees. This is comparable to the average error of 8.65 degrees from the SOFM algorithm; however, the error grows over time, as shown in Figure 9.

5 Discussion

Our results based on raw data recorded from a monkey's motor cortex have revealed that the monkey's arm trajectory is encoded in the averaged discharge rates of the motor cortex. Note from Figure 2 that when the monkey's finger was moving along the same direction on different cycles of the spiral, one observes distinct cycles on the output map as well. By the topology-preserving property of the SOFM, this suggests that the output nodes decode not only directions but probably also speed, curvature, and position. Although the population vector method has previously shown this information to be present in the motor cortex, the SOFM model captures regularities
Figure 7: The binwise difference between the finger directions and the neural directions in 100 bins. The dashed line is the average error in 100 bins. The neural trajectory is the average testing result of the SOFM using the leave-k-out method.
Figure 8: Neural trajectory computed by using the population vector method. The same data of the 81 cells used in the leave-k-out method were used.
about speed and curvature without assuming any linear functional relationship to discharge rates, as the population vector and OLE methods do in equations 1.2 and 1.3, respectively. Although we did not explicitly examine speed in this study, it has been found that the population vector magnitude is highly correlated with speed (Schwartz 1993), showing that this
Figure 9: The binwise difference between the finger directions and the neural directions in 100 bins. The dashed line is the average error in 100 bins. The neural trajectory is computed by using the population vector method.
parameter is represented in motor cortical activity. For those drawing movements with more significant changes in speed, we may quantize the speed as we did with the directions. As discussed in section 4, direction and curvature or speed may be encoded in discharge rates as a feature unit; we can then label the nodes in the output map using directions and speed together. Consequently, the same process of direction recognition using the SOFM could be applied to speed recognition based on discharge rates of the motor cortex. Other types of networks, such as sigmoid feedforward and radial basis function networks, could have been used to associate discharge rates with movement directions. The SOFM, however, offers a unique combination of both data association (which some other networks can achieve) and topology information as represented by the two-dimensional output map (which is hard or even impossible for other networks to implement). Hebb hypothesized in 1949 that the basic information processing unit in the cortex is a cell assembly, which may include thousands of cells in a highly interconnected network. This cell assembly hypothesis shifts the focus from a single cell to the complete network activity. In our current study, information from 81 cells in the monkey's motor cortex has been used to predict the monkey's arm movement directions. Our results show that these 81 cells have characterized the monkey's motor cortical activities quite well in a spiral arm movement. Other studies, such as synfire chains (Abeles 1991) and multicell correlation (Vaadia et al. 1991), have also revealed the significance of the amount of information provided by local measurements with a limited number of cells in the associative cortex. It is quite intriguing that
a small set of cells could represent the hypothesized cell assembly quite well in certain tasks. As mentioned in section 2, spike signals of the 81 cells were recorded in 81 independent experiments. Extracellular electrophysiological measurements are now being obtained through simultaneous recording of a few randomly selected cells. We plan to use the SOFM to extract movement information efficiently in real time from small groups of simultaneously recorded units.

Acknowledgments

This research was supported in part by the Whitaker Foundation. The first two authors' research is also supported in part by NSF under grant ECS-9552302. Work carried out by the third author while at Barrow Neurological Institute was supported by NIH R01 NS-26375.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). The relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11), 1527–1537.
Georgopoulos, A. P., Caminiti, R., Kalaska, J. F., Crutcher, M. D., & Massey, J. T. (1983). Spatial coding of movement: A hypothesis concerning the coding of movement direction by motor cortical populations. Exp. Brain Res. Suppl., 7, 327–336.
Georgopoulos, A. P., Kalaska, J. F., Crutcher, M. D., Caminiti, R., & Massey, J. T. (1984). The representation of movement direction in the motor cortex: Single cell and population studies. In G. M. Edelman, W. E. Goll, & W. M. Cowan (Eds.), Dynamic aspects of neocortical function (pp. 501–524). New York: Wiley.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in 3-D space. II. Coding of the direction of movement by a neuronal population. J. Neurosci., 8(8), 2928–2937.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Kohonen, T. (1988). The "neural" phonetic typewriter. Computer, 21(3), 11–22.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78, 1464–1480.
Lukashin, A. V., & Georgopoulos, A. P. (1994). A neural network for coding of trajectories by time series of neuronal population vectors. Neural Computation, 6, 19–28.
Nasrabadi, N. M., & Feng, Y. (1988). Vector quantization of images based upon the Kohonen self-organizing feature maps. IEEE International Conference on Neural Networks, San Diego, pp. 1101–1108.
Obermayer, K. H., Ritter, H., & Schulten, K. (1990). Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modeling. Parallel Computing, 14, 381–404.
Ritter, H., & Schulten, K. (1986). Topology conserving mappings for learning motor tasks. AIP Conference Proceedings, 151, 376–380.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rate. Journal of Computational Neuroscience, 1, 89–108.
Schwartz, A. B. (1992). Motor cortical activity during drawing movements: Single-unit activity during sinusoid tracing. J. Neurophysiology, 68, 528–541.
Schwartz, A. B. (1993). Motor cortical activity during drawing movements: Population representation during sinusoid tracing. J. Neurophysiology, 70, 28–36.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Vaadia, E., Ahissa, E., Bergman, H., & Lavner, Y. (1991). Correlated activity of neurons: A neural code for higher brain function. In J. Krüger (Ed.), Neural cooperativity (pp. 249–279). Berlin: Springer-Verlag.
Yair, E., & Zeger, K. G. (1992). Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing, 40, 294–309.
Received September 20, 1995; accepted May 14, 1996.
Communicated by David MacKay
Hyperparameter Selection for Self-Organizing Maps Akio Utsugi National Institute of Bioscience and Human Technology, Higashi, Tsukuba, Ibaraki 305, Japan
The self-organizing map (SOM) algorithm for finite data is derived as an approximate maximum a posteriori estimation algorithm for a gaussian mixture model with a gaussian smoothing prior, which is equivalent to a generalized deformable model (GDM). For this model, objective criteria for selecting hyperparameters are obtained on the basis of empirical Bayesian estimation and cross-validation, which are representative model selection methods. The properties of these criteria are compared by simulation experiments. These experiments show that the cross-validation methods favor more complex structures than is supported by the expected log likelihood, a measure of compatibility between a model and a data distribution. On the other hand, the empirical Bayesian methods have the opposite bias.

1 Introduction

Several standard learning methods for neural networks are being recast as estimation algorithms for stochastic models. Such statistical treatment of learning enables inference at a higher level than simple parameter estimation, such as the evaluation of model reliability and automatic model selection. For example, MacKay (1992) studied backpropagation learning in a unifying Bayesian framework and presented a selection method for hyperparameters and model structure. Among many learning methods, the self-organizing map (SOM) (Kohonen 1988, 1990) is unique from the viewpoint of data analysis because of its ability to extract topological structure hidden in data. However, since this learning was originally defined only at an algorithmic level, its development toward higher-level inference is difficult. Statistical models behaving in a manner similar to SOM have also been studied. The elastic net (Durbin et al. 1989), which is one of the generalized deformable models (GDM) (Yuille 1990), learns a topology-preserving map as maximum a posteriori (MAP) estimates of the parameters of a Bayesian stochastic model. Although this model was studied originally in the context of optimization problems such as the traveling salesman problem, it can be used for smoothing data along a specified topology. However, this requires the determination of two hyperparameters: the size of the noise and Neural Computation 9, 623–635 (1997)
© 1997 Massachusetts Institute of Technology
the smoothness of the route. The selection of these hyperparameters was attempted by an empirical Bayesian method similar to that of MacKay for backpropagation learning (Utsugi 1993). This article first reviews the hyperparameter selection method. Next, the SOM algorithm for finite data is derived as an approximate MAP estimation algorithm for GDM. Thus, SOM and GDM can be considered equivalent at the stochastic model level. Finally, cross-validation, another representative model selection method, is applied to hyperparameter selection for our model and compared with the empirical Bayesian method by simulation experiments.

2 Empirical Bayesian Hyperparameter Selection for GDM

2.1 Construction of GDM as Stochastic Model. Initially, we regard the simplest type of gaussian mixture model as a stochastic model for competitive learning (Nowlan 1990). For a data set X consisting of m-dimensional data points x_i = (x_{i1}, . . . , x_{im})′ (i = 1, . . . , n), the likelihood function of the model with r components is given by

f(X | w, β) = \prod_{i=1}^{n} \sum_{s=1}^{r} \frac{1}{r} f(x_i | w_s, β),  (2.1)

where

f(x_i | w_s, β) = \left( \frac{β}{2π} \right)^{m/2} \exp\left( -\frac{β}{2} \|x_i - w_s\|^2 \right)  (2.2)
is a component gaussian density with centroid w_s = (w_{s1}, . . . , w_{sm})′ and variance 1/β, and w = (w′_1, . . . , w′_r)′. Furthermore, we consider a gaussian smoothing prior to express smooth variation of the centroids along a topological space. Using a discretized differential operator matrix D on the topological space, the density of this smoothing prior is defined by

g(w | α) = \prod_{j=1}^{m} \left( \frac{α}{2π} \right)^{l/2} (\det{}^{+} D′D)^{1/2} \exp\left( -\frac{α}{2} \|Dw^{(j)}\|^2 \right),  (2.3)
where w^{(j)} = (w_{1j}, . . . , w_{rj})′, l = rank D′D, and det⁺ D′D denotes the product of the positive eigenvalues of D′D. The hyperparameter α represents the strength of the smoothness constraint. Although the elastic net uses the first-order differential operator on a one-dimensional closed-loop topology, we can use various kinds of differential operators and topologies.
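As an illustration, the sketch below builds a second-order difference operator on a line-segment topology (the operator spelled out later, in equation 3.9) and evaluates the log of the smoothing prior of equation 2.3. The helper names are ours, and l and det⁺ D′D are computed from the positive eigenvalues of D′D:

```python
import numpy as np

def second_diff_operator(r):
    # Second-order differences on a one-dimensional line-segment
    # topology (the operator written out in equation 3.9): rows [1, -2, 1].
    D = np.zeros((r - 2, r))
    for i in range(r - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def log_smoothing_prior(W, D, alpha):
    # log g(w | alpha) of equation 2.3. W: (r, m) centroids, column j
    # is w^(j); det+ D'D is the product of positive eigenvalues of D'D.
    r, m = W.shape
    ev = np.linalg.eigvalsh(D.T @ D)
    pos = ev[ev > 1e-10]
    l = len(pos)                        # l = rank D'D
    return (m * (0.5 * l * np.log(alpha / (2 * np.pi))
                 + 0.5 * np.sum(np.log(pos)))
            - 0.5 * alpha * np.sum((D @ W) ** 2))
```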
From the likelihood (cf. equation 2.1) and the prior (cf. equation 2.3), we obtain the log posterior by Bayes' theorem:

\log g(w | X, α, β) = \log f(X | w, β) + \log g(w | α) + const.
= \sum_{i=1}^{n} \log \sum_{s=1}^{r} \exp\left( -\frac{β}{2} \|x_i - w_s\|^2 \right) - \frac{α}{2} \sum_{j=1}^{m} \|Dw^{(j)}\|^2 + const.  (2.4)
This corresponds to the negative energy function of an elastic net, whose maximizer gives the MAP estimates of the centroids. The elastic net algorithm is a MAP estimation algorithm using the gradient ascent method. We can also use the expectation-maximization (EM) algorithm (Yuille et al. 1994; Utsugi 1994), which is explained in section 3.

2.2 Selection of Hyperparameters by Empirical Bayesian Method. Next, we obtain the marginal likelihood of the hyperparameters α and β:

f(X | α, β) = \int f(X | w, β) g(w | α) \, dw = \int f(w, X | α, β) \, dw ∝ \int g(w | X, α, β) \, dw.  (2.5)

This is also called the evidence of the hyperparameters. Although we desire to obtain the optimal values of the hyperparameters by maximizing the evidence, we have difficulty calculating the integral in equation 2.5 exactly. Here, we use a gaussian approximation (MacKay 1992), where the logarithm of the integrand is replaced by its quadratic approximation at the maximizer. In this case, using the MAP estimate ŵ and the negative Hesse matrix of the log posterior,

H(w) = -\frac{∂^2}{∂w \, ∂w′} \log g(w | X, α, β),  (2.6)

the integrand is approximated as

f(w, X | α, β) ≃ f(ŵ, X | α, β) \exp\left( -\frac{1}{2} w′ H(ŵ) w \right).  (2.7)
Then the evidence is approximated as

f(X, S_ŵ | α, β) = \int_{S_ŵ} f(w, X | α, β) \, dw ≃ (2π)^{rm/2} (\det H(ŵ))^{-1/2} f(ŵ, X | α, β),  (2.8)
where S_ŵ is a region dominated by ŵ in the parameter space. This evidence consists of the probability mass on S_ŵ only, and thus it should be called a local evidence. We use the local evidence to select the values of the hyperparameters, in the manner of MacKay's treatment of backpropagation learning. Now, the log evidence is calculated by

\log f(X, S_ŵ | α, β) ≃ \frac{nm}{2} \log β + \sum_{i=1}^{n} \log \sum_{s=1}^{r} \exp\left( -\frac{β}{2} \|x_i - ŵ_s\|^2 \right) + \frac{lm}{2} \log α - \frac{α}{2} \sum_{j=1}^{m} \|Dŵ^{(j)}\|^2 - \frac{1}{2} \log \det H(ŵ) + const.  (2.9)
The matrix H(ŵ) is the sum of the negative Hesse matrices of the log likelihood and the log prior, denoted by H_f(ŵ) and H_g, respectively. The matrix H_f(ŵ) consists of the submatrices

H_{st}(ŵ) = -\frac{∂^2}{∂w_s \, ∂w′_t} \log f(X | ŵ, β)
= \begin{cases} β^2 \sum_{i=1}^{n} p_{si}(p_{si} - 1)(x_i - ŵ_s)(x_i - ŵ_s)′ + β n_s I_m, & s = t \\ β^2 \sum_{i=1}^{n} p_{si} p_{ti} (x_i - ŵ_s)(x_i - ŵ_t)′, & s ≠ t \end{cases}
\qquad (s, t = 1, . . . , r),  (2.10)
where I_m is an identity matrix of size m. The quantity p_si is defined by

p_{si} = \frac{f(x_i | ŵ_s, β)}{\sum_{s′=1}^{r} f(x_i | ŵ_{s′}, β)},  (2.11)
which is called the fuzzy membership of the ith data point in the sth component, and n_s = \sum_{i=1}^{n} p_{si} is the estimated number of data points belonging to the sth component. The matrix H_g is given by
H_g = α D′D ⊗ I_m,  (2.12)
where ⊗ denotes the Kronecker product. By maximizing the evidence, we obtain estimates of α and β. Simulation experiments showed that this method gives good solutions (Utsugi 1993).¹

¹ This evidence is improved over the previously presented one, for which r was used instead of l in equation 2.9. Over a wide range of α, the two are almost the same. However, the old evidence grows without bound as α → ∞, while the new one stays finite.
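Putting equations 2.9 through 2.12 together, the approximate log evidence can be sketched as follows (Python/NumPy; our variable names; the additive constant of equation 2.9 is dropped):

```python
import numpy as np

def log_evidence(X, W, D, alpha, beta):
    # X: (n, m) data; W: (r, m) MAP centroids; D: operator matrix.
    n, m = X.shape
    r = W.shape[0]
    sq = ((X[None] - W[:, None]) ** 2).sum(-1)      # (r, n) squared distances
    a = -0.5 * beta * sq
    amax = a.max(0)
    lse = amax + np.log(np.exp(a - amax).sum(0))    # log-sum-exp over s
    p = np.exp(a - lse)                             # memberships, eq. 2.11
    ns = p.sum(1)
    H = np.kron(alpha * D.T @ D, np.eye(m))         # H_g, eq. 2.12
    for s in range(r):                              # H_f blocks, eq. 2.10
        Xs = X - W[s]
        for t in range(r):
            if s == t:
                B = beta**2 * (Xs * (p[s] * (p[s] - 1))[:, None]).T @ Xs \
                    + beta * ns[s] * np.eye(m)
            else:
                B = beta**2 * (Xs * (p[s] * p[t])[:, None]).T @ (X - W[t])
            H[s*m:(s+1)*m, t*m:(t+1)*m] += B
    l = int(np.sum(np.linalg.eigvalsh(D.T @ D) > 1e-10))   # l = rank D'D
    _, logdetH = np.linalg.slogdet(H)               # H positive definite at MAP
    return (0.5 * n * m * np.log(beta) + lse.sum()
            + 0.5 * l * m * np.log(alpha)
            - 0.5 * alpha * np.sum((D @ W) ** 2)
            - 0.5 * logdetH)
```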
3 Derivation of SOM Algorithm

In this section, the original SOM algorithm for finite data is derived as an approximate MAP estimation algorithm for the GDM. Initially, we show that the EM algorithm of the GDM is an alternate iteration of smoothing and soft classification. Now we define binary membership variables Y = (y_{si}), where y_{si} denotes the membership of the ith data point in the sth component. Regarding these variables as missing data, we obtain a complete likelihood,

f(X, Y | w, β) = \prod_{i=1}^{n} \prod_{s=1}^{r} \left( \frac{1}{r} f(x_i | w_s, β) \right)^{y_{si}},  (3.1)
which would be the exact likelihood if the missing data were observed. In the EM algorithm, the following function is maximized instead of the genuine log posterior (cf. equation 2.4):

Q(w) = E_Y(\log f(X, Y | w, β) | X, ŵ, β) + \log g(w | α),  (3.2)

where ŵ is a temporary estimate and E_Y( · | · ) denotes conditional expectation with respect to Y. The maximizer of Q is used as ŵ at the next step, and this procedure is iterated until convergence. By calculating Q, we obtain

Q(w) = -\frac{β}{2} \sum_{i=1}^{n} \sum_{s=1}^{r} p_{si} \|x_i - w_s\|^2 - \frac{α}{2} \sum_{j=1}^{m} \|Dw^{(j)}\|^2 + const.,  (3.3)

where p_si is the fuzzy membership (cf. equation 2.11) using the temporary estimate ŵ. This maximization of Q is equivalent to the independent minimizations of

H_j(w^{(j)}) = \sum_{s=1}^{r} n_s (x̄_{sj} - w_{sj})^2 + γ \|Dw^{(j)}\|^2,  (j = 1, . . . , m),  (3.4)
where x̄_{sj} is the mean of the data weighted by the fuzzy membership,

x̄_{sj} = \frac{1}{n_s} \sum_{i=1}^{n} p_{si} x_{ij},  (3.5)

and γ = α/β. Each function H_j can be regarded as a discretized Laplacian smoothing criterion (O'Sullivan 1991) for the observations {x̄_{sj} : s = 1, . . . , r} with confidence weights {n_s}, which are obtained through the soft competition process (cf. equation 2.11). A smooth curve minimizing this criterion is used as the next temporary estimate and is given by

w̃^{(j)} = K N x̄^{(j)},  (3.6)
where N is a diagonal matrix consisting of {n_s}, x̄^{(j)} = (x̄_{1j}, . . . , x̄_{rj})′, and

K = (N + γ D′D)^{-1}.  (3.7)
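A single EM iteration of the GDM, soft classification followed by discretized Laplacian smoothing (equations 3.5 through 3.7), can be sketched as follows; the generalized EM damping introduced later (equation 3.10) can be applied to the returned estimate:

```python
import numpy as np

def em_smoothing_step(X, W, D, alpha, beta):
    # One EM update of the GDM: soft classification (equation 2.11)
    # followed by discretized Laplacian smoothing (equations 3.5-3.7).
    # X: (n, m) data; W: (r, m) current centroid estimates.
    sq = ((X[None] - W[:, None]) ** 2).sum(-1)          # (r, n)
    a = -0.5 * beta * sq
    p = np.exp(a - a.max(0))
    p /= p.sum(0)                                       # fuzzy memberships
    ns = p.sum(1)                                       # confidence weights
    Xbar = (p @ X) / np.maximum(ns, 1e-12)[:, None]     # x_bar, eq. 3.5
    K = np.linalg.inv(np.diag(ns) + (alpha / beta) * D.T @ D)   # eq. 3.7
    return K @ (ns[:, None] * Xbar)                     # K N x_bar, eq. 3.6
```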
From the above explanation, the EM algorithm of the GDM can be regarded as an alternate iteration of discretized Laplacian smoothing and soft classification. On the other hand, Mulier and Cherkassky (1995) interpreted the batch SOM algorithm as an alternate iteration of kernel smoothing and hard classification. Using n_t and x̄_t = (x̄_{t1}, . . . , x̄_{tm})′, which are the number and the mean of the data points in the Voronoi region of the weight point of the tth inner unit, they expressed the weight update rule of SOM as

w̃_{sj} = \sum_{t=1}^{r} κ(s, t) n_t x̄_{tj} \Big/ \sum_{t=1}^{r} κ(s, t) n_t,  (j = 1, . . . , m),  (3.8)
where κ(s, t) is a kernel function, for example, a gaussian density function with respect to s − t for the normal one-dimensional topology. This is regarded as extended Nadaraya-Watson kernel smoothing (Härdle 1990) of the observations {x̄_{tj}} with confidence weights {n_t}. On this point, the differences between GDM and SOM can be summarized as follows: (1) soft classification versus hard classification, (2) discretized Laplacian smoothing versus kernel smoothing, and (3) in the original SOM, incremental learning is used rather than batch learning. Each method used in SOM can be regarded as an approximation of the associated method used in GDM, as explained below. The soft competition process (cf. equation 2.11) turns hard as β → ∞, and thus hard classification gives a good approximation to soft classification if β is large.² The discretized Laplacian smoothing can also be approximated by kernel smoothing. As shown in the appendix, a curve smoothed by discretized Laplacian smoothing (cf. equation 3.6) is expressed in the kernel smoothing form (cf. equation 3.8) using the entries of K as the values of the kernel function κ(s, t). In reality, this kernel function varies according to the variation of N, unlike in SOM. In many cases, however, N is nearly proportional to I_r at the last stage of learning, since the expectation of N is proportional to I_r. In particular, for the second-order differential operator on a one-dimensional
² Also, hard competition can be derived from a MAP estimation algorithm of GDM with a classification likelihood rather than the mixture likelihood (cf. equation 2.1). The classification likelihood has the same form as the complete likelihood (cf. equation 3.1), though Y is regarded as a parameter rather than missing data. In general, classification likelihood approaches lead to poorer results than mixture likelihood approaches (McLachlan & Basford 1988).
line-segment topology, whose entries are

d_{ij} = \begin{cases} -2, & |i - j + 1| = 0 \\ 1, & |i - j + 1| = 1 \\ 0, & \text{otherwise}, \end{cases} \qquad (i = 1, . . . , r - 2; \; j = 1, . . . , r),  (3.9)
we can obtain an explicit form of the kernel function using Silverman's (1984) equivalent kernel of spline smoothing (Utsugi 1994). This kernel function has a Mexican-hat shape with a width proportional to γ^{1/4}. Finally, we can also use a generalized EM algorithm (Dempster et al. 1977), where we use
(0 < c ≤ 1)
(3.10)
˜ . Using small c, we can make as the next temporary estimate instead of w the variation of the parameter in one leaning step arbitrarily small, where batch and incremental leaning are similar. 4 Comparison of Hyperparameter Selection Methods 4.1 Hyperparameter Selection by Cross-Validation. In section 2, an empirical Bayesian method for hyperparameter selection was studied. Another commonly used method for such problems is cross-validation. In this section, we apply cross-validation to our problem and compare the results with those of the empirical Bayesian method through simulation experiments. Generally the rationale of cross-validation is as follows. The expected log likelihood (ELL) by a true data distribution is a good measure of model adequacy. In reality, we cannot calculate this value because of the unknown true data distribution. Instead, we use a cross-validation score as an unbiased estimate of ELL. However, this estimated criterion has considerable variance for small data, and its maximizer is not an unbiased estimate for the peak position of ELL. In the next simulation, we will calculate ELL, in addition to cross-validation scores and evidence, to observe the bias and variance of these criteria. 4.2 Simulation Experiments. We apply a GDM with 20 components and the differential operator (cf. equation 3.9) to two types of artificial data sets: low and high noise condition (see Figure 1). By using the values of hyperparameters at rectangular grid points in the hyperparameter space, we obtain the graph of each criterion. The landscapes and contour maps in Figures 2 and 3 are obtained by averaging each criterion for 20 different data sets. The histograms of peak positions for individual data sets also are shown in the figures. Figure 4 illustrates the configurations of estimated centroid parameters under several values of (α, β) in the low-noise condition. From these figures, we can conclude the following.
Figure 1: Scatter plots for samples of the data sets. Artificial data x_i = (x_{i1}, x_{i2})′, (i = 1, . . . , 100) are generated from two independent standard gaussian random series {e_{i1}} and {e_{i2}} by x_{i1} = 4(i − 1)/n − 2 + σe_{i1}, x_{i2} = sin[2π(i − 1)/n] + σe_{i2}. Two noise conditions are used: low noise (σ = 0.3) and high noise (σ = 0.5). For each condition, 20 data sets are used in the simulation.
In the case of low noise (see Figure 2), the peak positions of all criteria are close to each other, except for a few peaks of the cross-validation scores. In the area where the majority of the peaks gather, the configurations of the centroid parameters have good appearances (see Figure 4, log_10 α = 2, log_10 β = 1). Thus, we can say that both methods succeed for the most part. However, the cross-validation scores have two peaks with much lower α than the other peaks, which lead to configurations that are too complicated. This is probably due to the flatness of the averaged landscape in its low-α area and the large variance of the cross-validation scores. Because of this instability, the cross-validation method is inferior to the other method in this case, though its averaged landscape is similar to that of ELL.

Figure 2: Facing page. Landscapes and contour maps of averaged criteria and peak-position histograms for 20 data sets in the low-noise condition. For each data set, the MAP estimates of the centroids are obtained by the EM algorithm of the GDM (r = 20). The values of the hyperparameters are taken from the grid points in the hyperparameter space. For each fixed β, α is decremented in the grid, and the MAP estimates at the preceding α are used as initial values of the centroids. The log evidence is calculated by equation 2.9. Cross-validation scores are obtained in the following manner. A data set is divided randomly into 10 groups of the same size. For each group, centroids are reestimated using the data in all but that group, where the MAP estimates from all data are used as initial values. For the reestimated centroids, the log likelihood is calculated using the data in the group. A cross-validation score is given as the mean of these log likelihoods. ELL is approximated by the mean log likelihood of the MAP estimates evaluated on 1000 newly generated data.
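The cross-validation score described in this caption can be sketched as follows; `fit_gdm`, standing for the EM loop of section 3, is assumed here rather than defined:

```python
import numpy as np

def cv_score(X, D, alpha, beta, n_folds=10, seed=0):
    # 10-fold cross-validation score as described in the Figure 2
    # caption.  `fit_gdm` (an EM loop built from em_smoothing_step)
    # is assumed, not defined in this sketch.
    n, m = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    scores = []
    for held in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held)
        W = fit_gdm(X[train], D, alpha, beta)       # reestimate centroids
        r = W.shape[0]
        sq = ((X[held][None] - W[:, None]) ** 2).sum(-1)
        # log likelihood of the held-out group under equation 2.1
        a = (-0.5 * beta * sq
             + 0.5 * m * np.log(beta / (2 * np.pi)) - np.log(r))
        amax = a.max(0)
        scores.append(np.sum(amax + np.log(np.exp(a - amax).sum(0))))
    return float(np.mean(scores))
```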
Figure 3: Landscapes and contour maps for averaged criteria and peak position histograms in the high-noise condition.
Figure 4: Samples of centroid configurations for several hyperparameter values in the low-noise condition.
In the case of high noise (see Figure 3), the discrepancies among the criteria increase. While cross-validation again has a bias toward low α, the evidence comes to have the opposite bias. In this case, we have difficulty choosing between the methods. However, the property that the evidence leads to the simplest model unless sufficient data for structure determination are given may be desirable, because it agrees with a general strategy of data analysis: linear models rather than nonlinear ones should be used for very noisy data.
5 Conclusion

We derived the SOM algorithm as an approximate MAP estimation algorithm for a GDM. Several methods to evaluate the quality of the hyperparameters for this model were developed on the basis of empirical Bayesian estimation and cross-validation. These methods were compared through simulation studies. It was found that the cross-validation methods favor complex structures, while the empirical Bayesian methods favor simple ones. Because of these properties and the long calculation time for cross-validation, the empirical Bayesian methods are recommended for this model.

Appendix

In this appendix, we consider the relation between kernel smoothing and discretized Laplacian smoothing. Initially, we show
K N 1_r = 1_r,  (A.1)

where 1_r is an r-dimensional column vector of all ones. In general, D 1_r = 0 since D is a differential operator. Thus, using equation 3.7,

(K N)^{-1} 1_r = (I_r + γ N^{-1} D′D) 1_r = 1_r.  (A.2)
This means that (K N)^{-1} has 1_r as an eigenvector with eigenvalue one; thus K N also has the same property. This leads to equation A.1. From equations 3.6 and A.1, a curve smoothed by discretized Laplacian smoothing is expressed in the kernel smoothing form (cf. equation 3.8) using the entries of K as the values of the kernel function κ(s, t).

References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., Ser. B, 39, 1–38.
Durbin, R., Szeliski, R., & Yuille, A. (1989). An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1, 348–358.
Härdle, W. (1990). Smoothing techniques with implementation in S. Berlin: Springer-Verlag.
Kohonen, T. (1988). Self-organization and associative memory (2nd ed.). Berlin: Springer-Verlag.
Kohonen, T. (1990). The self-organizing map. Proc. IEEE, 78, 1464–1480.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448–472.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7, 1141–1153.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In Advances in neural information processing systems 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
O'Sullivan, F. (1991). Discretized Laplacian smoothing by Fourier methods. J. Amer. Statist. Assoc., 86, 634–642.
Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel method. Ann. Statist., 12, 898–916.
Utsugi, A. (1993). A Bayesian model of topology-preserving map learning (in Japanese). Trans. IEICE D-II, J76-D-II, 1232–1239.
Utsugi, A. (1994). Lateral interaction in Bayesian self-organizing maps (in Japanese). Trans. IEICE D-II, J77-D-II, 1329–1336.
Yuille, A. L. (1990). Generalized deformable models, statistical physics, and matching problems. Neural Computation, 2, 1–24.
Yuille, A. L., Stolorz, P., & Utans, J. (1994). Statistical physics, mixture of distributions and the EM algorithm. Neural Computation, 6, 334–340.
Received November 7, 1995; accepted May 7, 1996.
Communicated by Javier Movellan
Supervised Networks That Self-Organize Class Outputs Ramesh R. Sarukkai University of Rochester, Rochester, NY 14627 USA
Supervised neural network learning algorithms have proved very successful at solving a variety of learning problems; however, they suffer from a common requirement for explicit output labels. In this article, it is shown that pattern classification can be achieved, in a multilayered, feedforward neural network, without requiring explicit output labels, by a process of supervised self-organization. The class projection is achieved by optimizing appropriate within-class uniformity and between-class discernibility criteria. The mapping function and the class labels are developed together iteratively using the derived self-organizing backpropagation algorithm. The ability of the self-organizing network to generalize on unseen data is also experimentally evaluated on real data sets and compares favorably with traditional labeled supervision of neural networks. In addition, interesting features emerge from the proposed self-organizing supervision that are absent in conventional approaches.
1 Motivation

Supervised learning in conventional multilayered neural networks can be formulated without explicitly specifying output label functions. The goal of supervised learning systems is to determine a function F(X) that "agrees" with the training examples of the form ⟨X_I, G(X_I)⟩. Usually the function F( ) takes on values from a discrete set of "classes": {C_0, . . . , C_K}. Many algorithms have been proposed for such a learning task: decision tree methods such as CART (Breiman et al. 1984) and multilayered feedforward networks (Rumelhart et al. 1987) trained by the error-backpropagation algorithm, to name just two. In practice, however, there is an intermediate mapping that is needed, namely, label assignments. Formally, distinct functions, L_{C_I}, are defined for each class C_I and are referred to as the labels. Thus, the learning system finds an approximate definition of the function F(X) such that, given training examples ⟨X_I, G(X_I)⟩, L(F(X_I)) = L(G(X_I)). The popular error-backpropagation algorithm (Rumelhart et al. 1987), for training multilayered, feedforward, artificial neural networks, has been formulated on the explicit assumption that an output labeling function L( ) is available. Neural Computation 9, 637–648 (1997)
© 1997 Massachusetts Institute of Technology
Biologically, output labels are available for certain, but not all, input-output mappings. All of the above schemes share a common deficiency in that the labeling function, L( ), has to be predetermined. Why not formulate self-organizing systems that develop their own output codes based on the supervisory signals? This article presents an approach whereby conventional multilayered feedforward networks, under supervision, self-organize their own output labels. In particular, I show that supervised learning with neural networks can be viewed as a process of dimensionality reduction with additional constraints: a within-class constraint specifying that the output labels (or projections to the output space) must be the same for inputs belonging to the same class, and a between-class constraint specifying that the output labels should be different for different classes. Thus, the role of supervision is only to suggest to the learning machine which inputs belong together and which do not.

2 Separability Enhancement by Nonlinear Mappings

The idea of viewing supervised learning as a process of class-separability-enhancing dimensionality reduction was first presented in Fisher's (1936) classic work, which was the origin of discriminant analysis. Class-enhancing projections using nonlinear mappings have been extensively researched in classical feature extraction theory. Typically, iterative techniques proceed by starting with random values and searching for "good" projections based on steepest-descent methods using criterion functions (Duda & Hart 1973; Fukunaga 1972; Young & Calvert 1974) such as monotonicity, stress, and continuity. The criterion functions are usually structure-preserving measures. The algorithms terminate when no further improvement can be obtained in the projections, and the mapping function is then usually found by a least-squares technique. In the neural network literature (Rumelhart et al. 1987; Hertz et al. 1991), supervised learning is usually implicitly tied to the output labeling function. Perhaps this is due to the ease of formulating gradient weight adaptation equations (i.e., the error is related to teaching output − actual output). Independent of my work, discriminant analysis neural networks have been studied (Kuhnel & Tavan 1991; Mao & Jain 1993); however, their formulation is based on principal component analysis (PCA) networks and cannot be generalized to multilayered feedforward networks. The learning vector quantization (LVQ) algorithms proposed by Kohonen (Hertz et al. 1991) can also be grouped with the class of discriminant analysis networks under discussion. Although supervision is used, the goal of LVQ is to find appropriate cluster centers in the input space rather than class-separability-enhancing dimensionality reduction. In this article, a different method of achieving supervised nonlinear projection is implemented with a conventional multilayered feedforward network. Unlike existing iterative approaches to nonlinear class separability
enhancement, the presented implementation of supervised projection using neural networks enables the simultaneous development of the whole nonlinear mapping function along with the class labels.

3 Supervised Projection Using Neural Networks

In this section, the learning algorithm that has been implemented is discussed briefly. For the sake of clarity, simple functions were chosen, although the basic idea can be extended to more complex and general functions. Let O_{pj} be the activation state of the jth output after the presentation of pattern p. Let Ω_k be the set of inputs belonging to class k. E_{p_1,j} refers to the error with respect to pattern p_1 for output unit j. The within-class criterion used is: ∀Ω_k, k = 1, . . . , n, ∀p_1 ∈ Ω_k, minimize

E_{p_1,j} = \frac{1}{2} \sum_{∀p_2 ∈ Ω_k} [O_{p_1 j} - O_{p_2 j}]^2.  (3.1)
Note that the within-class criterion is minimized when the output responses for patterns belonging to the same class are identical, that is, when O_{p_1 j} = O_{p_2 j}. The between-class criterion is defined to be: ∀Ω_k, k = 1, . . . , n, ∀p_1 ∈ Ω_k, minimize

E_{p_1,j} = \sum_{∀p_2 ∉ Ω_k} 1 - [O_{p_1 j}(1 - O_{p_2 j}) + O_{p_2 j}(1 - O_{p_1 j})].  (3.2)
Note that the between-class criterion is minimized when the output activations for patterns belonging to different classes are different, that is, when O_{p_1 j} ≠ O_{p_2 j}. In a previous paper (Sarukkai 1994), I derived the following learning rule:

ΔW_{ij} = \sum_{∀k, p_1 ∈ Ω_k} \frac{∂O_{p_1 j}}{∂W_{ij}} \left[ η_1 |Ω_k| (\bar{O}_{Ω_k j} - O_{p_1 j}) + η_2 |\bar{Ω}_k| (1 - 2\bar{O}_{\bar{Ω}_k j}) \right],  (3.3)
where η_1 and η_2 are learning rates.¹ \bar{O}_{\bar{Ω}_k j} is the thresholded, average output activation of unit j over all patterns not belonging to class Ω_k, and |\bar{Ω}_k| refers to the cardinality of the set \bar{Ω}_k. \bar{O}_{Ω_k j} is the thresholded, average activation of unit j for class Ω_k, and |Ω_k| refers to the cardinality of the set Ω_k. The thresholded average activation is the threshold function applied to the average of the activations of a unit over the input patterns belonging to the appropriate subset of the training data. The advantage of my formulation is that gradient weight adaptation rules can be easily implemented in a neural network backpropagation context. The derived weight adaptation rule has some intuitive explanations. In the first criterion, it can be seen that the average activation of a unit for its class acts as the teaching signal. From the second criterion, it may be seen that if the average activation of a unit for "other classes" is greater than 0.5, then the term [1 - 2\bar{O}_{\bar{Ω}_k j}] is negative (equivalent to a teaching signal of 0 for the "current" class), and vice versa. Thus, the equations express the two criteria lucidly. The cardinality terms may be viewed as weighting constants related to the training set size. In some cases, the within-class criterion and the between-class criterion are "opposing" quantities: if the class means for two classes are equal, then the effect of the between-class discernibility function is cancelled out by the within-class uniformity criterion. This problem can be alleviated by separating the training procedure into two phases, as shown in Table 1. The up arrow indicates that the value of the corresponding learning parameter is effectively higher, and the down arrow indicates that it is effectively lower during that learning phase. In the first phase, the means of the outputs for inputs belonging to the same class are computed. These class means are moved away from each other by applying the class discernibility phase of the algorithm until the thresholded class means are orthogonal to each other. In practice, we find convergence. At this point, the within-class uniformity phase begins, setting a very low value for η_2 (typically zero, although a nonzero value ensures that the orthogonality of the output class labels is not immediately lost) and a higher value for η_1. Ideally, the system is allowed to switch back and forth between the two phases until orthogonal output class means with very low intraclass distances are achieved. Thus, both the functional mapping and the output class labels are developed iteratively at the same time. Since average activation values for the patterns are required, an additional batch forward pass may be required, although the averages computed from the previous pass can be used. In practice, it is better to do pairwise class discriminative training.
1 Without loss of generality, the activations are assumed to be bounded (0,1) in this particular formulation.
Table 1: Overview of the Algorithm

Algorithm SOS: Self-organizing supervision with neural nets

Class discernibility phase: Push between-class output cluster means away from each other until the thresholded means for different classes are orthogonal. (η2 ↑, η1 ↓)

Class uniformity phase: Move within-class outputs toward the class cluster means. (η2 ↓, η1 ↑)
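To make the update concrete, the following is a minimal sketch of one batch application of the rule in equation 3.3 for a single-layer sigmoid network. The single-layer restriction and all function names are my own simplifications: the paper backpropagates the same teaching terms through a hidden layer, and its experiments fold the cardinality terms into the learning rates (see the appendix).

```python
import numpy as np

def sigmoid(a):
    # activation used in the paper's experiments: 1 / (1 + exp(-0.667 a))
    return 1.0 / (1.0 + np.exp(-0.667 * a))

def sos_batch_update(W, X, labels, eta1, eta2):
    """One batch step of the SOS rule (eq. 3.3), single-layer case.
    W: (n_out, n_in) weights; X: (n_patterns, n_in) inputs; labels: class ids."""
    O = sigmoid(X @ W.T)                       # O[p, j]: activation of output j for pattern p
    dW = np.zeros_like(W)
    for k in np.unique(labels):
        in_k = labels == k
        # thresholded class mean and thresholded complement-class mean
        mean_in = (O[in_k].mean(axis=0) > 0.5).astype(float)
        mean_out = (O[~in_k].mean(axis=0) > 0.5).astype(float)
        teach = (eta1 * in_k.sum() * (mean_in - O[in_k])           # uniformity term
                 + eta2 * (~in_k).sum() * (1.0 - 2.0 * mean_out))  # discernibility term
        dO = 0.667 * O[in_k] * (1.0 - O[in_k])                     # sigmoid derivative
        dW += (teach * dO).T @ X[in_k]
    return W + dW
```

In this sketch the class discernibility phase of Table 1 corresponds to calling the update with η1 near zero and η2 positive, and the class uniformity phase to the reverse.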
Thus, the learning algorithm performs gradient descent on the sum of the between-class discernibility and within-class uniformity criteria; however, only one criterion has a dominant effect on learning at any given time. It can be proved analytically that maximizing the interclass distance defines an MSE (mean squared error) linear discriminant function (Young & Calvert 1974). Thus, convergence can be guaranteed when the self-coding algorithm is applied to a linearly separable two-class problem using a single-layer perceptron. In other cases, once orthogonal class means are achieved and the self-organized codes stabilize, the method is equivalent to learning with fixed output labels; however, since the initial configuration of the system determines the output codes developed, it would be interesting to study whether the proposed scheme is less prone to local minima than traditional labeled supervision. The computational complexity of this batch algorithm is the same as that of traditional labeled algorithms if the required mean activations are computed from a previous epoch. For the experiments presented, orthogonality was generally achieved within a few hundred epochs, and convergence within a few thousand epochs.

4 Experiment I: Iris Data

The idea was to test the classification of the three-class iris data set using the self-organization process detailed earlier. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two, but the latter are not linearly separable from each other. The input attributes are sepal length, sepal width, petal length, and petal width. The network consisted of four inputs, four hidden units, and three output units. Sixty trials were performed with random initial conditions (weight range [−3.7, 3.7]). Training proceeded until the total squared error was less than 1.0 (one trial failed to converge). In the 59 successful trials, the supervised self-organization scheme achieved an average classification rate of 98.96% (maximum, 99.33%; minimum, 98.67%).

Sammon's stress (for PCA, LDA, and NDA, results are from Mao & Jain 1993) measures how well the projection preserves the interpattern distances
Table 2: Sammon's Stress Obtained for the Iris Data Set

Sammon's stress    PCA      LDA      NDA      Self-organized
Average            0.147    0.519    1.513    0.510
Sammon's stress (Sammon 1969) is defined as

$$E = \frac{1}{\sum_{\mu=1}^{n-1} \sum_{\nu=\mu+1}^{n} d^*(\mu, \nu)} \; \sum_{\mu=1}^{n-1} \sum_{\nu=\mu+1}^{n} \frac{\left[ d^*(\mu, \nu) - d(\mu, \nu) \right]^2}{d^*(\mu, \nu)}, \qquad (4.1)$$

where d*(µ, ν) and d(µ, ν) are the (Euclidean) distances between pattern µ and pattern ν in the input space and in the projected space, respectively.
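As a quick check of equation 4.1, here is a minimal sketch of the stress computation in NumPy/SciPy; the function name is my own, and the guard against coincident input patterns is my addition:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_in, X_out):
    """Sammon's stress (eq. 4.1) between input patterns X_in (n x d_in)
    and their projections X_out (n x d_out)."""
    d_star = pdist(X_in)           # pairwise Euclidean distances, input space
    d = pdist(X_out)               # pairwise distances, projected space
    ok = d_star > 0                # skip coincident input pairs (my addition)
    return np.sum((d_star[ok] - d[ok]) ** 2 / d_star[ok]) / d_star.sum()
```

For Table 2, X_out would be the output activations of the trained network for the iris patterns.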
It can be seen from Table 2 that good values of Sammon's stress are obtained by the nonlinear self-organized projections, suggesting that the distances between patterns in the input space are captured to some degree (comparable to an LDA network) in the self-organized output class projections.

5 Experiment II: Multispeaker Syllable Classification

In this experiment, described in more detail in Sarukkai (1994), four syllables were used: ba, pa, da, and ga. The total number of speaker-dependent speech tokens was 240; half were used for training and the other half for testing. Eighty speech tokens from two speakers not represented in the training set constituted the new-speaker test set. A total of 24 Mel-scale filter bank coefficients were computed over every 20 ms window (window offset, 10 ms), giving a total of 192 parameters, which were then linearly normalized. The network architecture consisted of 192 input units, one hidden layer of 50 units, and 6 output units. Although the data used in experiments II and III are of fixed length, the algorithm can be applied in conjunction with dynamic input-length architectures such as time-delay neural networks (TDNNs). The benchmark neural net was trained with 4-bit (one per class) and 2-bit (encoded) output codings. Euclidean distance was used for classification. Tables 3 and 4 show the results for the multispeaker syllable task on the new-speaker and trained-speaker test sets (13 trials of supervised self-organization; 15 trials of labeled supervision).
2 LDA: linear discriminant analysis; NDA: nonlinear discriminant analysis; PCA: principal component analysis.
Table 3: Multispeaker Syllable Recognition Results (in %, with Standard Deviations)

Method                            New speakers     Trained speakers
Self-organizing backprop          67.69 ± 7.11     77.50 ± 2.75
Normal backprop (2-bit output)    65.25 ± 7.34     77.11 ± 4.25
Normal backprop (4-bit output)    69.17 ± 6.14     83.17 ± 3.19
Table 4: Some Output Labels Produced by the Class Separability Scheme (Experiment II)

Token    1         2         3
ba       100100    010101    011110
pa       100110    110100    110110
da       010001    000011    000001
ga       010011    100011    100001

Note: The italics accentuate the differences in the output values.
6 Experiment III: Multimodal Data

The database for this data set, described in more detail in deSa (1994), consists of auditory and visual features recorded from many speakers repeating one of the syllables ba, va, da, ga, or wa. Multimodal data obtained from four speakers were used. The acoustic data were low-pass filtered and segmented automatically (using the time-domain wave magnitude program available in the ESPS software from Entropic Research Laboratory). To ensure that all the consonantal information was retained, 50 ms from before and after the detected utterance was included in the segment. These utterances were then encoded using a 24-channel mel code over 20 ms windows overlapped by 10 ms, giving a 216-dimensional auditory code for each utterance. The visual frames were digitized as 64 × 64 8-bit gray-level images using the Datacube MaxVideo system (see Figures 1 and 2). The segmentation obtained from the acoustic signal was used to segment the video, from six frames before the acoustically determined utterance onset to four frames after. The normal flow was
Figure 1: Example of da utterance.
Figure 2: Example of ga utterance.
Figure 3: Network architecture used for multimodal experiments.
computed using differential techniques between successive frames. Finally, using averaging techniques, five frames of motion over the 64 × 64–pixel grid were obtained. The frames were then divided into 25 equal areas (5 × 5), and the motion magnitudes within each frame were averaged within each area. Thus, a final visual feature vector of dimension 125 (5 frames × 25 areas) was obtained. Each input pattern consisted of the concatenated 216-dimensional auditory channel vector and the 125-dimensional visual parameter vector. The total number of patterns was 500. They were divided arbitrarily into 350 training samples and 150 test patterns, such that all speakers were represented in the training data. The network architecture used for this problem is shown in Figure 3. The benchmark was to train the network with two-per-class output coding. The results tabulated (see Table 5) are from 39 trials of backpropagation with 2-per-class coding and 42 trials of the self-organizing coding scheme. The
Table 5: Test Set Results on the Multimodal Data Using Different Methods (%)

VQ             LVQ2.1         M-D            BP with output coding    Self-organize
A: 68, V: 36   A: 97, V: 60   A: 92, V: 58   B: 93.98 ± 1.02          B: 91.97 ± 1.99

Note: A: auditory data only. V: visual data only. B: both modalities used.
results of the self-organizing scheme compare favorably with the labeled supervision scheme. The mean results for the same data set using Kohonen's supervised LVQ, unsupervised VQ, and the M-D algorithm are cited from deSa (1994). For the LVQ and M-D tests, 30 codebook vectors were used for the auditory data and 60 for the visual data. LVQ2.1 (which is supervised) has the best performance on the test sets, mainly because 30/60 codebook vectors seem sufficient to capture the 350 training samples well.

In fact, the self-organization process goes further than combining features from both modalities: the closeness of patterns in the input space is reflected in the output encodings that the supervised self-organization scheme develops. Figure 4 shows the statistics of the interclass Hamming distances for the five classes, as measured by the codes developed by the self-organizing network over 42 trials. The closest output coding (based on interclass Hamming distance) is (da, ga). This is interesting, since experiments (Miller 1955) indicate that the syllables da and ga are often confused perceptually. Da and ga are also visually extremely alike (see Figures 1 and 2). Note that although da, ba, and ga are all voiced stops, ba can be distinguished from the other two easily from the visual modality; in general, the place of articulation is easy to see but is the hardest feature to hear. The pair with the next lowest interclass Hamming distance is (va, ba). Va is a voiced front fricative and is visually similar to ba. Although the production of ba requires that the lips be closed before the voice release, this visual cue is absent in the motion flow parameters that are computed from the visual images and used as inputs for the visual modality. The point of Figure 4 is to suggest that the self-organized codes are not totally random and seem to be biased by factors such as interpattern relationships in the input space.

7 Conclusions and Future Directions

Supervised learning in conventional multilayered neural networks can be formulated without explicitly specifying output label functions. Learning rules were defined on the basis of between-class and within-class error criteria. The criteria could be extended to incorporate more general constraints, such as error-correcting output labels. The proposed scheme has
Figure 4: Average interclass Hamming distances with standard deviations for the various class pairs developed by the self-organizing network.
been experimentally tested on various data sets. Interesting features emerge in the output representations of such self-organizing networks: some of the "intrinsic" or "natural" distances are somewhat preserved in the output projections. As a proof of concept, it has been shown that networks can self-organize in a supervised manner and develop their own interesting output labels.

Appendix: Details of Experiments

The cardinality terms were factored into the learning rates. α is the momentum factor. The activation function used is 1/(1 + e^{−0.667x}). The sigmoid prime offset was 0.1. Weights were initialized in [−3.7, 3.7].

I. Iris Experiment
Number of classes: 3
Network architecture: 4 inputs, 4 hidden, 3 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.01; α = 0.00
Class uniformity phase parameters: η1 = 0.20; η2 = 0.00; α = 0.8

II. Multispeaker Syllable Experiment
Number of classes: 4
Network architecture: 192 inputs, 50 hidden, 6 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.09; α = 0.1
Class uniformity phase parameters: η1 = 0.08; η2 = 0.02; α = 0.6
Benchmark parameters: η = 0.1; α = 0.1 initially; α = 0.6 after 200 epochs

III. Multimodal Data Experiment
Number of classes: 5
Network architecture: 216 auditory plus 125 visual inputs, 40 hidden, 10 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.01; α = 0.00
Class uniformity phase parameters: η1 = 0.20; η2 = 0.00; α = 0.8
Benchmark parameters: η = 0.20; α = 0.8

Acknowledgments

I thank Dana H. Ballard for his valuable comments on this work. I am also grateful to Anil K. Jain for his important suggestions and pointers to related work. Also, thanks to Virginia deSa for providing the multimodal data, and to the anonymous reviewers for their constructive criticism and insightful comments. This material is based on work supported by the National Science Foundation under grant IRI-8903582. The government has certain rights in this material.

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 178–188.
Fukunaga, K. (1972). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Kuhnel, H., & Tavan, P. (1991). A network for discriminant analysis. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks. New York: Elsevier Science.
Mao, J., & Jain, A. K. (1993). Discriminant analysis neural networks. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco (Vol. 1, pp. 300–305).
Miller, R. G. (1955). Perceptual confusions among consonants. Journal of the Acoustical Society of America, 27, 338–352.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1987). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Sammon, J. R., Jr. (1969). A non-linear mapping for data structure analysis. IEEE Trans. Comput., C-18, 401–409.
Sarukkai, R. (1994). Supervised learning without output labels (Tech. Rep. No. 510). Rochester, NY: University of Rochester, Department of Computer Science.
deSa, V. (1994). Unsupervised classification learning from cross-modal environmental structure (Doctoral dissertation, URCS TR#536). Rochester, NY: University of Rochester.
Young, T. Y., & Calvert, T. W. (1974). Classification, estimation, and pattern recognition. New York: American Elsevier.
Received November 14, 1994; accepted May 6, 1996.
Communicated by Lawrence Abbott
How Well Can We Estimate the Information Carried in Neuronal Responses from Limited Samples?

David Golomb
Zlotowski Center for Neuroscience and Department of Physiology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
John Hertz
Nordita, Copenhagen, Denmark

Stefano Panzeri
Department of Experimental Psychology, Oxford University, Oxford OX1 3UD, U.K.

Alessandro Treves
SISSA, Biophysics and Cognitive Neuroscience, Trieste, Italy

Barry Richmond
Laboratory of Neuropsychology, NIMH, NIH, Bethesda, MD, USA
It is difficult to extract the information carried by neuronal responses about a set of stimuli because limited data samples result in biased estimates. Recently two improved procedures have been developed to calculate information from experimental results: a binning-and-correcting procedure and a neural network procedure. We have used data produced from a model of the spatiotemporal receptive fields of parvocellular and magnocellular lateral geniculate neurons to study the performance of these methods as a function of the number of trials used. Both procedures yield accurate results for one-dimensional neuronal codes. They can also be used to produce a reasonable estimate of the extra information in a three-dimensional code, in this instance within 0.05–0.1 bit of the asymptotically calculated value (about 10% of the total transmitted information). We believe that this performance is much more accurate than that of previous procedures.

1 Introduction

Quantifying the relation between neuronal responses and the events that have elicited them is important for understanding the brain. One way to do this in sensory systems is to treat a neuron as a communication channel (Cover & Thomas 1991) and to measure the information conveyed by the
neuronal response about a set of stimuli presented to the animal. In such experiments (e.g., Gawne & Richmond 1993; McClurkin et al. 1991; Tovée et al. 1993), a set of S sensory (e.g., visual) stimuli is presented to the animal, each stimulus being presented for Ns trials. After the neuronal response is quantified in one of several ways (e.g., the number of spikes in a certain time interval or a descriptor of the temporal course of the spike train), the transmitted information (mutual information between stimuli and responses) is estimated. This approach is useful for investigating issues such as the resolution of spike timing (Heller et al. 1995), the effectiveness of encoding for stimulus sets (Optican & Richmond 1987; McClurkin et al. 1991; Rolls et al. 1996a, 1996b), and the relations between responses of different neurons (Gawne & Richmond 1993).

The equation for transmitted information can be written in several ways, including:

$$I(S; R) = \int dr \sum_s P(s, r) \log_2 \left[ \frac{P(s, r)}{P(s)\,P(r)} \right] = \left\langle \sum_s P(s \mid r) \log_2 \left[ \frac{P(s \mid r)}{P(s)} \right] \right\rangle_r. \qquad (1.1)$$

Here P(s, r) is the joint probability of stimulus s and response r, P(s | r) is the conditional probability of stimulus s given response r, and P(s) and P(r) are the unconditional stimulus and response probabilities, respectively. The stimulus index s is discrete, as there is a finite number of stimuli. The response measure r can be either discrete or continuous, depending on the way the response is quantified. The notation ⟨· · ·⟩_r in the second form indicates an average over the (unconditional) response distribution. Computing I(S; R) requires estimating these joint or conditional probabilities and carrying out the appropriate integration over response space or the average over the response distribution.

A large number of trials is needed for accurate information calculations. When the number of data samples is small, there are systematic errors in estimating the transmitted information by direct application of the formal definition, that is, by binning responses and estimating the relevant probabilities as the empirical bin frequencies (Miller 1955; Carlton 1969; Optican et al. 1991; Treves & Panzeri 1995; Abbott et al. 1996; Panzeri & Treves 1996). With a small number of data samples, the information is overestimated: it is biased upward. Intuitively, this can be understood by noting that the number of bins can exceed the number of data samples, especially as the dimensionality of the response increases. The data samples must be evenly distributed across the bins to yield zero information, a circumstance that cannot occur when the number of bins exceeds the number of data samples. On the other hand, if the number of bins is too small, responses that should be reliably distinguishable will not be, leading to information values
that underestimate the true values. These effects are magnified when the dimensionality of the response increases, and thus even more samples are needed to make accurate information estimates when the response dimensionality is larger than one. In practice, the size of data sets is limited by experimental constraints. Thus it is important to try to correct for these systematic errors and to design experiments so that a sufficient number of data samples is obtained to answer reliably the questions posed. Only then can we make effective use of mutual information measurements.

Here, we test procedures designed to overcome the systematic errors that arise with limited data sets by comparing mutual information calculations from large data sets with those calculated from smaller ones. Currently the only way to get such large data sets is through simulations. Artificial spike trains are created using a model of the response of lateral geniculate nucleus (LGN) neurons (Golomb et al. 1994). The model is based on experimentally measured spatiotemporal receptive fields of LGN neurons (Reid & Shapley 1992). Two procedures are tested here. The first is a binning-and-correcting procedure: the information is calculated by direct application of the first definition in equation 1.1, and a correction value is estimated and subtracted; this method attacks the bias problem (Panzeri & Treves 1996; Treves & Panzeri 1995). The second method uses a neural network to estimate the conditional probabilities in the second definition in equation 1.1 directly (Heller et al. 1995; Hertz et al. 1992; Kjaer et al. 1994). We find that both methods yield accurate results for one-dimensional codes, even for a relatively small number of samples. Moreover, estimates of the extra information carried in three-dimensional codes are also reasonable, within 0.05–0.1 bit (about 10%) of the correct values.

2 Methods

2.1 Producing Simulated Data. The spike trains are created using a model of the response of parvocellular and magnocellular LGN cells, as described in Golomb et al. (1994). In brief, the spatiotemporal receptive fields R(r⃗, t) of the two cell types in response to an impulse in space and time were measured (Reid & Shapley 1992; data presented in Figure 1 of Golomb et al. 1994). The set of stimuli {σ_s}, s = 1, . . . , S (= 32), consists of 4 × 4 flashed Walsh figures in space and their contrast reverses,

$$\sigma_s(\vec{r}, t) = u_s(\vec{r})\,\Theta(t), \qquad (2.1)$$
where u_s(r⃗) are the spatial Walsh figures and Θ(t) is the Heaviside function (Θ(t) = 1 if t > 0 and 0 otherwise). The ensemble-average response Z_s(t) to the sth figure is calculated by centering the figure on the receptive field center, convolving it with the spatiotemporal receptive field, adding the constant
baseline Z_0 corresponding to the spontaneous firing, and rectifying at zero response:

$$Z_s(t) = \Theta\!\left[ Z_0 + \int d\vec{r} \int_{-\infty}^{t} dt'\, R(\vec{r},\, t - t')\, \sigma_s(\vec{r},\, t') \right]. \qquad (2.2)$$
The response Z_s(t) for the Walsh figures is shown in Figure 5 of Golomb et al. (1994). Realizations of spike trains are created at random with inhomogeneous Poisson statistics, using the average response as the instantaneous rate. The probability density of obtaining a spike train Λ_s(t), with k spikes at times t_1, . . . , t_k during a measurement time T, is

$$P(\Lambda_s(t) \mid \sigma_s) = P(t_1 \ldots t_k \mid \sigma_s(\vec{r}, t)) = \frac{1}{k!} \left( \prod_{i=1}^{k} Z_s(t_i) \right) \exp\left( -\int_0^T Z_s(t')\, dt' \right). \qquad (2.3)$$
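The paper does not spell out its sampling algorithm, but one standard way to draw realizations from the inhomogeneous Poisson density of equation 2.3 is thinning (Lewis–Shedler rejection). The sketch below, with names of my own choosing, assumes the rate Z_s(t) is bounded above by rate_max on [0, T]:

```python
import numpy as np

def sample_spike_train(rate_fn, T, rate_max, rng):
    """One inhomogeneous Poisson realization on [0, T] by thinning.
    rate_fn(t) is the instantaneous rate Z_s(t), bounded above by rate_max."""
    t, spikes = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_max)      # candidate event from a homogeneous process
        if t > T:
            return np.array(spikes)
        if rng.uniform() < rate_fn(t) / rate_max:
            spikes.append(t)                      # keep with probability Z_s(t) / rate_max
```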
A set of 1024 simulated responses for each of the 32 stimuli is used for testing the information calculation procedures. The asymptotic estimate of transmitted information is calculated using 10^6 trials per stimulus.

2.2 Response Representation. The neuronal response to a stimulus, as represented by the spike train, is quantified by several variables. One is the number of spikes (NOS) in the response time interval, taken here to be 250 ms. The others are the projections of the spike train onto the first n principal components (PCs) (Richmond & Optican 1987; Golomb et al. 1994). We concentrate here on the first principal component (PC1) and the first three principal components (PC123). The four-dimensional code composed of the first three principal components and the number of spikes (PC123s) is also considered.

2.3 Information Estimation. We briefly describe here some of the methods that can be used for estimating information from neuronal responses. We simulate an experiment in which a set of S stimuli is presented at random. Each stimulus is shown Ns times; here Ns is the same for all the stimuli. The total number of visual stimuli presented is N = S Ns.

2.3.1 Summation over the Poisson Distribution. For each stimulus here, the number of spikes NOS is Poisson distributed with mean $\overline{NOS} = \int_0^T Z_s(t)\,dt$. The asymptotic value of the transmitted information carried by the NOS can be calculated directly by summing over the distribution.
Equation 1.1 becomes

$$I(S; NOS) = -\sum_{NOS} P(NOS) \log_2 P(NOS) + \frac{1}{S} \sum_{s} \sum_{NOS} P(NOS \mid s) \log_2 P(NOS \mid s). \qquad (2.4)$$
This sum is discrete and is calculated using the Poisson probability distribution P(NOS | s). The sum over NOS from 1 to ∞ is replaced by a sum from 1 to NOS_max = 36; taking a higher NOS_max has only a negligible effect on the result. Using this method, the mutual information can be calculated exactly, but only when the firing rate distribution is known. In real experiments, the firing rate distribution is unknown, and therefore the methods described below must be used.

2.3.2 Straightforward Binning (Golomb et al. 1994). The principal components used here were calculated from the covariance matrix C(t, t') formed over all responses in the set under study,

$$C(t, t') = \frac{1}{S N_s} \sum_{s=1}^{S} \sum_{\mu=1}^{N_s} \left[ \Lambda_{s,\mu}(t) - \bar{\Lambda}(t) \right] \left[ \Lambda_{s,\mu}(t') - \bar{\Lambda}(t') \right], \qquad (2.5)$$
where Λ_{s,µ}(t) is the µth realization of the response to the sth stimulus and Λ̄(t) is the average response over all the stimuli and realizations,

$$\bar{\Lambda}(t) = \frac{1}{S N_s} \sum_{s=1}^{S} \sum_{\mu=1}^{N_s} \Lambda_{s,\mu}(t). \qquad (2.6)$$
The eigenvalues of the matrix C are labeled in decreasing order; the corresponding eigenvectors are Φ_1(t), Φ_2(t), . . . . The expansion coefficients of the neuronal response Λ_{s,µ}(t) are given by

$$a_{s,\mu,m} = \frac{1}{T} \int_0^T dt\, \Lambda_{s,\mu}(t)\, \Phi_m(t). \qquad (2.7)$$
Each response is then quantified by its coefficients on the first n principal components, and these are used as the response representation. The number n of coefficients used to quantify the response is referred to here as the code dimension. The maximal and minimal values of each component are found, and the interval between the minimum and the maximum of the mth component is divided into R(m) bins. The mutual information is calculated from the discrete distribution obtained. The n-dimensional response space is therefore divided into $R = \prod_{m=1}^{n} R(m)$ n-dimensional bins.
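The resulting raw ("plug-in") estimate can be written in a few lines. This sketch, with names of my own choosing, takes precomputed bin indices and applies the first form of equation 1.1 to the empirical bin frequencies:

```python
import numpy as np

def plugin_mutual_information(stim_ids, bin_ids, S, R):
    """Raw binned estimate of I(S;R): stim_ids[t], bin_ids[t] give the
    stimulus and response bin of trial t; S stimuli, R response bins."""
    joint = np.zeros((S, R))
    np.add.at(joint, (stim_ids, bin_ids), 1.0)   # count table
    joint /= joint.sum()                         # empirical P(s, r)
    ps = joint.sum(axis=1, keepdims=True)        # P(s)
    pr = joint.sum(axis=0, keepdims=True)        # P(r)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (ps * pr)[nz]))
```

As discussed above, this raw estimate is biased upward for small Ns, which is what the correction of Section 2.3.3 addresses.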
For PC123, we choose R(1) = 36, R(2) = 20, and R(3) = 20. The mutual information carried by the first principal component only, PC1, is calculated in a similar way with R(1) = 36. Note that binning is a simple form of regularization, and some kind of regularization is always needed when response measurements span a continuous range (the number of spikes is discrete). Regularization results in a downward bias of the calculated information value (see section 1). For PC1 and our simulated data set, using R(1) = 36 results in underestimating the mutual information by ∼0.01 bit in comparison to R(1) = 300.

2.3.3 Binning with Finite Sampling Correction. The binning method is improved by doing the following:
1. For each dimension, equipopulated bins are used. For a one-dimensional code (NOS, PC1), the bin size varies across the response dimension, with nonequal spacing, so that each bin receives on average the same number of counts. For a three-dimensional code (PC123), the equipopulated binning is done for each dimension separately (Panzeri & Treves 1996).

2. The systematic error due to limited sampling can be expanded analytically in powers of 1/N. Treves and Panzeri (1995) have shown that the first-order correction term C1 carries almost all the error, provided that N is large enough. Therefore, we correct the information estimate by subtracting C1 from the result calculated by raw equipopulated binning. The term C1 is expressed as (Panzeri & Treves 1996)

$$C_1 = \frac{1}{2N \ln 2} \left\{ \sum_s \tilde{R}_s - R - (S - 1) \right\}, \qquad (2.8)$$
where R̃_s is the number of "relevant" response bins for the trials with stimulus s.

The number of relevant bins R̃_s differs from the total number of bins R allocated, because some bins may never be occupied by responses to a particular stimulus. Thus, if the term C1 is calculated using the total number of bins R for each stimulus, the systematic error is overestimated whenever there are stimuli that fail to elicit responses spanning the full response set. The number of relevant bins also differs from the number of bins actually occupied for each stimulus (with few trials), R_s, because more trials might have occupied additional bins. Consequently, C1 is underestimated if R_s is used when only a few trials are available (the underestimation becoming negligible for R/Ns ≪ 1, because R_s then tends to coincide with R̃_s for all stimuli). We estimate R̃_s (a number between R_s and R) from the data by assuming that the expectation value of the number of occupied bins should be precisely R_s, given the number of trials available and the estimate R̃_s.
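A minimal sketch of the correction follows. For simplicity it approximates the relevant bin numbers R̃_s by the occupied bin counts R_s, whereas the paper's estimate of R̃_s lies between R_s and R, so this version slightly understates C1 when trials are few. Names are my own:

```python
import numpy as np

def first_order_correction(counts):
    """C1 of eq. 2.8 from a count table counts[s, b] (stimulus s, response bin b),
    using the occupied-bin approximation R_tilde_s ~ R_s."""
    S, R = counts.shape
    N = counts.sum()
    R_s = (counts > 0).sum(axis=1)               # occupied bins per stimulus
    return (R_s.sum() - R - (S - 1)) / (2.0 * N * np.log(2))

# usage: I_corrected = I_raw - first_order_correction(counts)
```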
Table 1: Number of Bins R Used for Codes and Numbers of Trials Ns

Ns                                    16       32           64           128
R for NOS                             16       1 + NOSmax   1 + NOSmax   1 + NOSmax
R for PC1                             16       36           63           128
R = R(1) × R(2) × R(3) for PC123      4×2×2    6×3×2        7×3×3        8×4×4
The algorithm for achieving this result is explained in Panzeri and Treves (1996). This method for correcting the information values in the face of a limited number of data samples yields reasonable results even up to R/Ns ≃ 1, the region in which we use the method in the study presented here. Equation 2.8 depends on the probability distributions much more weakly than does the mutual information itself, because the dependence is only through the parameters R̃_s. Therefore, although the parameters R̃_s have to be estimated from the data, this procedure leads to better accuracy.

The choice of the number of bins R(m) in each dimension for an experiment with S stimuli and Ns trials per stimulus remains somewhat arbitrary. Here we choose R ∼ Ns, to be at the limit of the region where the correction procedure is expected to work, and thus still be able to control finite sampling while minimizing the downward bias produced by binning into too few bins. For NOS, however, each response is just an integer ranging from 0 to the maximal number of spikes (NOS_max, which is 25 for the parvocellular cell and 34 for the magnocellular cell), so even if we allocate more bins than this maximum, the extra ones will stay empty. For a multidimensional code (e.g., PC123), we allocate the number of bins R(m) in the mth direction in relation to the amount of mutual information carried by that principal component alone, as shown in Table 1.

When differences between different codes are calculated, we use the same number of bins in the relevant dimension. When PC1 and NOS are compared, we use the same number, R, as for PC1; in this case, many bins for NOS stay empty. For comparing PC123 and NOS, we use the same number of bins for NOS as for the first principal component, which is the richest in information among the three (e.g., 8 for Ns = 128). In this way we compare quantities calculated in a homogeneous way.

2.3.4 Neural Network. A two-layer network is trained by backpropagation to classify the neuron's responses according to the stimuli that elicited them (Hertz et al. 1992; Kjaer et al. 1994). The network uses sigmoidal activations for the nodes in the hidden layer and exponential activations for the nodes in the output layer, with the sum of the outputs normalized to one after each step. The input to the network is the quantified output of the biological neuron: the spike count, the first n principal components, or both. There is one output unit for each stimulus, and, after training,
the value of output unit number s is an estimate P̂(s | r) of the conditional probability P(s | r). The mutual information is then calculated from these estimates using the second form of equation 1.1; the average over the response distribution is estimated by sampling over randomly chosen data points. All other parameters of the algorithm (notably the number of training iterations and the input representation) are controlled by cross-validation. The data are divided into training and test sets, and for each representation, the training is stopped when the test error, defined as

$$E = -\sum_{\mu} \log_2 \hat{P}(s_\mu \mid r_\mu) \qquad (2.9)$$
(the negative log-likelihood, or cross-entropy), reaches a minimum. In equation 2.9, the index µ labels the trials in the test set, and s_µ is the stimulus that actually evoked the response r_µ observed in that trial. After carrying out this procedure for four train–test data splits, the optimal response representation is chosen as the one with the lowest average test error. In the present calculations, as in previous work using the responses of neurons in the monkey visual system (Heller et al. 1995), the optimal representation consists of the spike count plus the first three principal components. Including the spike count leads to smaller cross-entropy values than those obtained from the three PCs alone.

3 Results

We calculated the information carried about a set of 32 Walsh patterns by the simulated neuronal response quantified by the number of spikes, I(S; NOS), the first principal component, I(S; PC1), and the first three principal components, I(S; PC123) (see Figure 1). For the network technique, we also calculated the information carried by the first three principal components and the spike count together, I(S; PC123s) (see Figure 2). The arrows at the right side of the panels in Figure 1 represent the asymptotic values calculated from simple binning using 10^6 trials per stimulus. These figures show the estimated transmitted information for Ns = 16, 32, 64, 128, and for two sample cells: magnocellular and parvocellular.

The parvocellular cell has sustained activity over an interval of 250 ms, whereas the magnocellular cell is active mostly over the first 100 ms and has more phasic responses (Golomb et al. 1994). Thus, we expect the multidimensional codes to capture a larger proportion of the information in its responses, while a simple rate code is more likely to be an acceptable zeroth-order description of the parvocellular cell. The results we obtain (compare panels A and C, and D and F, respectively, in Figure 1) bear this expectation out.

All of the calculations show that the first principal component is more informative about the stimulus than the spike count. As expected, this effect
Figure 1: The information about a stimulus set of 32 Walsh figures conveyed by the neuronal response of model parvocellular (A–C) and magnocellular (D–F) cells, as estimated by various methods. The response is quantified by the number of spikes (A,D), the first principal component (B,E), and the first three principal components (C,F). The mutual information is estimated by straightforward equipopulated binning (dotted lines), equipopulated binning with finite sampling correction (dashed lines), and a neural network (solid lines). The numbers of bins for each code and Ns are shown in Table 1. The neural network has six hidden units and a learning rate of 0.003. The arrows at the right indicate an asymptotic value (a very good approximation of the "true" value) obtained with equispaced binning with 36 × 20 × 20 bins and 10^6 trials per stimulus.
Figure 2: The information about a stimulus set of 32 Walsh figures conveyed by the neuronal response of model parvocellular (A) and magnocellular (B) cells. The response is quantified by the first three principal components with (solid line) and without (dotted line) the number of spikes. The mutual information is estimated by a neural network with six hidden units and a learning rate of 0.003. I(S; PC123) is shown also in Figures 1C and 1F.
is especially strong for the magnocellular cell, as the first PC weighting function suppresses contributions from spikes after the first 100 ms, which are mainly noise.

Figure 1 shows that the raw binning is strongly biased upward. The difference between the estimates made with raw and corrected binning varies very little with Ns for PC1 and PC123. This is because the first-order correction term (see equation 2.8) is approximately proportional to R/Ns, which we chose to keep roughly constant in our calculation. For NOS there is no point in choosing R above 1 + NOS_max; hence the correction term, and with it the raw estimate, decreases as Ns is increased.

Both the corrected binning method and the network method tend to underestimate the information in PC1 and PC123 (the only counterexample is shown in Figure 1E, where the binning method overestimates it). This is because both methods involve a regularization of the responses (explicit in the binning, implicit in the network), and regularization always decreases the amount of information present in the raw response. For example, the corrected binning method underestimates the information whenever the number of bins is too small to capture important features of the probability distribution of the responses. The effect is therefore strong for PC123 when the number of bins in the direction of the first PC is not large enough, and the downward bias decreases with increasing Ns because the number of bins increases too. The underestimation does not occur with NOS
because the maximal number of spikes in our examples is around 30, so finer binning would be meaningless. In general, the underestimation due to regularization is more prominent for the higher-dimensional code (PC123) because the effect of adding more bins in each dimension is stronger when the number of bins is small.

The neural network tends to find a larger amount of information carried by the first three principal components together with the spike count than by the first three principal components alone (Heller et al. 1995). For the parvocellular cell, I(S; PC123s) exceeds I(S; PC123) by 0.025 bit (2.6%) for Ns = 64 and by 0.008 bit (0.8%) for Ns = 128; for the magnocellular cell, the extra information is 0.03 bit (7%) for Ns = 64 and 0.04 bit (9%) for Ns = 128; compare the solid lines in Figures 1 and 2. This increase occurs despite the high correlation between the first principal component and the number of spikes, especially for the parvocellular cell.

In Figure 3 we present the differences I(S; PC1) − I(S; NOS), between the information carried by PC1 and NOS, and I(S; PC123) − I(S; NOS), between PC123 and NOS. For all the cases considered here, the network yields a value for the extra information that is biased downward. This shows that the network automatically regularizes responses, and apparently the regularization is stronger for the higher-dimensional code. The corrected binning technique, on the other hand, gives both downward- and upward-biased values for the extra information; in this instance, downward for the parvocellular cell.

Since the choice of the number of bins along each dimension is somewhat arbitrary for a multidimensional code, we checked the effect of using different binning schemes for Ns = 128. The results are summarized in Table 2A. The information differences for the parvocellular cell span a 30% range; the information differences for the magnocellular cell are all nearly the same, no matter which scheme is used. Thus, even in the least favorable case, information differences obtained with different binning schemes remain within the range of the remaining (downward) systematic error, about 30%. In a similar way, we varied the parameters of the network: the number of hidden units and the learning rate (see Table 2B). The difference in information varies within 10% for both cells, indicating that changes in these parameters are less important than the downward bias due to the regularization. The test error for the various network parameters is quite similar, with differences within 0.4% for both cell types; thus, it is difficult to determine the best network result from this number alone.

4 Discussion

The main result of this work is that the information carried in the firing of a single neuron during a certain time interval about a set of stimuli, and quantified by a certain code, can be estimated with reasonable accuracy (within 0.05–0.1 bit, or 10%) using either of the two procedures described
Figure 3: Differences between measures used for quantifying the response of a parvocellular cell (A,B) and a magnocellular cell (C,D). A and C show the difference between the information carried by the first PC and that carried by the number of spikes; B and D show the difference between the information in the first three PCs and that in the number of spikes. Dashed lines indicate results obtained from normalized equipopulated binning; solid lines indicate those from the neural network. In A and C, the number of bins is 16 for Ns = 16 and 36 for Ns ≥ 32. In B and D, the number of bins for the first three principal components is as given in Table 1, and the number of bins for NOS is equal to R(1) in that table, that is, to the number of bins for the first PC (4, 6, 7, and 8, respectively). The arrows at the right indicate an asymptotic value obtained with equispaced binning with 36 × 20 × 20 bins and 10^6 trials per stimulus.
above. This result is achieved with the temporal response quantified by a three-dimensional code (PC123). With a one-dimensional code (NOS or PC1) the accuracy is even better. The relatively good accuracy obtained with our methods enables one to compare various codes and determine which one carries the most information (e.g., Heller et al. 1995). To understand the results in more detail, it is important to remember that information measures are specific to the stimulus set considered, but also specific to the quantities chosen to quantify neuronal responses; also,
information measures are affected by limited sampling, which typically results in an upward bias; but also, since it is always necessary to regularize continuous responses, they may be affected by the regularization, which results in a downward bias.

Table 2: Difference in Mutual Information I(S; PC123) − I(S; NOS) (Bits) for Ns = 128

A. Several Binning Schemes

R = R(1) × R(2) × R(3)    7×6×3    8×4×4    10×4×3    15×3×3
Parvocellular cell        0.145    0.135    0.125     0.114
Magnocellular cell        0.289    0.289    0.291     0.289

B. Several Network Schemes

Hidden units, learning rate    6, 0.0003    6, 0.001    10, 0.001    6, 0.003
Parvocellular cell             0.121        0.127       0.123        0.118
Magnocellular cell             0.219        0.201       0.219        0.206

Note: The standard deviations of the differences are about 0.01.

When a simple binning of the responses was used, raw information measures were strongly biased upward, and thus it was important to apply a correction for limited sampling. If one follows this procedure, the only parameter that has to be set is R, the number of response bins. If R is chosen too large, subtracting the term C1 of equation 2.8 will not be enough to correct for limited sampling (see also Treves & Panzeri 1995); if it is chosen too small, a strong regularization is imposed, and the information is underestimated. The results indicate that it is sensible to set the number of response bins at roughly the number of trials per stimulus available, R ≃ Ns. C1 is inversely proportional to the number of trials available and roughly directly proportional to the number of response bins, and this choice approximately balances the upward bias due to finite sampling against the downward bias due to the regularization.

Choosing the number of bins for each of the first three principal components in PC123 was more delicate than for NOS or PC1 and tended to yield a stronger downward bias in information values. This suggests that the binning procedure alone becomes insufficient for higher-dimensional codes, as when the spike trains of several cells are considered together.1
1 One useful procedure for the computation of information from multiple neurons is to use decoding to extract the relative probabilities of the stimuli from the responses and thus to reduce the original set of responses to the size of the stimulus set (Gochin et al. 1994; Rolls et al. 1996a, 1996b).
For all the cases considered, the network underestimated both the mutual information and the extra information in the temporal response relative to the number of spikes. This indicates that the regularization induced by the network is enough to dispose of the finite sampling bias, at the price of underestimating information values, especially for higher-dimensional codes, which are more strongly regularized.2 Information values generally increase weakly with N, which indicates that the induced regularization has decreasing effects as N becomes large. Since the underestimation is stronger for I(S; PC123) than for the one-dimensional codes, estimates of the extra information in the second and third principal components (see Figure 3) are also biased downward.

Several parameters need to be set when the network is used. Some (e.g., the number of iterations) can be set by cross-validation. Others (e.g., the learning rate) have little effect on the results across a broad range of values, as indicated in Table 2. Another virtue of the network procedure is that it effectively incorporates a decoding step, and as such it can be applied immediately to high-dimensional (e.g., multiple single-unit) data.

Although our results were obtained with simulated data, we believe that the conclusions apply to real data as well. In particular, our network results are consistent with those of Kjaer et al. (1994, Figure 10), obtained from complex V1 cells. They also found that the estimated mutual information increases with Ns and almost reaches a plateau around Ns ≈ 15 for the first principal component, but not for the first three principal components.

In this work, principal components are used to quantify the data with a low-dimensional code, because the first principal components carry most of the difference among the responses to different stimuli. However, principal component analysis is not essential to our procedures for handling finite sampling problems; any kind of n-dimensional response extracted from the neuronal firing patterns can be used (e.g., PC123s, the NOS of several neurons, or neuronal spiking times; Heller et al. 1995). Estimating the accuracy of the methods can be done only for n ≤ 3, because for higher n, calculating the asymptotic value of the transmitted information by straightforward binning requires too much computer time.

Our procedures for finite sampling corrections were demonstrated here on stimuli with a sharp onset in time. They are also applicable, however, to continuously changing stimuli (after a suitable discretization), as long as the response to each stimulus is measured during a fixed time interval T, and stimuli are either discrete or have been discretized.

Important results have been obtained by Bialek and collaborators by extracting information measures from neuronal responses (Bialek 1991; Bialek et al. 1991; de Ruyter van Steveninck & Laughlin 1996; Rieke et al. 1993). These results are based on invertebrate (insect) data, which are usually easy
2 One could even compute the corresponding C1 term (Panzeri & Treves 1996).
to collect in large quantities, so the limited sampling artifacts are less severe. Our analysis is particularly relevant to primate data, which are typically much less abundant. Note, however, that a recent paper (Strong et al. 1996) analyzes finite sampling effects following the method of binning with finite-size correction (Treves & Panzeri 1995). A second, less crucial difference between the type of experimental data considered here and that used by those investigators is that the stimulus in their case is a continuous quantity. This opens up the possibility of using notions such as the linearized limit and the gaussian approximation, which do not apply to our situation with a discrete, nonmetric set of stimuli and which, when applicable, can further alleviate finite sampling effects.

The network procedure needs a long computational time. A typical calculation for Ns = 128 and 32 stimuli runs for about 7 CPU hours on an SGI-ONYX computer. The binning-and-correcting technique is much faster, and most of its CPU time is taken up by sorting responses in order to construct equipopulated bins. For I(S; NOS), which involves no sorting, a calculation with Ns = 128 runs in less than 1.8 seconds on an HP-Apollo computer.

What is the minimum number of trials to plan for in an experiment? In our instance, results seemed reasonable when Ns = 32. However, this number depends on the stimuli, the response representation, and the neurons used. The argument sketched above indicates how to estimate the minimum number of trials in other experiments. The binning-and-correcting procedure functions reasonably up to Ns ≃ R, and the minimum R that may, if the appropriate type of response is chosen, not throw away much information,3 is R = S. Therefore a minimum of Ns = S trials per stimulus is a fair demand to be made on the design of experiments from which information estimates are to be derived.

Acknowledgments

AT and SP developed their procedure using data from the labs of Edmund Rolls, Bruce McNaughton, and Gabriele Biella, all of which was extremely useful. Partial support was from CNR and INFN (Italy).

References

Abbott, L. F., Rolls, E. T., & Tovée, M. J. (1996). Representational capacity of face coding in monkeys. Cerebral Cortex, 6, 498–505.
Bialek, W. (1991). Optimal signal processing in the nervous system. In W. Bialek (Ed.), Princeton lectures on biophysics. London: World Scientific.
3 R = S is the effective R used when applying a decoding procedure, as with multiunit data by Rolls et al. (1996b).
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Carlton, A. G. (1969). On the bias of information estimate. Psych. Bull., 71, 108–109.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
de Ruyter van Steveninck, R. R., & Laughlin, S. B. (1996). The rates of information transfer at graded-potential synapses. Nature, 379, 642–645.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble encoding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.
Golomb, D., Kleinfeld, D., Reid, R. C., Shapley, R. M., & Shraiman, B. I. (1994). On temporal codes and the spatiotemporal response of neurons in the lateral geniculate nucleus. J. Neurophysiol., 72, 2990–3003.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comp. Neurosci., 2, 175–193.
Hertz, J. A., Kjaer, T. W., Eskander, E. N., & Richmond, B. J. (1992). Measuring natural neural processing with artificial neural networks. Int. J. Neural Syst., 3 (suppl.), 91–103.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network models, information estimation and spatial tuning. J. Comp. Neurosci., 1, 109–139.
McClurkin, J. W., Optican, L. M., Richmond, B. J., & Gawne, T. J. (1991). Concurrent processing and complexity of temporally encoded neuronal messages in visual perception. Science, 253, 675–677.
Miller, G. A. (1955). Note on the bias of information estimates. Info. Theory Psychol. Prob. Methods, II-B, 95–100.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biol. Cybernet., 65, 305–310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: II. Quantification of response waveform. J. Neurophysiol., 57, 147–161.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Reid, R. C., & Shapley, R. M. (1992). Spatial structure of cone inputs to receptive fields in primate lateral geniculate nucleus. Nature, 356, 716–718.
Richmond, B. J., & Optican, L. M. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: III. Information transmission. J. Neurophysiol., 57, 162–178.
Rieke, F., Warland, D., & Bialek, W. (1993). Coding efficiency and information rates in sensory neurons. Europhys. Lett., 22, 151–156.
Rolls, E. T., Treves, A., Tovée, M. J., & Panzeri, S. (1996a). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Submitted.
Rolls, E. T., Treves, A., & Tovée, M. J. (1996b). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research. In press.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Entropy and information in neural spike trains. Los Alamos archives cond-mat/9603127. Submitted.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the response of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
Received February 5, 1996; accepted June 24, 1996.
Communicated by Peter Dayan
Covariance Learning of Correlated Patterns in Competitive Networks

Ali A. Minai
Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA
Covariance learning is a powerful type of Hebbian learning, allowing both potentiation and depression of synaptic strength. It is used for associative memory in feedforward and recurrent neural network paradigms. This article describes a variant of covariance learning that works particularly well for correlated stimuli in feedforward networks with competitive K-of-N firing. The rule, which is nonlinear, has an intuitive mathematical interpretation, and simulations presented in this article demonstrate its utility.

1 Introduction

Covariance-based synaptic modification for associative learning (Sejnowski 1977) is motivated mainly by the need for synaptic depression as well as potentiation. Long-term depression (LTD) has been reported in biological systems (Levy & Steward 1983; Stanton & Sejnowski 1989; Fujii et al. 1991; Dudek & Bear 1992; Mulkey & Malenka 1992; Malenka 1994; Selig et al. 1995) and is computationally necessary to avoid saturation of synaptic strength in associative neural networks. In the recurrent neural networks literature, covariance learning has been proposed for the storage of biased patterns, that is, binary patterns with probability(bit i = 1) ≠ probability(bit i = 0) (Tsodyks & Feigel'man 1988), and this rule is provably optimal among linear rules for the storage of stimulus-response associations between randomly generated patterns under rather general conditions (Dayan & Willshaw 1991). However, the prescription is not as successful when patterns are correlated, that is, when probability(bit i = 1) ≠ probability(bit j = 1). The correlation causes greater similarity between patterns, making correct association or discrimination more difficult. I present a covariance-type learning rule that works much better in this situation (Green & Minai 1995, 1996).

The motivation for looking at correlated patterns arises from their ubiquity in neural network settings. For example, if the patterns for an associative memory are characters of the English alphabet on an n×n grid, some grid locations (e.g., those representing a vertical bar on the left edge) will be much more active than others (e.g., those corresponding to a vertical bar on the
right edge). The same argument applies to more complicated stimuli, such as images of human faces. This article mainly addresses another common source of correlation: the processing of (possibly initially uncorrelated) patterns by one or more prior layers of neurons with fixed weights. This situation arises, for example, whenever the output of a "hidden" layer needs to be associated with a set of responses. The results also apply qualitatively to other correlated situations.

The results presented here are limited to the case where all neural firing is competitive K-of-N. This type of firing is widely used in neural modeling to approximate "normalization" of activity in some brain regions by inhibitory interneurons (e.g., Sharp 1991; O'Reilly & McClelland 1994; Burgess et al. 1994). The relationship with the noncompetitive case (e.g., Dayan & Willshaw 1991) is discussed but is not the focus of this article.

2 Methods

The learning problem considered here was to associate m stimulus-response pattern pairs in a two-layer feedforward network. The stimulus layer was denoted S and the response layer R. All stimulus patterns had N_S bits with K_S 1's and N_S − K_S 0's. Response patterns had N_R bits with K_R 1's and N_R − K_R 0's. This K-of-N firing scheme approximates a competitive situation where broadly directed afferent inhibition keeps overall activity stable. Such competition is likely to occur in the dentate gyrus and cornu ammonis (CA) regions of the mammalian hippocampus (Miles 1990; Minai & Levy 1994), which has led to the use of K-of-N firing in several hippocampal models (Sharp 1991; O'Reilly & McClelland 1994; Burgess et al. 1994). The outputs of stimulus neuron j and response neuron i are denoted x_j and z_i, respectively, and w_ij denotes the modifiable weight from j to i. The activation of response neuron i is given by

$$y_i = \sum_{j \in S} w_{ij} x_j. \qquad (2.1)$$
All learning was done off-line; that is, the weights were calculated explicitly over a given stimulus-response data set. Performance was tested by presenting each stimulus pattern to a trained network, producing an output through K_R-of-N_R firing, and comparing it with the correct response. A figure of merit for each network on a given data set was calculated as

P = \frac{h - q}{K_R - q},   (2.2)

where h is the number of correctly fired neurons averaged over the whole data set and q ≡ K_R^2 / N_R is the expected number of correct firings for a random response with K_R active bits. Thus, P = 1 indicated perfect recall, and P = 0 represented performance equivalent to random guessing.
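To make the measure concrete, here is a minimal sketch in Python/NumPy (the function name and array layout are my own, not from the article) computing P from equation 2.2 given recalled and target response matrices:

```python
import numpy as np

def performance(recalled, target, K_R):
    """Figure of merit P (equation 2.2) for binary response matrices of
    shape (m, N_R), where each target row has exactly K_R active bits."""
    N_R = target.shape[1]
    # h: correctly fired neurons per pattern, averaged over the data set.
    h = (recalled * target).sum(axis=1).mean()
    # q: expected correct firings for a random K_R-of-N_R response.
    q = K_R ** 2 / N_R
    return (h - q) / (K_R - q)   # P = 1: perfect recall; P = 0: chance
```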
Figure 1: Histograms of individual neuron activities (i.e., fraction of patterns for which a neuron is 1) for four different levels of correlation. (a) Randomly generated patterns, no correlation. (b) 50 of 200 prestimulus neurons active per pattern (f_P = 0.25). (c) 100 of 200 prestimulus neurons active (f_P = 0.5). (d) 150 of 200 prestimulus neurons active (f_P = 0.75). The distribution of activities is close to normal for random patterns but becomes increasingly skewed with growing correlation.
Correlated stimulus patterns were generated as follows. First, an N_S × N_P weight matrix V was generated, such that each entry in it was a zero-mean, unit-variance uniform random variable. Then m independent K_P-of-N_P prestimulus patterns were generated, and each of these was then linearly transformed by V and the result thresholded by a K_S-of-N_S selection process to produce m stimulus patterns. The correlation arose because of the fixed V weights and could be controlled by changing the fraction f_P ≡ K_P/N_P, termed the correlation parameter in this article. Figure 1 shows histograms of average activity for individual stimulus neurons in cases with different degrees of correlation. Correlated response patterns were generated similarly with another set of V weights.
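A minimal sketch of this generation procedure follows (my reading of the text; in particular, the uniform range [-√3, √3] is one way to obtain zero-mean, unit-variance entries, and top-K selection implements the K-of-N thresholding):

```python
import numpy as np

def k_of_n(y, K):
    """Competitive K-of-N selection: the K largest activations fire."""
    out = np.zeros(len(y), dtype=int)
    out[np.argsort(y)[-K:]] = 1
    return out

def correlated_patterns(m, N_P, K_P, N_S, K_S, rng):
    """m correlated K_S-of-N_S stimuli from random K_P-of-N_P prestimuli
    passed through a fixed random projection V (the source of correlation)."""
    # Zero-mean, unit-variance uniform entries lie in [-sqrt(3), sqrt(3)].
    V = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N_S, N_P))
    stimuli = np.zeros((m, N_S), dtype=int)
    for p in range(m):
        pre = np.zeros(N_P)
        pre[rng.choice(N_P, size=K_P, replace=False)] = 1.0
        stimuli[p] = k_of_n(V @ pre, K_S)
    return stimuli

# Example with f_P = K_P / N_P = 0.25, as in Figure 1b:
S = correlated_patterns(m=200, N_P=200, K_P=50, N_S=200, K_S=10,
                        rng=np.random.default_rng(0))
```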
3 Mathematical Motivation

The learning rule proposed in this article is motivated directly by three existing rules: the Hebb rule (Hebb 1949), the presynaptic rule, and the covariance rule. This motivation is developed systematically in this section. Several other related rules that are also part of this study are briefly discussed at the end of the section.

For binary patterns, insight into various learning rules can be obtained through a probabilistic interpretation of weights. Consider the normalized Hebb rule, defined by

w_{ij} = \langle x_j z_i \rangle,   (3.1)
where \langle \cdot \rangle denotes an average over all m patterns. The weights in this case can be written as

w_{ij} = P(x_j = 1, z_i = 1),   (3.2)
that is, the joint probability of pre- and postsynaptic firing. For correlated patterns, this is not a particularly good prescription. A significantly better learning rule is the presynaptic rule:

w_{ij} = \frac{\langle x_j z_i \rangle}{\langle x_j \rangle} = P(z_i = 1 \mid x_j = 1).   (3.3)
This rule is the off-line equivalent of the on-line incremental prescription

w_{ij}(t + 1) = w_{ij}(t) + \epsilon x_j(t) (z_i(t) - w_{ij}(t)),   (3.4)
hence the designation "presynaptic." This incremental rule has been investigated by several researchers (Minsky & Papert 1969; Grossberg 1976; Minai & Levy 1993; Willshaw et al. forthcoming). With this rule, the dendritic sum, y_i^\mu, of output neuron i when the \mu-th stimulus is presented to the network is given by

y_i^\mu = \sum_j P(z_i = 1 \mid x_j = 1) x_j^\mu,   (3.5)
which sums the conditional firing probabilities for i over all input bits active in the \mu-th stimulus. This rule works well in noncompetitive situations, where each output neuron sets its own firing threshold, and in competitive situations where all output units have identical firing rates. However, in a competitive situation with correlated patterns, output neurons with high firing rates acquire large weights and lock out other output units, precluding proper recall of many patterns. A rule that addresses the problem of differences in output unit firing rates is the covariance rule,

w_{ij} = \langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle = P(x_j = 1, z_i = 1) - P(x_j = 1) P(z_i = 1) = P(x_j = 1) [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)],   (3.6)
which accumulates the conditional increment in probability of firing over active stimulus bits:

y_i^\mu = \sum_j P(x_j = 1) [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)] x_j^\mu.   (3.7)
However, it also scales the incremental probabilities by P(x_j = 1), which means that highly active stimulus bits contribute more to the dendritic sum than less active ones, even if both bits imply the same increment in firing probability. This is not a problem when all stimulus units have the same firing rate, but it significantly reduces performance for correlated stimuli.

The analysis of the presynaptic and covariance rules immediately suggests a hybrid that combines their strengths and reduces their weaknesses. This is the proposed presynaptic covariance rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\langle x_j \rangle} = \frac{\langle x_j z_i \rangle}{\langle x_j \rangle} - \langle z_i \rangle = P(z_i = 1 \mid x_j = 1) - P(z_i = 1).   (3.8)
Now the dendritic sum simply accumulates incremental firing probabilities over all active inputs,

y_i^\mu = \sum_j [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)] x_j^\mu,   (3.9)
and the K_R output units with the highest accumulated increments fire. It can be shown easily that in the K-of-N situation, the mean dendritic sums for the presynaptic, covariance, and presynaptic covariance rules are

Presynaptic: \langle y_i \rangle = K_S \langle z_i \rangle
Covariance: \langle y_i \rangle = \sum_j \langle x_j \rangle \mathrm{Cov}_{ij}
Presynaptic covariance: \langle y_i \rangle = 0,   (3.10)

where K_S is the number of active bits in each stimulus and Cov_ij is the sample covariance of x_j and z_i. Clearly the presynaptic covariance rule "centers" the dendritic sum to produce fair competition in a correlated situation. Note, however, that there are limits to how much correlation any rule can tolerate; this issue is addressed in section 5.
3.1 Other Rules. It is also instructive to compare learning rules other than those mentioned above with the presynaptic covariance rule. These are:

1. Tsodyks-Feigel'man (TF) rule:

w_{ij} = \langle (x_j - p)(z_i - r) \rangle,   (3.11)

where p = N_S^{-1} \sum_{j \in S} \langle x_j \rangle and r = N_R^{-1} \sum_{i \in R} \langle z_i \rangle.
2. Postsynaptic covariance rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\langle z_i \rangle}.   (3.12)
3. Willshaw rule:

w_{ij} = H(\langle x_j z_i \rangle),   (3.13)
where H(y) is the Heaviside function.

4. Correlation coefficient rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\sigma_j \sigma_i},   (3.14)
where σ_j and σ_i are the sample standard deviations of x_j and z_i, respectively.

The TF rule differs from the covariance rule in its use of global firing rates p and r rather than separate ones for each neuron. The two rules are identical for uncorrelated patterns. Dayan and Willshaw (1991) have shown that the TF rule is optimal among all linear rules for noncompetitive networks and uncorrelated patterns. The postsynaptic covariance rule was included in the list mainly because of its superficial symmetry with the presynaptic covariance rule. In fact, the postsynaptic covariance rule sets weights as

w_{ij} = \frac{\langle x_j z_i \rangle}{\langle z_i \rangle} - \langle x_j \rangle = P(x_j = 1 \mid z_i = 1) - P(x_j = 1) = \langle x_j \mid z_i = 1 \rangle - \langle x_j \rangle.   (3.15)
Thus, the weights to neuron i come to encode the centroid of all stimulus patterns that fire i, offset by the average vector of all stimuli. Essentially, this rule corresponds to a biased version of Kohonen-style learning. It is likely to be useful mainly in clustering applications. The correlation coefficient rule is included because it represents a properly normalized version of the standard covariance rule. The weights are, by definition, correlation coefficients, so the firing of neuron i is based on how positively it correlates with the given stimulus pattern. Finally, the inclusion of the Willshaw rule (Willshaw, Buneman, & Longuet-Higgins 1969) reflects its wide use, elegant simplicity, and extremely good performance for sparse patterns.
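For completeness, these four additional rules admit equally short sketches under the same conventions as the earlier snippet (again my own illustration; H(0) is taken to be 0 in the Willshaw rule):

```python
import numpy as np

def learn_other(X, Z, rule, eps=1e-12):
    mx, mz = X.mean(axis=0), Z.mean(axis=0)
    mxz = Z.T @ X / len(X)
    if rule == "tf":                          # equation 3.11, global rates p and r
        p, r = mx.mean(), mz.mean()
        return (Z - r).T @ (X - p) / len(X)
    if rule == "postsynaptic_covariance":     # equation 3.12
        return (mxz - np.outer(mz, mx)) / (mz[:, None] + eps)
    if rule == "willshaw":                    # equation 3.13, with H(0) = 0
        return (mxz > 0).astype(float)
    if rule == "correlation_coefficient":     # equation 3.14
        sx, sz = X.std(axis=0), Z.std(axis=0)
        return (mxz - np.outer(mz, mx)) / (np.outer(sz, sx) + eps)
    raise ValueError(rule)
```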
4 Simulation Results

Three aspects were evaluated by the simulations reported here: pattern density, the effect of correlation, and capacity. Unless otherwise indicated, both stimulus and response layers had 200 neurons. N_P was also fixed at 200. All data points were averaged over ten independent runs, each with a new data set. In each run, all rules were applied to the same data set in order to get a fair comparison.

4.1 Pattern Density. In this set of simulations, the number of active stimulus bits, K_S, and active response bits, K_R, for each pattern were varied systematically, and the performance of all rules was evaluated in each situation. In each run, the stimuli and the responses were correlated through separate, randomly generated weights, as described above. The number of associations was fixed at 200. Both stimulus and response patterns were correlated with f_P = 0.25. The results are shown in Table 1. It is clear that the presynaptic covariance rule performs better than any other rule over the whole range of activities. At low stimulus activity, the presynaptic covariance rule is significantly better than the standard covariance rule over the entire range of response activities. As stimulus activity increases, the performance of the presynaptic covariance rule falls to that of the standard rule. This is because the effect of division by \langle x_j \rangle is attenuated as all average stimulus activities approach 1, making the weights under the presynaptic covariance rule almost uniformly scaled versions of those produced by the standard rule. This suggests that the best situation for using the presynaptic covariance rule is with sparse stimuli. As expected, results with all other rules were disappointing.

Table 1: Performance of Each Rule at Different Levels of Stimulus and Response Coding Densities.

Rule                    | 10/10          | 10/100         | 10/190         | 100/10         | 100/100        | 100/190        | 190/10         | 190/100         | 190/190
Presynaptic covariance  | 0.8211 ±0.0129 | 0.6012 ±0.0131 | 0.8236 ±0.0068 | 0.8032 ±0.0056 | 0.5202 ±0.0091 | 0.7969 ±0.0092 | 0.6518 ±0.0363 | 0.4182 ±0.0180  | 0.6623 ±0.0272
Covariance              | 0.6676 ±0.0295 | 0.4337 ±0.0207 | 0.6700 ±0.0192 | 0.7936 ±0.0086 | 0.5117 ±0.0133 | 0.7861 ±0.0115 | 0.6776 ±0.0302 | 0.4358 ±0.0187  | 0.6805 ±0.0232
Presynaptic             | 0.4213 ±0.0156 | 0.4717 ±0.0284 | 0.4320 ±0.0219 | 0.2116 ±0.0172 | 0.3434 ±0.0157 | 0.2163 ±0.0321 | 0.2082 ±0.0120 | 0.3485 ±0.0158  | 0.2035 ±0.0188
Hebb                    | 0.2854 ±0.0326 | 0.4006 ±0.0123 | 0.2768 ±0.0179 | 0.2051 ±0.0187 | 0.3480 ±0.0183 | 0.2117 ±0.0330 | 0.2087 ±0.0120 | 0.3485 ±0.0159  | 0.2131 ±0.0243
Tsodyks-Feigel'man      | 0.3675 ±0.0535 | 0.4467 ±0.0236 | 0.3439 ±0.0252 | 0.2723 ±0.0156 | 0.3842 ±0.0188 | 0.2719 ±0.0215 | 0.3596 ±0.0476 | 0.4558 ±0.0090  | 0.3673 ±0.0270
Postsynaptic covariance | 0.2157 ±0.0232 | 0.4340 ±0.0214 | 0.6698 ±0.0203 | 0.2860 ±0.0404 | 0.5112 ±0.0138 | 0.7926 ±0.0083 | 0.2198 ±0.0206 | 0.4362 ±0.0184  | 0.6854 ±0.0247
Willshaw                | 0.6761 ±0.0264 | 0.2856 ±0.0243 | 0.1247 ±0.0231 | 0.1187 ±0.0102 | 0.0082 ±0.0296 | 0.0066 ±0.0164 | 0.0227 ±0.0211 | −0.0014 ±0.0220 | 0.0074 ±0.0270
Correlation coefficient | 0.6306 ±0.0263 | 0.5276 ±0.0163 | 0.6452 ±0.0182 | 0.6654 ±0.0165 | 0.5075 ±0.0136 | 0.6570 ±0.0338 | 0.3190 ±0.0378 | 0.2567 ±0.0412  | 0.3600 ±0.0387

Note: The network has N_S = N_R = 200, and 200 stored associations in each run. Each rule was evaluated in nine situations: stimulus activities of 10, 100, and 190 out of 200, each at response activities of 10, 100, and 190 out of 200. These are indicated in the top row of the table by stimulus/response activity. The ± entries indicate the sample standard deviation in each case. N_P was set at 200 and K_P at 50 for both stimuli and responses. Each data point is averaged over ten independent runs.

4.2 Effect of Correlation. The correlated stimuli and responses used for simulations were generated from random patterns with activity f_P = K_P/N_P via fixed random weights. The higher the fraction f_P, the more correlated the patterns (see Figure 1). To evaluate the impact of different levels of correlation on learning with the presynaptic, covariance, and presynaptic covariance rules, three values of f_P (0.25, 0.5, and 0.75) were investigated. In all cases, N_P was fixed at 200, and network stimuli and responses were 10/200, quite sparse. The results are shown in Figure 2.

Figure 2: The performance of the learning rules at different correlation levels for stimulus and response correlations. In each graph, the × symbol indicates the performance of the presynaptic covariance rule, the solid line the performance of the standard covariance rule, and the dotted line that of the presynaptic rule. (a) The case of correlated stimuli and uncorrelated responses. (b) The results for correlated responses and uncorrelated stimuli. In all cases, stimulus and response activities were set at 10-of-200. N_P was set at 200 and K_P at 50. Each data point was averaged over ten independent runs. The network had N_S = N_R = 200, and 200 associations were stored during each run.

Figure 2a shows the effect of stimulus correlation with uncorrelated response patterns. Clearly all three rules lose some performance, as would be expected from the fact that stimulus correlation corresponds to a loss of information about the outputs. However, the presynaptic covariance rule is significantly superior to the covariance rule. The presynaptic rule, which should theoretically have the same performance as the presynaptic covariance rule in this case, is affected by the residual response correlation introduced by finite sample size. Figure 2b shows the impact of response correlation. As expected from theoretical analysis, the covariance and presynaptic covariance rules have almost identical performance, while the presynaptic rule is significantly worse.
An interesting feature, however, is the U-shape of the performance curve for the presynaptic rule, which reflects the rule's inherent propensity toward clustering. This is discussed in greater detail below.

4.3 Capacity. To evaluate the capacity of the various learning rules, increasingly large sets of associations were stored in networks of sizes 100, 200, and 400. In each case, the size referred to the number of stimulus or response neurons, which was the same. The correlation parameter, f_P, was fixed at 0.25 and stimulus and response activity at 5% active bits per pattern. The results are shown in Figures 3 and 4 for correlated stimuli and responses. Again, the presynaptic covariance rule did better than all others at all sizes and loadings. Figure 3d shows how capacity scales with network size for the presynaptic covariance rule. Each line represents the scaling for a specific performance threshold. Clearly, the scaling is linear in all cases, though it is not clear whether this holds for all levels of correlation. Figure 4 shows how the slope of the size-capacity plot changes with performance threshold. The results for the case with correlated stimulus only are also plotted for comparison. The theoretical analysis of these results will be presented in future papers.
Figure 3: The performance of the learning rules for data sets and networks of different sizes. In graphs a–c, the dashed line shows the performance of the Willshaw rule, the dot-dashed line that of the TF rule, and the dotted line the performance of the normalized Hebb rule. The performance of the presynaptic covariance rule is indicated by × and that of the standard covariance rule by ◦. Graph d shows the empirically determined capacity of the presynaptic covariance rule as a function of network size for performance levels of 0.95 (×), 0.9 (◦), 0.8 (∗), and 0.75 (+). In all cases, stimulus and response activities were set at 10-of-200. N_P was set at 200 and K_P at 50 for both stimuli and responses. Each data point was averaged over ten independent runs.
5 Discussion

It is instructive to consider the semantics of the presynaptic covariance rule vis-à-vis those of the standard covariance rule in more detail. The weights produced by the presynaptic covariance rule may be written as

w_{ij} = P(z_i = 1 \mid x_j = 1) - P(z_i = 1) = \langle z_i \mid x_j = 1 \rangle - \langle z_i \rangle.   (5.1)
Thus, a positive weight means that x_j = 1 is associated with higher-than-average firing of response neuron i, and a negative weight indicates the reverse. In this way, effects due to disparate activities of stimulus neurons are mitigated, and the firing of response neurons depends only on the average incremental excitatory or inhibitory significance of individual stimulus neurons, not just on covariance.
Figure 4: The graph shows the slope of the capacity-size line for the presynaptic covariance rule (cf. Figure 3d) as a function of the performance level used to determine capacity. The curve with × markers corresponds to the case reported in Figure 3, and the curve with ◦ markers is for correlated stimuli and uncorrelated responses. The latter is included for comparison purposes. All parameters are as in Figure 3.
In a sense, "cautious" stimulus neurons (those that fire seldom, but when they do fire, reliably indicate response firing) are preferred over "eager" stimulus neurons (those that fire more often but often without concomitant response firing). This is very valuable in a competitive firing situation. To take a somewhat extreme illustrative example, consider a response neuron i with \langle z_i \rangle = 0.5, and two stimulus bits x_1 and x_2 with \langle x_1 \rangle = 0.1, \langle x_1 z_i \rangle = 0.1, and \langle x_2 \rangle = 0.9, \langle x_2 z_i \rangle = 0.5. Thus, x_1 = 1 ⇒ z_i = 1 and z_i = 1 ⇒ x_2 = 1 (or, alternatively, x_2 = 0 ⇒ z_i = 0). However, x_2 = 1 does not imply z_i = 1. Under the covariance rule, w_{i1} = w_{i2} = 0.05, even though x_1 = 1 is a reliable indicator of z_i = 1 while x_2 = 1 is not. The presynaptic covariance rule sets w_{i1} = 0.5 and w_{i2} = 0.0556, reflecting the increase in the probability of z_i = 1 based on each stimulus bit (x_1 = 1 raises the probability of z_i = 1 to 1.0, while x_2 = 1 raises it only to 0.5556). In this way, each stimulus bit is weighted by its reliability in predicting firing.
The information x = 1 ⇒ z = 1 is extremely valuable even if x = 0 does not imply z = 0, while x = 0 ⇒ z = 0 without x = 1 ⇒ z = 1 is of only marginal value. Simple covariance treats these two situations symmetrically, whereas in fact they are quite different. The presynaptic covariance rule rectifies this. However, a price is paid in that correct response firing becomes dependent on just a few cautious stimulus bits, creating greater potential sensitivity to noise. It should be noted that the asymmetry of significance between the two situations just described arises because the usual linear summation of activation treats x = 1 and x = 0 asymmetrically, discarding the "don't fire" information in x = 0.

The comparison with the presynaptic rule is also interesting. As shown in Figure 2b, the presynaptic rule has a U-shaped dependence on output correlation, while both covariance rules have a monotonically decreasing one. This reflects the basic difference in the semantics of the two rule groups. In the presence of output correlation and competitive firing, the presynaptic rule, like the Hebb rule, has a tendency to cluster stimulus patterns, that is, to fire the same output units for several stimuli. These "winner" outputs are the neurons with a high average activity. This is a major liability if there are other output neurons with moderately lower average activity, since their firing is depressed disproportionately. However, as output correlation grows to a point where extremely disproportionate firing is the actual situation, the clustering effect of both the presynaptic and the Hebb rules ceases to be a liability, and they recover performance. In the extreme case where some response neurons fire for all stimuli and the rest for none, the presynaptic and Hebb rules are perfect. In contrast, the covariance rules fail completely in this situation because all covariances are 0. The main point, however, is that for all but the most extreme response correlation, the presynaptic covariance rule is far better than the presynaptic or the Hebb rules. The presynaptic and the presynaptic covariance rules are identical in the noncompetitive situation with adaptive firing thresholds.

5.1 Competitive versus Noncompetitive Firing. The results of this article apply only to the competitive firing situation. However, it is interesting to consider briefly the commonly encountered noncompetitive situation, where each response neuron fires independently based on whether its dendritic sum exceeds a threshold. By assuming that thresholds are adaptive, an elegant signal-to-noise ratio (SNR) analysis can be done to evaluate the performance of different learning rules for threshold neurons (Dayan & Willshaw 1991). This SNR analysis is based on the dendritic sum distributions for the cases where the neuron fires (the "high" distribution) and those where it does not (the "low" distribution). The optimal firing threshold for the neuron can then be found easily.

The key facilitating assumption in SNR analysis is that each response neuron fires independently, but this is not so in the competitive case. Indeed, for competitive firing, there is no fixed threshold for any response neuron;
an effective threshold is determined implicitly for each stimulus pattern depending on the dendritic sums of all other response neurons in the competitive group. Thus, the distribution of a single neuron's dendritic sum over all stimulus patterns says little about the neuron's performance. It is, rather, the distribution of the dendritic sums of all neurons over each stimulus pattern that is relevant. The "high" distribution for stimulus k covers the dendritic sums of response neurons that should fire in response to stimulus k, and the "low" distribution covers the dendritic sums of those response neurons that should not fire. The effective firing threshold for stimulus pattern k is then determined by the dendritic sum of the K_R-th most excited neuron. In a sense, there is a duality between the competitive and noncompetitive situations. In the former, the goal is to ensure that the correct neurons win for each pattern; in the latter, the concern is that the correct patterns fire a given neuron.

Figure 5: Errors of omission and commission by the presynaptic covariance and the standard covariance rule in a single run with noncompetitive firing. The threshold for each neuron was set at the empirically determined intersection point of its "low" and "high" dendritic sum distributions. Graphs a and b show the histograms for errors of commission (a) and omission (b) for the presynaptic covariance rule, and graphs c and d show the same for the standard covariance rule. The networks have N_S = N_R = 400, and K_S = K_R = 50. The correlation parameter is 0.25 for both stimuli and responses, and 200 associations are stored in each case.

Figure 5 illustrates the results of comparing the errors committed by the covariance and presynaptic covariance rules on the same learning problem
in the noncompetitive case. The network has 400 neurons in both layers and stores 200 associations. Activity was set at 50 (12.5%) to ensure a reasonable sample size, and f_P was set to 0.25. The threshold was set at the intersection point of the empirically determined "high" and "low" dendritic sum distributions. This is optimal for gaussian distributions. Errors of omission (0 instead of 1) and commission (1 instead of 0) are shown separately. Although Figure 5 is based on only a single run, it does suggest that the presynaptic covariance rule improves on the covariance rule in the noncompetitive case as well.
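For reference, the per-neuron threshold used for Figure 5 can be found empirically; the following sketch (the binning choices are assumptions of this illustration) locates the crossing of the "low" and "high" dendritic sum histograms:

```python
import numpy as np

def crossing_threshold(y_low, y_high, bins=50):
    """Empirical intersection of the 'low' and 'high' dendritic-sum
    distributions of one neuron, estimated from shared-bin histograms."""
    lo = min(y_low.min(), y_high.min())
    hi = max(y_low.max(), y_high.max())
    edges = np.linspace(lo, hi, bins + 1)
    h_low, _ = np.histogram(y_low, bins=edges, density=True)
    h_high, _ = np.histogram(y_high, bins=edges, density=True)
    # First bin, scanning upward, where the 'high' density overtakes 'low'.
    k = int(np.argmax(h_high > h_low))
    return 0.5 * (edges[k] + edges[k + 1])
```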
6 Conclusion

The results presented here demonstrate that for correlated patterns in competitive networks, the presynaptic covariance rule is clearly superior to the presynaptic rule and the standard covariance rule when the stimulus patterns are relatively sparsely coded (average activity < 0.5). This is true with respect to both capacity and performance. These conclusions need to be formally verified through mathematical analysis and more simulations. The results of this work will be reported in future papers. For uncorrelated stimulus patterns, simulations (not shown) indicate that the standard covariance rule has virtually the same performance as the presynaptic covariance rule.
References

Burgess, N., Recce, M., & O'Keefe, J. (1994). A model of hippocampal function. Neural Networks, 7, 1065–1081.
Dayan, P., & Willshaw, D. J. (1991). Optimising synaptic learning rules in linear associative memories. Biol. Cybernetics, 65, 253–265.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Fujii, S., Saito, K., Miyakawa, H., Ito, K.-I., & Kato, H. (1991). Reversal of long-term potentiation (depotentiation) induced by tetanus stimulation of the input to CA1 neurons of guinea pig hippocampal slices. Brain Res., 555, 112–122.
Green, T. S., & Minai, A. A. (1995). Place learning with a stable covariance-based learning rule. Proc. WCNN, 1, 573–576.
Green, T. S., & Minai, A. A. (1996). A covariance-based learning rule for the hippocampus. In J. M. Bower (Ed.), Computational Neuroscience (pp. 21–26). San Diego: Academic Press.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybernetics, 23, 121–134.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797.
Malenka, R. C. (1994). Synaptic plasticity in the hippocampus: LTP and LTD. Cell, 78, 535–538.
Miles, R. (1990). Variation in strength of inhibitory synapses in the CA3 region of guinea-pig hippocampus (in vitro). J. Physiol., 431, 659–676.
Minai, A. A., & Levy, W. B. (1993). Sequence learning in a single trial. Proc. WCNN, 2, 505–508.
Minai, A. A., & Levy, W. B. (1994). Setting the activity level in sparse random networks. Neural Computation, 6, 85–99.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Mulkey, R. M., & Malenka, R. C. (1992). Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron, 9, 967–975.
O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff. Hippocampus, 4, 661–682.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Selig, D. K., Hjelmstand, G. O., Herron, C., Nicoll, R. A., & Malenka, R. C. (1995). Independent mechanisms for long-term depression of AMPA and NMDA responses. Neuron, 15, 417–426.
Sharp, P. E. (1991). Computer simulation of hippocampal place cells. Psychobiology, 19, 103–115.
Stanton, P. K., & Sejnowski, T. J. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 215–218.
Tsodyks, M. V., & Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Willshaw, D., Hallam, J., Gingell, S., & Lau, S. L. (Forthcoming). Marr's theory of the neocortex as a self-organizing neural network. Neural Computation.
Received January 4, 1996; accepted June 28, 1996.
Communicated by Christopher Bishop
A Mobile Robot That Learns Its Place

Sageev Oore
Geoffrey E. Hinton
Department of Computer Science, University of Toronto, Toronto, Canada M5S 1A4
Gregory Dudek
School of Computer Science, McGill University, Montreal, Canada H3A 2A7
We show how a neural network can be used to allow a mobile robot to derive an accurate estimate of its location from noisy sonar sensors and noisy motion information. The robot's model of its location is in the form of a probability distribution across a grid of possible locations. This distribution is updated using both the motion information and the predictions of a neural network that maps locations into likelihood distributions across possible sonar readings. By predicting sonar readings from locations, rather than vice versa, the robot can handle the very nongaussian noise in the sonar sensors. By using the constraint provided by the noisy motion information, the robot can use previous readings to improve its estimate of its current location. By treating the resulting estimates as if they were correct, the robot can learn the relationship between location and sonar readings without requiring an external supervision signal that specifies the actual location of the robot. It can learn to locate itself in a new environment with almost no supervision, and it can maintain its location ability even when the environment is nonstationary.

1 Introduction

We describe a method that allows a robot equipped with a ring of sonar range finders to learn to compute its location within an initially unfamiliar environment. The obvious way to solve this problem is to build an explicit map of the environment. An interesting alternative, however, is for the knowledge of the environment to be implicit in the weights of a neural network that relates sonar readings to locations.

The robot, of small trash-can proportions, has internal sensors that detect the movement of the wheels. These motion sensors may not be entirely accurate (they are blind to skidding) and therefore are not sufficient for accurate navigation over long periods of time. Each of the 12 uniformly spaced Polaroid sonar transducers crowning the robot acts as both transmitter and receiver, measuring the return time of the first echo from an emitted chirp.
The time is converted into a distance r, and the vector r of all 12 readings obtained at a point is called a sonar signature. The sonar sensings, too, are affected by various forms of noise, both reproducible and irreproducible. For example, the sound emitted by each sensor spreads out in a cone, and rays within the cone may make different numbers of bounces before returning. This can lead to phantom walls and phantom holes in corners (Dudek, Jenkin, Milios, & Wilkes, 1993; Wilkes, Dudek, Jenkin, & Milios, 1990), depending on the configuration and reflective properties of objects in the room. In addition, faulty sensors, cross-talk between transducers, energy dissipation of the sonar signal, unreturned echoes, and very nearby objects can lead to irreproducible random effects that have no apparent structure. Moreover, even the accurate sensings can be assumed to have approximately gaussian additive noise. In this work, we have used simulator-based data, as will be described in section 3.1, and have assumed that the robot's orientation is known.

2 Learning a Model That Relates Locations to Sonar Readings

2.1 Predicting the Location from the Sonar Readings. There is an obvious way of using a feedforward neural network to derive a location from a set of sonar readings. The inputs to the net are the sonar readings, and the outputs are the location, which may be represented as two real-valued coordinates or may be a probability distribution over a grid of possible locations. Although this type of net works moderately well (Dudek & Hinton 1992), it has some major drawbacks. During the initial training, the network requires supervision in the form of accurate information about the true location of the robot, and further supervision is required whenever the environment changes. The network is also poor at handling situations in which a sonar sensor is unreliable or a small change in location causes the sensor to return the maximum value rather than the distance to a small, close surface. When this happens, the input to the network changes a lot, but the outputs should change only slightly. Such slight changes are hard to achieve in a standard feedforward network because the outputs must be sensitive to at least some of the inputs.

2.2 Integrating Sonar Information over Time. Instead of using a neural network to predict locations from sonar readings, we could predict in the opposite direction and then use Bayes' theorem. For each sonar signature, r, the same feedforward neural network is applied once at each possible grid cell, g_i, to estimate the likelihood, P(r | g_i), of obtaining that signature from that cell. Instead of integrating over all possible locations within each cell, the neural net takes as input the location, x_i, of the center of the cell and produces as output a conditional probability distribution under which the density of r is computed.
plied by the robot’s prior probability of being at the cell, and the products are renormalized1 to obtain a posterior probability distribution that incorporates the information in the current sonar readings. In the case that we know the environment is stationary (a common assumption), we can improve the efficiency of the tracking algorithm by a large factor, once the neural network has been trained, by keeping the density distribution functions P(r | gi ) in a lookup table, and thus eliminating the need for any more forward passes through the network: Ptposterior (gi )
P(rt | gi )Ptprior (gi ) . = P t t j P(r | gj )Pprior (gj )
(2.1)
To make the updating straightforward, we need to assume that the prior probabilities of cells are independent of the likelihoods obtained for the current sonar readings. This independence assumption is dubious because the prior probability distribution was derived using the same neural network and the same sensors at earlier nearby locations. To see just how bad the independence assumption can be, imagine what would happen if the robot stayed at exactly the same location and kept repeating its sonar readings. It would eventually be certain that it was in the cell that had the highest likelihood of producing the observed distribution of sonar signatures, even if many other cells had a likelihood almost as high. It is well known that maximum likelihood estimation will yield a biased prediction of the variance of a distribution, which could further increase the chances of confidence in an incorrect position estimate. To check for this effect, we tried using MacKay's (1991) unbiased variance prediction method and found that the results did not differ significantly.

The prior probability distribution is obtained by taking the previous posterior distribution and updating it using the noisy movement information.[2] To eliminate the increase in entropy of the distribution caused by quantization effects when the movement in each direction is not an integral multiple of the grid spacing, we separate the expected movement in each dimension into an integral and a fractional part. The fractional part, f^{t-1,t}, is accumulated by moving the location of every grid cell:

x_i^t = x_i^{t-1} + f^{t-1,t}.   (2.2)
[1] By choosing a fine enough grid, we can assume that the density of the position distribution is locally linear within a cell, and therefore the probability mass contained in the cell is well approximated by using the probability density of a reading being obtained at the center of a cell.

[2] This approach resembles Kalman filtering, but the probability distribution over the underlying state is nongaussian, and the model that relates the underlying state to the observations is nonlinear.
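Pulling the pieces together, one update cycle of this grid-based filter might look as follows. This is a sketch under simplifying assumptions (`np.roll` wraps at the borders instead of accumulating mass in edge cells, and the blur anticipates equation 2.3 below, with `sigma_cells` being the motion noise expressed in cell widths); `likelihood` stands in for the per-cell outputs of the sonar network:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_step(P, offset, move_cm, cell_cm, sigma_cells, likelihood):
    """One cycle: motion shift (eq. 2.2), gaussian blur (eq. 2.3),
    then multiply by sonar likelihoods and renormalize (eq. 2.1)."""
    total = np.asarray(offset, dtype=float) + np.asarray(move_cm, dtype=float)
    cells = np.floor(total / cell_cm).astype(int)   # whole-cell part of the move
    offset = total - cells * cell_cm                # fractional part, 0 <= offset < cell
    P = np.roll(P, shift=tuple(cells), axis=(0, 1))
    P = gaussian_filter(P, sigma=sigma_cells)       # blur by the motion noise
    P = P * likelihood                              # current sonar evidence
    return P / P.sum(), offset
```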
We can imagine this procedure as manipulating a single vector offset associated with the entire grid. For example, suppose each grid cell is 10 cm wide, and a motion of 23 cm to the east is perceived. Then the probabilities in each cell are moved over by two cells, and furthermore the offset of the entire grid is increased by 3 cm (i.e., the centers of the individual cells are themselves displaced). If the offset was already 9 cm, then we would instead move the probabilities over by three cells and subtract 7 cm from the offset, to keep the offset between zero and one cell width. After moving the probabilities over and adjusting the grid offset, we blur the distribution by the assumed noise, σ^{t-1,t}, in the movement estimate:

P^t_{prior}(g_i) = \sum_j w_{ji} P^{t-1}_{posterior}(g_j),   (2.3)
where the w_ji are the terms in a discrete approximation to a gaussian convolution kernel. In our simulation, the actual motion of the robot is a random sample from a gaussian with mean µ^{t-1,t} and standard deviation σ^{t-1,t}. We have not added orientation noise to the motion model. When the boundaries of the environment are known, any probability of being outside the grid can be accumulated in the nearest edge cell. In this way, continual motion in the same direction will eventually lead to near certainty of being somewhere along the corresponding edge.

2.3 Representing the Robot in the Environment. The most convenient and natural way of representing the environment, given the nature of our localization method, is in the form of a grid, where a value P(g_i) is associated with each cell representing the probability of the robot being in that cell. This is reminiscent of Moravec's (1988) certainty grid but with a key difference: the typical certainty grid lends itself well to constructing a map of the environment and has been used effectively for that purpose (e.g., see Thrun et al., in press). In our approach the sensor readings are used immediately to update the estimate of the robot's position; using other representations, there needs to be a way of making combined use of the map and the sensor readings to compute the location of the robot within that map (e.g., by modeling the inverse sonar process or optimizing a template match). In our approach the structure of the environment remains implicit in the models that assign probability distributions to sonar readings at each grid location. The integration over time of our knowledge of the environment is handled implicitly in the learning, whereas the integration of motion over time is handled explicitly by moving the probability distribution.

2.4 Learning to Predict Sonar Readings from Location. We considered several different types of networks for estimating the likelihoods of the sonar readings. All of these nets had two input units to represent location coordinates, and the output units were divided into 12 groups, with each
group specifying the parameters of a probability distribution for the readings from one sonar sensor. When the true location of the robot was known, the net was trained to maximize the log probability density of the actual sonar readings under the distribution represented by the output values of the net when the inputs were the true location.

Initially we used a single hidden layer of logistic units and modeled the probability distribution for each sonar sensor as a mixture of a gaussian and a uniform distribution between 0 and the maximum value of 4 m. This required three output units per sonar sensor: a logistic unit for the mixing proportion of the uniform distribution, a linear unit for the mean of the gaussian, and an exponential unit for its variance. The use of an exponential is appropriate for a scale parameter like the variance, and it prevents the variance from ever becoming negative or zero. In our initial model, the predicted mean of the gaussian could be a nonlinear function of the two input coordinates because we used logistic hidden units.

Later we found that it was better to use a different type of nonlinearity to predict the mean of the gaussian from the input coordinates. The mean is typically a linear function of the input coordinates over a small region but then switches to a different linear function when the sonar return comes from a different planar surface (see Fig. 1), so it is natural to use a mixture of linear models for predicting the mean. We used 12 separate networks, one for each sonar sensor; each such network, n, implemented a mixture of local models. Each local model was itself a mixture of a gaussian and a uniform. The mixing proportions of the local models were a function of the current input coordinates and were specified by the activity levels of a group of normalized radial basis units. The activity level of each basis unit, j, was determined by the relative proximity of its center, c_{nj}, to the two input coordinates, x:

\psi_{nj}(x) = \frac{\exp\left[ -\|x - c_{nj}\|^2 / 2\sigma^2_{RBF_n} \right]}{\sum_k \exp\left[ -\|x - c_{nk}\|^2 / 2\sigma^2_{RBF_n} \right]}.   (2.4)
Both the centers of the units and their shared variance σ²_{RBF_n} were learned.[3] Each basis unit has three associated output units that represent the parameters of a predicted probability distribution for one sonar reading. This distribution is a mixture of a gaussian and a uniform. The basis unit has one adaptive outgoing weight that determines the mixing proportion, z_j, of the gaussian, one adaptive weight for the log of the variance, and three more outgoing weights, a, b, that define a linear model, which predicts the mean of the gaussian from the input coordinates.
[3] The planar walls of the room cause the piecewise linear regions in Figure 1 to have linear boundaries. This makes it appropriate for the radial basis functions to have the same variances because this ensures that, after normalization, the region dominated by an RBF has linear boundaries.
Figure 1: Readings from sensor 2. East-west and north-south positions are plotted along the x- and y-axes, respectively. Signatures were taken at every 2.5 cm inside a 2 m × 1 m rectangular region in the middle of a cluttered room. The measurements obtained by sensor 2 are plotted on the vertical axis. The measurements that drop below the 0 plane indicate locations where the sonar pulse did not return a detectable echo. Many of the vertical surfaces in the room are (nearly) orthogonal to the x- or y-axes, but sensor 2 is not parallel to either axis of the room. So sensor 2 tends to produce signals that have more multiple reflections and therefore a greater number of smaller discontinuous pieces.
A picture of the network for a single sensor is shown in Figure 2. The likelihood distribution produced by basis unit j is

P_{nj}(r_n \mid x) = z_{nj} \frac{1}{\sqrt{2\pi\sigma_{nj}^2}} \exp\left[ \frac{-\|r_n - a_{nj} \cdot x - b_{nj}\|^2}{2\sigma_{nj}^2} \right] + \frac{1 - z_{nj}}{r_{max}}.   (2.5)
We have allowed each region an adaptive uniform noise mixing component because it is possible that the noisy, or unpredictable, readings are a function not only of the sensor but of the location of the robot as well. The normalized activity levels of the basis units are treated as mixing proportions for combining their likelihood distributions into a single likelihood
distribution for each sensor n:

P_n(r_n \mid x) = \sum_j P_{nj}(r_n \mid x)\, \psi_{nj}(x).   (2.6)

Figure 2: Final network model for one sonar sensor. The upper network shows the normalized radial basis function units. A single variance σ²_{RBF} is inferred for all regions of each sensor model. For each of the M regions, a linear unit is used to predict the mean of the gaussian, L_i(x), exponential units are used to learn variances, and sigmoid units are used to learn mixing proportions for uniform noise.
The number of normalized radial basis functions (RBFs) needed to reach a certain positional accuracy is related to the size and complexity of the environment. When a local model is responsible for only a single data point, there is a danger that it will assign an extremely high probability density to the observed reading by collapsing the width of the gaussian component of its output distribution toward zero. This form of overfitting is avoided by tying together the widths of the output gaussians for all the local models for one sonar sensor. Generally, the more RBFs used, the better the performance tended to be, at the expense of learning time. We investigated some methods for pruning and adding RBF units during learning, with encouraging results, though these are beyond the scope of the current discussion (e.g., this could
be very useful in an environment that changes drastically over time). When x is known, all parameters (a, b, and σ, as well as those associated with the RBF) can be learned by maximizing log P_n(r_n | x) (i.e., by gradient descent), but when the robot's true location is unknown and all that is available is the robot's current probability distribution over cells, it is less obvious what should be optimized. Our unsupervised algorithm, which is closely related to expectation maximization, maximizes the expected log likelihood of the sonar signatures, where the expectation is taken with respect to the posterior probabilities of being at each grid cell:

\langle \log P(r \mid x) \rangle_{P(x)} = \int_x \log P(r \mid x)\, P(x)\, dx   (2.7)
\approx \sum_i \log P(r \mid g_i)\, P_{posterior}(g_i),   (2.8)
where P_posterior(g_i) was calculated in equation 2.1. We used the conjugate gradient method to tune the parameters.
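In code, the objective of equation 2.8 for a single signature is just a posterior-weighted sum (hypothetical names; the per-cell log likelihoods would sum log P_n over the 12 sensors):

```python
import numpy as np

def expected_log_likelihood(log_lik_cells, P_posterior):
    """Equation 2.8: sum_i log P(r | g_i) * P_posterior(g_i).
    Gradients of this scalar with respect to the network parameters drive
    the conjugate gradient updates; the posterior weights are held fixed."""
    return float((log_lik_cells * P_posterior).sum())
```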
3 Experimental Results

3.1 The Simulated Environment and Noise. The environment and sonar data used in this work were generated synthetically by a sophisticated simulator provided by Wilkes et al. (1990). This simulator used a ray-tracing algorithm to capture many of the reproducible forms of nongaussian noise that can occur with sonar sensors, such as phantom holes, phantom walls, and occasional failures to detect nearby objects. Furthermore, to account for the noise that is irreproducible, we have made the worst-case assumption by replacing a certain fraction of the readings by random values chosen from a uniform distribution in the interval (0, r_max). This replacement was done by assigning each sensor i an "unreliability" quotient q_i, so that any reading taken by sensor i would have a probability q_i of being replaced by a random number. The unreliability quotients were {0.1, 0.5, 0.001, 0.01, 0.1, 0.001, 0.1, 0.2, 0.001, 0.001, 0.1, 0.05}. In addition, to make the task more difficult, all sensor readings that attained the maximum value r_max were also replaced by a random number. Even for sensor readings considered "reliable," gaussian noise was added (with standard deviations of {4, 4, 3, 3, 5, 5, 5, 5, 2, 1, 2, 10} cm). Note that the net does not attempt to model the relationship between the value of a sonar reading and the distance to the nearest object, only the relationship between the value of the sonar reading and the position of the robot. This means that any reproducible effects can potentially be utilized for better localization, including the "nearby objects appearing far away" effect, as long as there is some degree of consistency some of the time. The fraction of the time when there is no consistency can simply be modeled by the uniform noise-mixing proportion.
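The irreproducible part of this noise model is easy to reproduce in simulation; here is a sketch using the quotients and standard deviations quoted above (checking saturation before noise is added is one reading of the text):

```python
import numpy as np

Q = np.array([0.1, 0.5, 0.001, 0.01, 0.1, 0.001,
              0.1, 0.2, 0.001, 0.001, 0.1, 0.05])             # unreliability quotients
SD = np.array([4, 4, 3, 3, 5, 5, 5, 5, 2, 1, 2, 10]) / 100.0  # gaussian sd, meters

def corrupt(r, r_max=4.0, rng=None):
    """Worst-case corruption of a 12-reading sonar signature r (in meters)."""
    rng = rng or np.random.default_rng()
    bad = (rng.random(12) < Q) | (r >= r_max)  # unreliable or saturated readings
    r = r + rng.normal(0.0, SD)                # gaussian noise on reliable readings
    r[bad] = rng.uniform(0.0, r_max, bad.sum())
    return r
```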
Figure 3: Top-down view of complete sonar signatures at two points 10 cm apart. The position of the robot is marked by the circle near the middle of the room (the one on the left is just slightly further north than the one on the right). The thicker lines are meant to be reflective surfaces: walls, dividers, chair legs, feet, a soda can under a desk, and a door. Each of the thin lines centered at the robot represents the distance that was measured by the sensor facing in that direction. For example, using a clock numbering of the sensors, the distance to the east wall was accurately measured in both cases by sensor 3, while some confusion occurred with sensor 9 as to the location of the western wall due to the interference of a chair base on the right. The lines that touch the outer boundaries—the measurements of sensor 2 and sensor 4—represent the directions from which no strong echo was returned.
The environment used in all of the following tests was a cluttered office, about 5 m by 5 m. Figure 3 illustrates the environment and signatures taken from two nearby positions. Note the difference in the two signatures. These signatures have not yet had any random noise added to them. We believe the clutter in the environment to be a realistic model of working environments, where numerous objects are left all over the place.

3.2 Tracking Algorithm. The unsupervised algorithm relies on the premise that combining the knowledge of the motion with the sonar-based network predictions will produce better position estimates than those that would be obtained by using either source of information alone. To demonstrate that this is typically true, we used supervised training of the network and then compared the errors of the dead-reckoned estimates (motion alone), the network estimates (sonar alone), and the filtered estimates (sonar and motion combined) as the robot travels along a path. The position is actually represented by a probability distribution over a grid, but for these comparisons, we used the mean of the distribution as the estimate.
Figure 4: Absolute errors in the predicted position for a single path. The noise in the motion sensors is multiplicative gaussian with a mean of zero and standard deviation of σ_motion = 5 cm for every 20 cm of motion.
In Figure 4, a network was first trained on a set of 80 labeled examples taken from a boxed region of the position space, and then a random path 30 steps long was traversed without giving position information. The tracking algorithm is successful; the errors of the filtered predictions in Figure 4 are roughly equal to the smaller of the sonar- and motion-based prediction errors. Also shown is the expected value of the absolute error that would be obtained over many runs of a single path using the same noise model for the motion sensors. This smooth curve would continue to increase as the square root of the number of time steps. The variance in the errors of the dead-reckoned predictions would increase as well.

At the early stages of the unsupervised learning, the network will typically make larger errors, but these will be more easily overwhelmed by the motion constraints, since the distributions predicted by the net will have a higher entropy. In Figure 5, a network was used that had been trained on only 20 examples. Although the variance in the sonar sensor predictions was considerably higher, the filtered predictions were still quite good. We also did runs in which we kept the robot's motion model as described above (i.e., zero-mean gaussian) but affected the motion sensings with a systematic 10% overestimation error in addition to the gaussian noise (i.e., biased noise). Still, the filtered errors generally remained below 10 cm (Oore, 1995). A minimally trained network can be quite useful provided the net does not give too much confidence to its incorrect predictions. Occasional, accurate, high-certainty predictions made by the net cause a confident posterior estimate.
Figure 5: Errors in the predicted position for a single path. The motion sensings were affected by a systematic error (in which all traveled distances were overestimated by a factor of 10%) as well as a gaussian noise process with a larger variance, σ_motion = 10 cm for every 20 cm traveled. The net was trained on 20 examples.
The motion model will then maintain this accuracy for a number of time steps, even if the subsequent predictions by the network are poor.

3.3 Unsupervised Learning. The motivation for an unsupervised method for robot learning begins with the dilemma that dead reckoning alone is insufficient but manual recalibration is too expensive to be practical. The aim of the unsupervised learning procedure is to discover a good generative model of the relationship between location and sonar readings. If the true location is known, this can be done by maximizing the likelihood of each observed sonar reading under the probability distribution produced by the model. Since the true location of the robot is unknown, the best we can do is to estimate the probability that it is at each grid location and then weight the learning by these estimated probabilities.

3.3.1 Training Procedure. The general training procedure used for the unsupervised learning consisted of a few steps:

0. Initialization. The 12 networks for predicting the likelihood distribution of a sonar reading from the coordinates of a grid point were initialized with small random weights, except that the normalized RBF centers were distributed uniformly over the environment. The robot was placed in the room without any position information (i.e., a uniform distribution over the grid).
1. Data collection. The robot traversed a short path, consisting of a sequence of steps. An obstacle-avoidance mechanism was assumed in generating the path. A sonar signature was taken at each step, and the filtering algorithm given by equation 2.1 was used to update the probability distribution over grid cells by integrating the motion and sonar sensings. The updated distribution over cells was combined with the current sonar signature to yield weighted training examples of grid cell–sonar signature pairs. The examples were accumulated over all the steps in the path.

2. Training. Using the data just collected, the network was trained for a fixed number of iterations, after which the training set was discarded. The last filtered position distribution from the previous path was retained as the initial distribution when returning to step 1 to acquire more data. The weight vector of the trained net was retained as the starting point for the next minimization. Although the paths were random, they were generated to drift slowly from left to right and back again (and again) while drifting up and down more quickly.
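In outline, the procedure just described alternates the filtering of section 2.2 with weighted likelihood maximization; here is a structural sketch (the helper callables are placeholders for the machinery described above, not part of the article):

```python
import numpy as np

def unsupervised_training(n_paths, steps_per_path, grid_shape,
                          take_step, filter_update, train_nets):
    """Steps 0-2 of the training procedure, in outline.

    take_step() -> (move, signature); filter_update(P, move, r) -> new P
    (equations 2.1-2.3); train_nets(examples) runs ~30 conjugate gradient
    iterations on the posterior-weighted examples (equation 2.8)."""
    P = np.full(grid_shape, 1.0 / np.prod(grid_shape))  # step 0: uniform prior
    for _ in range(n_paths):
        examples = []                                   # step 1: data collection
        for _ in range(steps_per_path):
            move, r = take_step()
            P = filter_update(P, move, r)
            examples.append((P.copy(), r))              # weighted training pairs
        train_nets(examples)                            # step 2: then discard data
        # P carries over as the initial distribution for the next path.
```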
3.3.2 Performance of the Unsupervised Learning Algorithm. The cluttered office environment was used, with a motion noise of σ_motion = 2.5 cm for every 20 cm traveled. The network was trained for 30 conjugate gradient iterations after every 30 steps. Figure 6 shows the actual paths traveled and the paths as predicted from the sonar readings alone by the network, after 3 and then after 35 paths. The first peculiarity to observe is that although the filtered estimates correctly capture the shape of the true path, they are a translation of it. This can be explained by noting that the motion sensings provide relative position information, while the only indications of absolute position are an initial distribution, if given, and the edges of the grid if they are reached by the exploration. Thus, since the paths we used systematically continued to touch the east-west walls but did not extend fully in the north-south direction, it is not surprising that the robot constructed a model that was self-consistent (that is, consistent with the motion constraints) but was a vertical translation of the actual environment. Examining the sonar-based errors, we can see that in the earlier predictions, the network seemed to get the right vicinity (modulo a translation) but made gross errors. After continuing to train over another 32 paths, the network learned a model that was a more accurate translation of the true environment. In both cases, the majority of the larger errors in the sonar predictions were corrected by the filtering algorithm.
Figure 6: Real vs. predicted paths. The axes of the graphs can be seen as representing a top-down view of a portion of the environment, in which the robot took a zigzag path going from west to east. For visual clarity, the paths have not been adjusted to match up. Inspection will reveal that the sonar sensings in the lower left-hand region of the earlier path are very loosely related to the shape of the true path. Nevertheless, it is immediately clear that they correspond to the right general region. In contrast to this, looking at the right-hand side of the later path shows that the sonar-based estimates exhibit a much closer match to the filtered and true positions.
Figure 7: Learning while exploring. The means were calculated over sets of paths, and the standard deviations were estimated from the sample, assuming gaussian noise (which is incorrect for values that cannot be negative).
Figure 7 shows the filtered and sonar errors with error bars for paths as the learning progresses, after performing the appropriate translation.[4] We see that the filtered predictions were almost always better and had significantly smaller error bars than the sonar predictions. At around the fiftieth path, the variance in the sonar errors suddenly increased, for at this point the environment was rearranged (chairs were added, moved, etc.). The average error did not increase by much because it was only in certain parts of the room that the network's predictions were affected. This allowed the motion constraints to maintain accurate localization for a few time steps, and then, before the filtered predictions started to deviate by much, a few confident and accurate sonar predictions in a familiar part of the room were all that were needed to recalibrate the tracking effectively. Since the normalized RBFs respond locally, they can quickly and easily adapt to local changes in the room. Although the network acted desirably in this case, problems can arise if too many consecutive, overconfident, but inaccurate sonar predictions are made. The threshold that determines overconfidence is, in part, a function of the noise model in the motion sensings. A possible danger is that if a certain region is avoided during initial training because of an obstacle, then when that obstacle is later removed, there may still be a hole in the posterior where the object used to be.
this will be prevented to a certain extent for two reasons. First, during the initial period of training, if the distribution of the robot's position is very entropic, the parameters for the distributions will be learned so that the entire position space is covered, and therefore those places that are not contained in the posterior distributions at later stages will still have been reasonably initialized early on. Second, for many of the sensor readings (in particular, those that are not pointing at the obstacle in question), the readings on either side of the obstacle are quite likely to belong to the same linear region, and therefore the corresponding linear predictors are already tuned to interpolate correctly through the occupied region.

Various modifications in the training procedure can significantly improve the tracking and learning. For example, the learning can be accelerated by starting with a net trained on a small number of labeled examples rather than using a purely random initial net. Alternatively, the initial net can be untrained, but accurate position information can be given at the beginning of each of the first few paths. If the robot is returned to the same part of the room to start each path, it learns to make confident and accurate predictions in that region, and with further exploration it gradually expands the region in which its model is accurate. The exploration strategy and length of paths between minimizations are also important. If too many of the training points in one path come from the same part of the room, then the centers of the RBF units need to be frozen to prevent an uneven allocation of the network's resources. Although some of these variations speeded up the learning, they were not crucial and were not used in the runs we have described.

4 Relationships to Other Approaches

4.1 Hidden Markov Models. The cells in the grid correspond to the states in a hidden Markov model, and the uncertainty in the movement information corresponds to the matrix of transition probabilities between states. The fact that this matrix changes at each time step is unusual but not incompatible with the standard hidden Markov model framework. Instead of having a separate output model at each state, all of the states share the same 12 neural networks, but the likelihood distributions produced by these nets are contingent on the real-valued location coordinates associated with a state. These coordinates change in a deterministic way as the robot moves, but this does not create any additional problems. In estimating the parameters of the output models, we make one major simplification relative to the Baum-Welch method that is normally used for hidden Markov models (Baum & Eagon, 1967). The correct way to compute the posterior probability distribution over cells given a sequence of vectors of sonar readings is to use the forward-backward algorithm. Repeated application of equations 2.1, 2.2, and 2.3 corresponds to the forward pass, but we ignore the backward pass and so cannot make use of the information that sonar readings at time t provide about the true location of the robot at
times earlier than t. The advantage of our suboptimal method of estimating posterior probabilities is that it can be used online. Also, Neal and Hinton (1993) show that the convergence of expectation maximization does not require the correct posterior distribution to be used. All that is required is that the E-step produce a distribution that has a smaller Kullback-Leibler distance from the true posterior than the previous estimate.5 This is no longer guaranteed when the backward pass is ignored, but it is unlikely that a forward pass using the current parameters in the output models will give a worse estimate of the true posterior given those parameters than a forward pass using the previous parameters.

4.2 Kohonen-Style Self-Organizing Maps. Kröse and Eecen (1994) show that it is possible to compute the location of a robot by first building a topographic map from the vectors of sonar readings alone. The use of a self-organizing map (SOM) partially overcomes the need for labeled training data, but unfortunately, the function that is optimized is not appropriate for the location task. If we view the SOM in generative terms, it attempts to minimize the squared reconstruction error when the input vector is reconstructed from the winning hidden unit or from one of its neighbors. This would be the right thing to minimize if the constraint on sonar readings was that neighboring locations produced similar readings. But this is not true when the robot passes close to the edge of a surface. Provided we know that the movement is small, a much more realistic constraint is that temporally adjacent readings come from neighboring locations. This is precisely the constraint captured by our approach if we treat the movement as random gaussian noise. Obviously, if we know more about the precise movement, we can make this constraint stronger. The relationship between our method and the SOM can be understood by viewing them as two quite different elaborations of a simple mixture model in which each hidden unit has a vector of 12 parameters that represent expected values for the 12 sonar readings. The SOM learning algorithm elaborates this model by forcing neighboring hidden units to have similar parameter vectors. It also uses a hard decision rule for deciding on a winner instead of using posterior probabilities to weight the learning. We elaborate the simple mixture model by using nongaussian output models for each cell in the grid and by tying all these models together to reduce greatly the number of free parameters. Different cells can still predict different probability distributions for the sonar readings because the predictions are conditional on the location coordinates of the cell. Finally, instead of just using the current sonar readings to compute the posterior probability of a cell, we take into account the earlier readings and the motion information, thus allowing the temporal constraints that are ignored by the SOM to influence the learning.
5 Surprisingly, the relevant Kullback-Leibler distance is Σ_α Q_α log(Q_α / P_α), where Q is the estimated probability distribution and P is the true posterior.
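As a concrete reading of the footnote's formula, a minimal sketch (the smoothing constant eps is an added assumption to avoid taking log of zero):

```python
import numpy as np

def kl_distance(q, p, eps=1e-12):
    """Kullback-Leibler distance KL(Q || P) = sum_a Q_a * log(Q_a / P_a)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

# The E-step need only move closer to the true posterior p_true:
#   kl_distance(q_new, p_true) < kl_distance(q_old, p_true)
```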
Acknowledgments

This research was funded by the Institute for Robotics and Intelligent Systems and by the Natural Science and Engineering Research Council. Hinton is the Nesbitt-Burns fellow of the Canadian Institute for Advanced Research. We thank Radford Neal, Allan Jepson, and Brendan Frey for helpful comments.

References

Baum, L., & Eagon, J. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73.
Dudek, G., & Hinton, G. (1992). Navigating without a map by directly transforming sensory input to location. Unpublished manuscript.
Dudek, G., Jenkin, M., Milios, E., & Wilkes, D. (1993). Reflections on modelling a sonar range sensor (Tech. Rep. No. CIM-92-9). Montreal: McGill Research Centre for Intelligent Machines, McGill University.
Kröse, B., & Eecen, M. (1994). A self-organizing representation of sensor space for mobile robot navigation. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems (pp. 9–14). New York: IEEE.
Mackay, D. (1991). Bayesian modelling and neural networks. Unpublished doctoral dissertation, California Institute of Technology, Pasadena, CA.
Moravec, H. (1988). Sensor fusion in certainty grids for mobile robots. AI Magazine, pp. 61–74.
Neal, R., & Hinton, G. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript.
Oore, S. (1995). A mobile robot that learns to estimate its position from a stream of sonar measurements. Unpublished master's thesis, University of Toronto.
Thrun, S., Buecken, A., Burgard, W., Fox, D., Froehlinghaus, D., Henning, D., Hofmann, T., Krell, M., & Schmidt, T. (In press). Map learning and high-speed navigation in RHINO. In D. Kortenkamp, R. Bonasso, & R. Murphy (Eds.), Case Studies of Successful Robot Systems. Cambridge, MA: MIT Press.
Wilkes, D., Dudek, G., Jenkin, M., & Milios, E. (1990). A ray following model of sonar range sensing. In Proc. Mobile Robots V (pp. 536–542). Bellingham, WA: International Society for Optical Engineering.
Received December 5, 1995; accepted August 2, 1996.
REVIEW
Communicated by Terrence J. Sejnowski
Similarity, Connectionism, and the Problem of Representation in Vision

Shimon Edelman
Sharon Duvdevani-Bar
Department of Applied Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel
A representational scheme under which the ranking between represented similarities is isomorphic to the ranking between the corresponding shape similarities can support perfectly correct shape classification because it preserves the clustering of shapes according to the natural kinds prevailing in the external world. This article discusses the computational requirements of representation that preserves similarity ranks and points out the relative straightforwardness of its connectionist implementation.

1 Introduction

1.1 Two Problems of Representation. It is possible to distinguish between two problems about mental representation (Cummins, 1989). The first of these, the problem of representations (plural), is basically empirical; two instances of this problem are the search for the representations employed by natural cognitive systems and the design of the representational substrate for artificial cognitive modules. The second problem is what Cummins calls the Problem of Representation (singular). Here, the central question is how, in principle, an internal state of a system can refer to anything at all in the external world. Curiously, the Problem of Representation is rarely discussed in cognitive psychology. On the one hand, arguments put forward by Marr (1982) prompted cognitive scientists to contemplate the fact that not all representations are computationally equivalent; the classical example is the difficulty of doing arithmetic with roman numerals, compared to the ease of calculation using positional notation. On the other hand, it is generally taken for granted that a certain pattern of neuronal activity in someone's brain can represent the shape of a cat or, amazingly, that physically distinct patterns of activity in the brains of two people can stand for the same shape (that of a cat).

1.2 First-Order Isomorphism: Representation by Similarity. Reviews in cognitive psychology tend to adopt extreme positions concerning representation. For example, Palmer (1978) asks, "What is representation?" and
then proceeds with an in-depth survey of the field, the upshot of which is that "we . . . do not really understand our concepts of representation." In contrast, Suppes, Pavel, and Falmagne (1994) state flatly, "Representation of something is an image, model, or reproduction of that thing." This amounts to the Aristotelian idea of representation by similarity; according to it, a representation, say, of a cat, can help you know about cats only if it resembles one. Among the problems raised by this notion of representation is the obligatory involvement of an observer: Cartoon cats resemble real cats to us; the two kinds share no properties other than perceptual. Another problem is the inability of representations of this type to deal with abstraction (does having an image of a striped cat qualify as a representation of a calico cat?). Because of these and other considerations, in the philosophy of mind the idea of representation by similarity has been discredited so thoroughly as to inspire a rare consensus (Cummins, 1989).

1.3 Second-Order Isomorphism: Representation of Similarity. Barring resemblance, or "first-order structural isomorphism" (Shepard, 1968), between the object and the entity that stands for it internally, what relationship can qualify as representation of something by something else? Shepard (1968) suggests "second-order" isomorphism: "The isomorphism should be sought—not in the first-order relation between (a) an individual object, and (b) its corresponding internal representation—but in the second-order relation between (a) the relations among alternative external objects, and (b) the relations among their corresponding internal representations. Thus, although the internal representation for a square need not itself be square, it should (whatever it is) at least have a closer functional relation to the internal representation for a rectangle than to that, say, for a green flash or the taste of a persimmon" (Shepard & Chipman, 1970, p. 2). More recently, the idea of representation by second-order isomorphism has been advanced, under various guises, in a number of fields in cognitive science. Typically the researchers in these fields take for granted the implausibility of representation by similarity, that is, by first-order isomorphism. Consequently the theories mention merely "isomorphism," implying that the isomorphism holds between structures (and is therefore "second order," in Shepard's terms), and not between individual entities. A particularly clear exposition of this point can be found in Gallistel (1990), where the example of numerical representation of physical quantities such as mass is discussed. Gallistel points out that "the statement that the number 10 is isomorphic to some particular mass is meaningless. It is the system of numbers that is isomorphic to the system of masses" (p. 24). Gallistel then proceeds to defend the view that cognitive systems represent space and time by harboring structures that are isomorphic to corresponding structures in the world, subject to the qualification regarding the importance of entire structures (that is, elements plus relations),
as opposed to individual elements. A similar proposal, albeit in the context of representation of conceptual rather than perceptual spaces, can be found in Holland, Holyoak, Nisbett, and Thagard (1986). Here the theory is couched explicitly in terms of a homomorphism between a collection of states and a set of possible transitions between states in the world, and in the internal representation. This version of representation by second-order isomorphism has been discussed very recently in Cummins (1996), where it is, unfortunately, called "The Picture Theory of Representation."1 Importantly, Cummins echoes Gallistel in pointing out that constituents of a structure are meaningful only in the context of some structure. In the context of visual object representation, this observation translates into the notion that only certain relationships between the objects—not the shapes of the individual objects themselves—need to be represented. What could these relationships be? Shepard's example, stated in the preceding section, clearly suggests that the represented structure should be that of similarity relationships between objects. Accordingly, this approach can be termed "representation of similarity" (see Figure 1). Our goal in the rest of this article is to outline some of the mathematical issues raised by this approach to representation and to examine its viability in the context of shape vision.

2 Toward Veridical Representation

2.1 Preliminaries. Let us now identify mathematical conditions under which isomorphism between distal and proximal relations gives rise to representation, which is, in a sense, true to the original. Let 𝒟 be the set of distal objects and 𝒫 the set of internal tokens that participate in the process of representation. Let f: D ⊂ 𝒟 → X and g: P ⊂ 𝒫 → Y be functions defined, respectively, over sets of distal and proximal entities (no restriction needs to be placed at this stage on the number of arguments of f, g). According to the requirement of second-order isomorphism, 𝒟 is isomorphic to 𝒫 under a mapping M if ∀D ⊂ 𝒟, f(D) ∼ g(M(D)), where the relation ∼ is that of set isomorphism, defined over X, Y. To constrain the choice of M, we have to be more specific in defining the functions f, g. In the context of shape perception, it is natural to consider similarity for this purpose (actually, two similarity functions must be defined—one for the external objects and one for the internal tokens standing for those objects).
1 Unfortunately, because in vision research, pictorial representations are traditionally associated with the Aristotelian notion of representation by similarity, or first-order isomorphism.
[Figure 1 sketch: distal shapes G1, G2, C1 are mapped to proximal tokens g1, g2, c1, so that ρd(G1, G2) < ρd(G2, C1) in the distal space is mirrored by ρp(g1, g2) < ρp(g2, c1) in the proximal space.]
Figure 1: Clustering by natural kinds and its representation that fulfills the requirement of second-order isomorphism (here, isomorphism between distance ranks in the two spaces; ρd and ρp are the distal and the proximal distance functions).
Intuitively, we would like the representation to capture the similarity relationships within and between natural kinds (Quine, 1969). More specifically, similarity is seen as relevant to both recognition, in which case resemblance between the viewed shape and some previously seen ones is to be assessed, and classification, where similarities between a shape and a number of shape classes are compared (Nosofsky, 1988; Goldstone, 1994). For a pair of internal representations, similarity is not directly observable but can still be defined operationally as the degree to which an optimal stimulus for one token activates the other one (Shepard & Chipman, 1970). As noted by Nicod (1930) and Quine (1973), similarity should be construed as a triadic relation: Knowing that A is more similar to B than to C is much more informative than merely knowing that A is similar to B and B to C. Furthermore, we may assume that similarity is defined qualitatively and not quantitatively, that is, the range of f, g is X = Y = {>, <}. This assumption limits neither classification (because knowledge of similarity for every triplet of points belonging to a metric space defines unequivocally the clustering of the points) nor recognition (because the location of each point can be recovered from triadic similarities using nonmetric multidimensional scaling; see section 3.1). We thus obtain f: 𝒟 × 𝒟 × 𝒟 → {>, <} and, analogously, g: 𝒫 × 𝒫 × 𝒫 → {>, <}. The condition on M then becomes
∀D1, D2, D3 ∈ 𝒟, f(D1, D2, D3) = g(M(D1), M(D2), M(D3)). In other words, the mapping M is required to preserve similarity ranks.

2.2 Constraints on the Distal Structure. The notion of similarity can be naturally quantified if both the distal shape space and the internal representation space are endowed with metric structure, in which case (dis)similarity in each space can be derived from the appropriate distance function. In the distal space, the existence of a metric for a given shape family (i.e., a subspace of the shape space) depends on the possibility of coming up with at least one parameterization scheme common to all the shapes in the family (the issue of different parameterizations of the same set of shapes is discussed in section 2.3.3). A metric on shapes can then be defined simply as the Euclidean distance in the underlying parameter space R^n. Obviously only a small part of that space is occupied by objects that "look like something"—a combination of parameters picked at random is likely to define a nondescript collection of voxels. It is important to realize, therefore, that the goal behind the introduction of parameters is to formalize the idea of an "objective" similarity for shapes, and not to invent a language in which only some legitimate objects possess valid descriptions. With that qualification, the task of defining an appropriate parameterization becomes quite easy, and can even be automated to a certain extent. The performance of one possible approach to a common parameterization of diverse shapes is illustrated in Figures 2, 3, and 4. According to this approach, the parameterization is computed in a straightforward manner by assigning a dimension to each voxel in a fine subdivision of the bounding box of the object. As a result, it becomes possible to consider various objects as points in a common parameter space and to morph (smoothly deform) one object into another, as shown in Figure 3 (this process corresponds to a movement in the parameter space along the line that connects the points corresponding to the two objects). Although the voxel-based parameter space is high-dimensional, a small number of dimensions may suffice to specify complex and diverse shapes, as illustrated in Figure 4, which examines the utility of principal component analysis on the space of shape parameters. Note that the possibility of a common parameterization of a variety of object shapes provides a substrate for an objective similarity structure of the world, to be reflected internally by the representational process. Thus, it fulfills the first basic prerequisite for representation by second-order isomorphism. The second prerequisite, having to do with the manner whereby the distal similarities are mapped into the representational system, is the topic of the next section.
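The rank-preservation condition of section 2.1 can also be checked empirically, as in the following sketch (the object set and the mapping below are placeholder assumptions; only the triplet test itself follows the definition above):

```python
import numpy as np

def preserves_similarity_ranks(objects, M, n_triplets=1000, seed=0):
    """Test f(D1, D2, D3) = g(M(D1), M(D2), M(D3)) on random triplets,
    with f and g both taken to be Euclidean distance-rank comparisons."""
    rng = np.random.default_rng(seed)
    images = np.array([M(d) for d in objects])
    for _ in range(n_triplets):
        i, j, k = rng.choice(len(objects), size=3, replace=False)
        distal = (np.linalg.norm(objects[i] - objects[j])
                  - np.linalg.norm(objects[j] - objects[k]))
        proximal = (np.linalg.norm(images[i] - images[j])
                    - np.linalg.norm(images[j] - images[k]))
        if np.sign(distal) != np.sign(proximal):
            return False
    return True

# Example: scaling plus a coordinate permutation is a similitude,
# so it preserves distance ranks exactly.
rng = np.random.default_rng(1)
objects = rng.standard_normal((50, 6))
scale_then_permute = lambda d: 3.0 * d[::-1]
assert preserves_similarity_ranks(objects, scale_then_permute)
```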
Figure 2: (Top) Images of several three-dimensional objects. (Middle) The reconstruction of the objects using linear resolution of 10 divisions per side of the bounding box (implying a 10^3-dimensional parameter space, as explained below). (Bottom) Reconstruction under linear resolution of 30 divisions per side of the bounding box (30^3 parameters). Given a linear resolution parameter u, the parameterization program subdivides the bounding box of the object into u × u × u voxels. The vector of occupancy indices for the voxels thus constitutes a geometrical description of the object in terms of u^3 parameters. The resulting parameter space is universal (common to all the objects), as illustrated in Figure 3, where the common parameterization is used to produce morph sequences between different objects.
Figure 3: Successive stages in a morphing (smooth deformation) process linking distinct points in a common shape space. (Top) Morphing a teapot into an apple. (Bottom) Morphing an airplane into a shark. Both examples of morphing used a 30^3-dimensional parameterization of the objects.
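A minimal sketch of this voxel-occupancy parameterization, and of morphing as straight-line motion in the common parameter space (the binary occupancy indices and the point-cloud input format are simplifying assumptions of the sketch):

```python
import numpy as np

def voxel_parameters(points, u=30):
    """Occupancy vector over a u x u x u subdivision of the object's bounding
    box, giving one point in a common u**3-dimensional parameter space."""
    points = np.asarray(points, float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    idx = np.floor((points - lo) / (hi - lo + 1e-9) * u).astype(int)
    occupancy = np.zeros((u, u, u))
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occupancy.ravel()

def morph(p_a, p_b, t):
    """Move along the line connecting two objects in the common space."""
    return (1.0 - t) * p_a + t * p_b
```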
Figure 4: (Top) Images of several three-dimensional objects. (Middle) Images of the same objects, parameterized with 25^3 parameters and rerendered. (Bottom) Images rendered from representations of the objects in a common five-dimensional parameter space, obtained from the high-dimensional voxel-based space using principal component analysis. The weights of the five leading principal components are:

          knight        al  canstick     shell       ape      pawn
  PC 1     −0.19    −12.01     −2.96     19.36     −4.62      0.43
  PC 2      4.60    −16.19      7.65     −7.52      6.00      5.45
  PC 3      6.82      1.00      1.85     −2.37    −13.46      6.15
  PC 4      5.37     −0.53    −12.87     −1.47      3.79      5.72
  PC 5      6.94     −0.32     −0.08      0.08      0.30     −6.91
2.3 Constraints on the Distal-to-Proximal Mapping.

2.3.1 Distance Rank Preservation. As we saw in the beginning of this section, preservation of the ranks of parameter-space distances among objects by the distal-to-proximal mapping suffices for a faithful representation of their similarity structure. The implications of this requirement for the choice of distal-to-proximal mapping depend on whether ranks need to be preserved fully or partially.

Strict rank preservation. The similarity ranking of three distinct parameter vectors x, y, z ∈ R^n can be formalized by the notion of their simple ratio, defined as ⟨x, y, z⟩ = |x − y|/|y − z|. Obviously a mapping S: R^n → R^n preserves distance ranks iff it preserves ⟨x, y, z⟩ for any choice of points. A bijective mapping S with this property must be a similitude, that is, a mapping of the form S(x) = λP(x), where λ > 0, λ ∈ R, and P: R^n → R^n is an orthogonal transformation (Reshetnyak, 1989). Thus, the requirement of rank preservation is quite restrictive in the class of mappings it allows.
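A quick numerical confirmation that a similitude preserves the simple ratio (obtaining the random orthogonal matrix from a QR decomposition is an assumption of this sketch):

```python
import numpy as np

def simple_ratio(x, y, z):
    """The simple ratio <x, y, z> = |x - y| / |y - z|."""
    return np.linalg.norm(x - y) / np.linalg.norm(y - z)

rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
S = lambda v: 2.5 * (P @ v)                       # similitude with lambda = 2.5
x, y, z = rng.standard_normal((3, 4))
assert np.isclose(simple_ratio(x, y, z), simple_ratio(S(x), S(y), S(z)))
```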
Local rank preservation. In one or two dimensions, the rank preservation requirement is satisfied locally by any well-behaved mapping. Specifically, a mapping realized by an analytic function with a nonvanishing Jacobian in a given region is conformal there (Cohn, 1967). Such a function preserves similitude of small triangles; in particular, a scalene triangle formed by a triplet of points will be mapped into a triangle with the same ranking of side lengths (see Figure 1). In higher dimensions, conformality is much more restrictive. As proved by Liouville in 1850, already for n = 3 there are no globally conformal mappings from R^n to itself besides those composed of finitely many inversions with respect to spheres. Such mappings, called Möbius transformations, constitute a finite-dimensional Lie group, which includes the group of motions in R^n and is only slightly broader than that group (Reshetnyak, 1989).

Local approximate rank preservation. A considerably broader class of mappings emerges if the requirement of conformality is replaced by that of quasiconformality. Intuitively, a regular topological mapping is quasiconformal if there exists a constant q, 1 ≤ q ≤ ∞, such that almost any infinitesimally small sphere is transformed into an ellipsoid for which the ratio of the largest semiaxis to the smallest one does not exceed q (Reshetnyak, 1989). Under such a mapping, the ranks of distances between points are preserved approximately, on a small scale (Väisälä, 1992, p. 124).

2.3.2 The Implications of Locality. How relevant is local approximate preservation of distance ranks, offered by a quasiconformal distal-to-proximal mapping, to the representation of real-world shapes? An answer to this question is suggested by the realization that distance ranks need not be preserved globally across the entire shape space. The limits to the extent of distance rank preservation expected of a perceptual system are suggested by the hierarchical structure of perceived categories, revealed in numerous studies in cognitive science. Within this hierarchy, there exists a certain level, called basic level, which is the most salient according to a variety of criteria (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). Taking as an example the hierarchy quadruped, dog, terrier, the basic level is that of dog. Objects whose recognition implies less detailed distinctions than those required for basic-level categorization are said to belong to a superordinate level. A reasonable assumption is that faithful representation of similarities is required within superordinate categories (e.g., within quadruped: giraffe to camel, leopard, horse) and, of course, within basic categories, but not between superordinate categories. In other words, the informativeness of the statement that a giraffe is more similar to a banana than to a beetle is questionable. At the same time, one expects a giraffe to be represented as more similar to other members of the quadruped category than to either a banana or a beetle.2

2 Similarity across superordinate category borders may make sense if the comparison concentrates on a few isolated dimensions, which differ from one case to another. For example, giraffes may be found similar to "tall" things (a palm tree, a lamp post) or to "spotted" things (a leopard, a camouflage net). Typically, similarity in those cases is markedly intransitive (a palm tree is not similar to a leopard). See Tversky (1977).
2.3.3 The Components of the Distal-to-Proximal Mapping M. A mapping implemented by a perceptual system can be described generically as M = dimens ◦ measure ◦ imag ◦ geom, where the first two components are dictated by the properties of the world and the other two constitute part of the system:

Geometry. geom(p): The shape of the object, as a function of the distal parameter-space description p.
Imaging. imag(p; z): The image on the receptor surface, as a function of the shape parameters p and the viewing conditions z.
Measurements. measure(p; z): The set of internal measurements performed on the image.
Dimensionality reduction. dimens(p): The low-dimensional representation, independent of z.

Let us now examine the constraints imposed on the mapping M by the requirement of second-order isomorphism between similarity patterns in its domain and its range. Note that some of the components of M are not under the control of the perceptual system while others are. The components determined by the system's environment are likely to have a detrimental effect on the proper representation of similarity, for example, by introducing extraneous sources of variability into the representations. Each of the components is therefore worth considering separately, to determine whether the controllable components can compensate for the effects of the uncontrollable ones.

Consider first the effect of the geometry mapping, geom. Although a perceptual system has no control over this mapping, no such control is necessary: Parameterizations that are "unnatural" in the sense that they encode the geometry of objects in a nonsmooth manner will be excluded from possible recovery by perceptual systems based on the proposed principle, but there will always be plenty of other, "natural" parameterizations that will be amenable to veridical representation. Let P be the set of all parameterizations related to a given one p0 by some conformal mapping T. The set P is an equivalence class (Väisälä, 1971); moreover, because the composition T ◦ M is conformal if M is, veridical representation of some p ∈ P is equivalent to the representation of any other p ∈ P. Now a conformal mapping M will give rise to a proper (i.e., second-order isomorphic) representation of object clustering under all parameterizations belonging to some class P_x. The nature of that class will depend on the nature of the mapping (which can emphasize some distances among objects at the expense of others, with or without altering the distance ranks).
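The four-stage decomposition can be read as a function pipeline; in the sketch below all four stages are placeholder stubs (identity maps) named after the components defined above, with the viewing conditions z folded into the imaging stub:

```python
def compose(*fs):
    """Right-to-left composition: compose(d, m, i, g)(p) = d(m(i(g(p))))."""
    def composed(p):
        for f in reversed(fs):
            p = f(p)
        return p
    return composed

# Placeholder stages; only `measure` and `dimens` are under the
# perceptual system's control.
geom    = lambda p: p            # distal parameters -> object shape
imag    = lambda shape: shape    # shape (under viewing conditions z) -> image
measure = lambda image: image    # image -> high-dimensional measurements
dimens  = lambda m: m            # measurements -> low-dimensional representation

M = compose(dimens, measure, imag, geom)
```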
A system that is a product of natural selection is expected to have evolved a mapping more suitable for the representation of those aspects of its habitat most important for its survival and behavior. Thus, while veridical representation is possible, it is also possible that two perceptual systems implementing different mappings will have incompatible (or even conflicting) pictures of the world. The next component of M—the viewing mapping, imag—introduces a dependence on variables z that are extraneous to the shape parameters that are to be represented. These variables encode the pose of the object with respect to the observer, to the light sources, and to the other objects in the scene. Their influence must be counteracted by the perceptual system, through the combined action of measurement and dimensionality reduction, dimens ◦ measure, to reduce the likelihood of two nearby parameter-space points (i.e., two similar shapes) being mapped into widely disparate points in the final representation space. Note that absolute invariance with respect to these variables is not necessary; it is required only that shape-space changes influence the measurements more strongly than view-space changes (more on this in section 3.2). Furthermore, not all the dimensions of z have to be treated by the same mechanism: Image-plane translation can be compensated for by a covert shift of attention (Anderson & Van Essen, 1987) or an overt one (such as a saccadic eye movement), variation in apparent size by global scaling using a hard-wired mechanism (Schwartz, 1985), and rotation in depth by learning an appropriate normalizing mapping specific for each object class (Poggio & Edelman, 1990). In addition to the requirement that dimens ◦ measure counteract the effects of the view-related variables, one may note that the preservation of distance ranks implies that any change in the distal parameter space must be reflected in the final low-dimensional representation (otherwise, if some of the original dimensions collapse under the representation, distances between points are likely to be distorted). To ensure that as many as possible of the original dimensions of variation among the distal objects are preserved, it is worthwhile to make as many varied measurements as possible. This makes the measurement space (defined by measure) high-dimensional and necessitates subsequent dimensionality reduction (through the action of dimens; the existence of a low-dimensional parameterization of the distal shape space constitutes another computational justification for dimensionality reduction). In a flexible system, dimensionality reduction would have to involve learning to find informative dimensions, depending on the statistics of the input and (if available) of additional knowledge provided by the environment (see, e.g., Intrator, 1993). To summarize, the message of this section is not mere reiteration of the importance of invariance or dimensionality reduction. Rather, it is the possibility of achieving formally veridical representation, provided that those two problems are addressed with the proper computational goal in mind.
According to the proposed theory of representation, that goal is preservation of parametrically defined similarity by the distal-to-proximal mapping. The next section presents empirical evidence in support of the proposed theory. Because our emphasis is on brevity and variety, we make only brief mention of selected results from psychophysics, computational modeling, and neurophysiology of object representation.

3 Empirical Support

3.1 Psychophysics. In his discussion of cognitive representation, Palmer (1978) notes that "in order to test the two models [first- and second-order isomorphism], one must have access to the mapping function from the outside world" (p. 293). Such access can be realized by combining parametric shape generation with a psychological technique known as multidimensional scaling (MDS) (Kruskal & Wish, 1978; Shepard, 1980). Nonmetric MDS is an algorithm for embedding a set of points in a metric space while preserving as closely as possible the ranking of pairwise distances between the points, specified in advance (in psychophysical applications, this ranking is derived from the subject's response data). A number of recent works have explored the veridicality of parametric shape representation using MDS (Edelman, 1995a; Cutzu & Edelman, 1996; Sugihara, Edelman, & Tanaka, 1996). In each of a series of experiments that involved pairwise similarity judgment and delayed matching to sample, subjects were confronted with several classes of computer-rendered three-dimensional animal-like shapes, arranged in a complex pattern in a common high-dimensional parameter space (see Figure 5, left). Response time and error rate data were combined into a measure of subjective shape similarity, and the resulting proximity matrix was submitted to nonmetric MDS. In the resulting solution, the relative geometrical arrangement of the points corresponding to the different objects invariably reflected the complex low-dimensional structure in parameter space that defined the relationships between the stimuli classes (see Figure 5, middle).
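A sketch of this analysis pipeline using off-the-shelf tools (assuming scikit-learn's nonmetric MDS and SciPy's Procrustes routine; the nine-point configuration and its dissimilarity matrix below are synthetic stand-ins for subject data):

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
true_config = rng.standard_normal((9, 2))   # stand-in parameter-space layout
dissim = np.linalg.norm(true_config[:, None] - true_config[None, :], axis=-1)

# Nonmetric MDS embeds the points so as to preserve, as closely as possible,
# the ranking of the pairwise dissimilarities.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
recovered = mds.fit_transform(dissim)

# Procrustes-align (translate, rotate, scale) the MDS solution to the true
# configuration and report the residual misfit, as in Figure 5.
_, _, disparity = procrustes(true_config, recovered)
```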
3.2 Computation.

3.2.1 From Shepard's Law of Generalization to Veridical Representation of Similarities. The results of the psychophysical experiments described above indicate that subjects internalize a veridical replica of the parameter-space configuration formed by the stimuli, which in turn can be recovered from the response data. Interestingly, the same mechanism that has been invoked in the past as a model of simple generalization with respect to a reference stimulus—a classifier whose response falls off exponentially with the distance between its preferred stimulus and the current input (Shepard, 1987; Shepard & Kannappan, 1993)—can also serve as a basis for modeling veridical representation of the shape space in humans, as explained below.

A monotonic falloff of the likelihood of generalization with increasing distance between the test and the reference stimuli has been observed in a wide range of experiments, spanning visual and other modalities. Shepard (1987) surveyed empirical data on generalization gradients and showed that they can be described by a decaying exponential, P(x; x0) = e^{−d(x, x0)/σ}, where P(x; x0) is the probability of generalizing to stimulus x the response previously made to x0, and d(x, x0) is the normalized distance in psychological (representation) space. Shepard (1987) also showed that this exponential generalization law can be derived from first principles and is relatively insensitive to the form of the distance function entering the computation.

3.2.2 A Multiple-Classifier Model. Now consider a bank of k classifiers, each tuned to a particular point in the distal space (its optimal stimulus). Together these classifiers implement a mapping M from the distal shape space to the proximal representation space, R^k. To fulfill the requirements for veridical representation discussed in section 2, the response of each classifier must be approximately constant over the range of different viewing conditions z and must fall off gradually and monotonically with parameter-space distance from its stimulus, as in Shepard's law of generalization.
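In code, Shepard's exponential law and the induced distal-to-proximal mapping of the classifier bank might look as follows (the Euclidean distance and the shared σ are simplifying assumptions of the sketch):

```python
import numpy as np

def shepard_response(x, x0, sigma=1.0):
    """Generalization gradient P(x; x0) = exp(-d(x, x0) / sigma)."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(x0)) / sigma)

def proximal_representation(x, references, sigma=1.0):
    """A bank of k tuned classifiers maps a distal stimulus into R^k."""
    return np.array([shepard_response(x, r, sigma) for r in references])
```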
Figure 5: (Left) The parameter-space configuration used for generating the stimuli in one of the experiments described in Cutzu and Edelman (1996). (Middle) The two-dimensional MDS solution for the subject data. Symbols: ◦: true configuration; ×: configuration derived by MDS from the subject data, then Procrustes transformed (that is, translated, rotated, and scaled; see Borg & Lingoes, 1987) to fit the true one. Lines connect corresponding points. The residual (Procrustes) distance between the MDS-derived configuration and the true one was 0.66 (expected random value, estimated by bootstrap from the data: 3.14 ± 0.15, mean and standard deviation; 100 permutations of the point order were used in the bootstrap computation). (Right) The two-dimensional MDS solution for the radial basis function (RBF) ensemble model (see section 3.2); Procrustes distance: 1.11 (expected random value: 3.14±0.17). Simulations showed that the recovery of the low-dimensional structure from image-space distances between the stimuli (as opposed to their representation in the RBF ensemble space) was impossible. See Cutzu and Edelman (1996) for details.
Such a system realizes a map M that is smooth and regular and can therefore serve as a substrate for veridical representation of the original parameter space (any diffeomorphism restricted to a compact subset of its domain is quasiconformal; Zorich, 1992, p. 133). A model of this type fully replicated the psychophysical results mentioned in the preceding section. Specifically, the MDS-derived configurations obtained from the performance data of the model showed significant resemblance to the true parameter-space configurations (see Figure 5, right). The model consisted of a number of classifiers (three or four, depending on the experiment), each incorporating a nearly viewpoint-invariant representation of a reference object, formed by training a radial basis function (RBF) network to interpolate a characteristic function for that object (Poggio & Edelman, 1990). Each RBF network was trained to output a constant value for a selection of views of the reference object; the views were encoded by the activities of an underlying receptive field layer. As a result, the classifier responded nearly equally to different views of the reference object it had been trained on and progressively less to objects that differed from the reference one in the shape space (see Figure 6). At the RBF ensemble level, the similarity between two stimuli was defined as the cosine of the angle between the vectors of outputs they evoked in the RBF modules.

3.2.3 Extension to a Diverse Collection of Objects. The faithful representation of interobject similarities acquires a somewhat different meaning if objects that do not belong to a homogeneous class are considered. In that case, it is not reasonable to expect that a system reflect any "proper" pattern of similarities inherent in the stimuli, simply because no such pattern exists for a collection of objects that is diverse enough (cf. section 2.3.2). Note, however, that the requirement of discounting the effects of the extraneous variables (associated with viewpoint, illumination, etc.) still stands. According to this requirement, a representational system must place views of the same object closer together in its internal shape space than views of different objects. Figure 7 illustrates the ability of the multiple-classifier model to do just that. As a control, we first employed MDS to arrange four views of each of three objects (a human figure, a chess piece, and a candlestick) in two dimensions, according to their distances in the space spanned by a collection of receptive fields that covered the images. The left panel of Figure 7 shows that views in the resulting configuration were arranged by orientation of the object, not by object identity. As in the model of the psychophysical findings already described, the correct clustering (by object identity) was obtained when the distances submitted to MDS had been computed in the space of activities of three RBF classifiers, each trained on other views of one of the objects (see Figure 7, right).
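A minimal sketch of one such module and of the ensemble-level similarity (the gaussian basis functions, the ridge term, and plain least-squares interpolation are assumptions of the sketch, not a reconstruction of the authors' training procedure):

```python
import numpy as np

def gaussian_rbf(x, c, width):
    return np.exp(-np.sum((np.asarray(x) - c) ** 2) / (2.0 * width ** 2))

class RBFModule:
    """One classifier: interpolates a characteristic function (output 1.0)
    over encoded training views of a single reference object."""
    def __init__(self, views, width=1.0):
        self.centers = np.asarray(views, float)
        self.width = width
        G = np.array([[gaussian_rbf(v, c, width) for c in self.centers]
                      for v in self.centers])
        # Solve for weights so the module outputs 1.0 on each training view;
        # the small ridge term keeps the system well conditioned.
        self.w = np.linalg.solve(G + 1e-6 * np.eye(len(G)), np.ones(len(G)))

    def __call__(self, x):
        acts = np.array([gaussian_rbf(x, c, self.width) for c in self.centers])
        return float(self.w @ acts)

def ensemble_similarity(a, b):
    """Cosine of the angle between two ensemble response vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```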
Figure 6: The response of a radial basis function module, trained on 10 random views of a parametrically defined object, to stimuli differing from a reference view of that object (marked by the large circle), in three ways: (1) by progressive view change, marked by ◦; (2) by progressive shape change, marked by ×; (3) by combined shape and view change, marked by ∗. The points along each curve have been sorted by pixel-space distance between the test and the reference stimuli (shown along the abscissa). Points are means over 10 repetitions with different random view-space and shape-space directions of change; a typical error bar (± standard error of the mean) is shown for each curve. Note the insensitivity of the module's output to view-space changes, relative to shape-space changes.
3.2.4 More on the Multiple-Classifier Approach. In addition to its ability to replicate human performance in the perception of shape similarities, the multiple-classifier model described above possesses a number of other useful properties (Edelman, 1995b; Edelman, Cutzu, & Duvdevani-Bar, 1996). First, it is related to a powerful method of dimensionality reduction (Bourgain, 1985; Linial, London, & Rabinovich, 1994), in which points belonging to a multidimensional space are embedded into a space of much lower dimensionality, while preserving to a large extent the original interpoint distances. The importance of this relationship stems from the need for drastic dimensionality reduction that arises in biological visual systems, where the dimensionality of intermediate representation spaces (at the level of the optic nerve, for example) may be counted in the hundreds of thousands.
Figure 7: (Left) The arrangement in two dimensions of different views of three objects, derived by MDS from distances computed in the space of activities of receptive fields covering the images (recall that the MDS procedure places views that are similar under the given metric near each other). Two hundred receptive fields, intended to serve as a crude model of the primate primary visual area, were used in this simulation. Note that the views are clustered by the orientation of the object. (Right) The arrangement of the same views according to their distances in the space spanned by the outputs of three RBF classifiers, each trained on one of the objects (see sections 3.2 and 3.2.3). Note the clustering by object identity.
Second, representing a stimulus by the responses of classifiers, each of which is nonlinear (e.g., exponential, as in Shepard's law) in the measurement-space distance between the stimulus and some reference object, introduces features polynomially related to the original measurement variables.3 This may simplify subsequent classification by nonlinearly enriching the representation space and making the possibility of linear separation between classes more likely (Cortes & Vapnik, 1995). Third, responses of a number of classifiers can serve as a substrate for carrying out classification at superordinate, basic, or subordinate levels of categorization, depending on the manner in which these responses are processed. For example, summing the responses of classifiers tuned to members of a certain class gives a measure of the affinity of the stimulus to that class (Nosofsky, 1988), while considering the responses separately can allow fine individual-level distinction between stimuli. Fourth, if the saliency of individual classifiers in distinguishing between various stimuli is kept track of and is taken into consideration depending on the task at hand, then similarity between stimuli in the representation space can be made asymmetrical and nontransitive, in accordance with Tversky's general contrast model of similarity (Tversky, 1977).
3 Think of the Taylor expansion of e^{−‖x − x0‖} around x0, where x is the stimulus and x0 is a reference object.
It is interesting to compare the multiple-classifier approach to object representation to other recent proposals that avoid geometric reconstruction of individual objects. A typical example of the latter is representation by view interpolation. Because different views of a three-dimensional object are known to lie on a smooth low-dimensional manifold embedded in the space of all views of all objects (Ullman & Basri, 1991), it is possible to employ standard function approximation techniques such as RBFs to determine whether an image belongs to a given object, the latter being represented merely by a set of its views. Indeed, this technique is incorporated into each classifier module in our multiple-classifier system. Note, however, that interpolation is useless if the object is novel. In fact, an interpolation module cannot be trained to produce the “right output” for novel objects because such a thing is undefined. In contrast, the multiple-classifier system can still provide useful information in such a case (by signaling, e.g., the category affiliation of the stimulus rather than its identity, as suggested in the previous paragraph). In a sense, this is made possible by mimicking, on a higher level, the behavior of systems that carry out multiple feature measurements over the input image and then classify it according to the histograms computed for the different features (Swain & Ballard, 1991; Mel, 1996); in our case, the measurements are similarities to entire objects. We conjecture that a combination of the two approaches with top-down processes capable of anchoring the representation activated by a stimulus in the retinal and the world coordinate frames would lead to the development of more comprehensive theories of vision.
3.3 Neurophysiology. The approach to representation based on a smooth distal-to-proximal mapping and its implementation by the bank of classifiers, outlined in the preceding section, lead to explicit predictions regarding the processing of objects at the higher levels of the primate visual system. Specifically, one expects to find there units responding preferentially to certain objects, with the response falling off monotonically with dissimilarity between the stimulus and the preferred object, while staying nearly constant over different views of the preferred object (cf. Figure 6). Although first reports of cells in the monkey inferotemporal cortex that respond preferentially to faces date back more than two decades (Gross, Rocha-Miranda, & Bender, 1972), cells tuned to general objects have been found only recently. In particular, Tanaka and his group reported the desired selectivity for specific (mostly two-dimensional) objects in recordings from the inferotemporal cortex (Fujita, Tanaka, Ito, & Cheng, 1992; Kobatake & Tanaka, 1994; Tanaka, 1992, 1993). More recently, Logothetis, Pauls, and Poggio (1995) reported recordings from cells tuned to specific views of three-dimensional objects on which the monkey had been trained. A small proportion of the object-tuned cells they found responded each to a limited subset of the objects, irrespective of view.
None of the above experiments involved parametric manipulation of the stimulus shape—a crucial component in testing the predictions of the theory of representation proposed here. In another study, where such manipulation was attempted, the stimuli were complex, parametrically defined, periodic two-dimensional patterns (Sakai, Naya, & Miyashita, 1994). In that study, the response of cells was found to decrease monotonically with parameter-space distance between the test stimulus and the preferred pattern to which the cells were tuned. With parametrically controlled three-dimensional stimuli, it should be possible to look for cells that behave similarly to the RBF module whose behavior is illustrated in Figure 6. Such a cell would respond equally to different views of its preferred object, but its response would decrease with parameter-space distance from the point corresponding to the shape of the preferred object. Furthermore, the responses of a number of such cells, each tuned to a different reference object, should carry sufficient information for classifying novel stimuli of the same general category. Finally, if the pattern of stimuli has a simple low-dimensional characterization in some underlying parameter space (as in Figure 5, left), it should be recoverable from the ensemble response of a number of cells, using multidimensional scaling.
4 Conclusions

A lawful correspondence between proximities of objects in an internal representation space and their distal similarities offers a principled solution to the Problem of Representation by showing that the internal space can mirror the visual world in a mathematically precise and behaviorally significant sense. This view of representation, derived from R. N. Shepard's (1968) idea of second-order isomorphism between the representation space and the world, possesses a number of advantages over the simplistic alternative ("pictures in the head"). It also constitutes a plausible model of representation in primate vision, as indicated by a series of recent psychophysical results. We have seen that faithful representation by second-order isomorphism can be supported by a system of classifiers whose graded-profile overlapping "receptive fields" are tuned to specific object classes in a common parametrically defined shape space. In sensory physiology, the notion of functional units selectively and reliably responsive to external stimuli (units whose output covaries with the sensory input in a well-defined and predictable manner) has found wide acceptance in the form of the feature detector doctrine (Barlow, 1979). A better understanding of the formal capabilities of a system of feature detectors in the context of shape vision and of its performance as a psychological model should help this doctrine evolve into a full-fledged theory of representation.
Acknowledgments

Thanks to A. Aertsen, G. Cottrell, F. Cutzu, N. Intrator, D. Mumford, T. Poggio, R. Shepard, and S. Ullman for useful discussions and suggestions and to the two anonymous reviewers for constructive criticism. The experiments described in Figure 5 were carried out by F. Cutzu. K. Grill-Spector helped with the development of the shape-morphing procedures. S.E. holds the Sir Charles Clore Career Development Chair at the Weizmann Institute of Science.

References

Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Science, 84, 6297–6301.
Barlow, H. B. (1979). The past, present and future of feature detectors. In D. Albrecht (Ed.), Recognition of Pattern and Form (pp. 4–32). Berlin: Springer.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. Berlin: Springer-Verlag.
Bourgain, J. (1985). On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52, 46–52.
Cohn, H. (1967). Conformal mappings on Riemann surfaces. New York: McGraw-Hill.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Cummins, R. (1989). Meaning and mental representation. Cambridge, MA: MIT Press.
Cummins, R. (1996). Representations, targets, and attitudes. Cambridge, MA: MIT Press.
Cutzu, F., & Edelman, S. (1996). Faithful representation of similarities among three-dimensional shapes in human vision. Proc. Natl. Acad. Sci., 93, 12046–12050.
Edelman, S. (1995a). Representation of similarity in 3D object discrimination. Neural Computation, 7, 407–422.
Edelman, S. (1995b). Representation, similarity, and the chorus of prototypes. Minds and Machines, 5, 45–68.
Edelman, S., Cutzu, F., & Duvdevani-Bar, S. (1996). Similarity to reference shapes as a basis for shape representation. In G. Cottrell (Ed.), Proceedings of COGSCI'96, San Diego, CA (pp. 260–265).
Fujita, I., Tanaka, K., Ito, M., & Cheng, K. (1992). Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360, 343–346.
Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.
Goldstone, R. L. (1994). The role of similarity in categorization: Providing a groundwork. Cognition, 52, 125–157.
Gross, C. G., Rocha-Miranda, C. E., & Bender, D. B. (1972). Visual properties of cells in inferotemporal cortex of the macaque. J. Neurophysiol., 35, 96–111.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge, MA: MIT Press.
Intrator, N. (1993). Combining exploratory projection pursuit and projection pursuit regression. Neural Computation, 5, 443–455.
Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71, 2269–2280.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage.
Linial, N., London, E., & Rabinovich, Y. (1994). The geometry of graphs and some of its algorithmic applications. FOCS, 35, 577–591.
Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape recognition in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
Marr, D. (1982). Vision. San Francisco: Freeman.
Mel, B. (1996). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition (Tech. Rep.). Los Angeles: University of Southern California.
Nicod, J. (1930). Foundations of geometry and induction. London: Routledge & Kegan Paul.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory and Cognition, 14, 700–708.
Palmer, S. E. (1978). Fundamental aspects of cognitive representation. In E. Rosch & B. B. Lloyd (Eds.), Cognition and Categorization (pp. 259–303). Hillsdale, NJ: Erlbaum.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263–266.
Quine, W. V. O. (1969). Natural kinds. In W. V. O. Quine (Ed.), Ontological Relativity and Other Essays (pp. 114–138). New York: Columbia University Press.
Quine, W. V. O. (1973). The roots of reference. La Salle, IL: Open Court.
Reshetnyak, Y. G. (1989). Space mappings with bounded distortion. Providence, RI: American Mathematical Society.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Sakai, K., Naya, Y., & Miyashita, Y. (1994). Neuronal tuning and associative mechanisms in form representation. Learning and Memory, 1, 83–105.
Schwartz, E. L. (1985). Local and global functional architecture in primate striate cortex: Outline of a spatial mapping doctrine for perception. In D. Rose & V. G. Dobson (Eds.), Models of the Visual Cortex (pp. 146–157). New York: Wiley.
Shepard, R. N. (1968). Cognitive psychology: A review of the book by U. Neisser. Amer. J. Psychol., 81, 285–289.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210, 390–397.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Shepard, R. N., & Chipman, S. (1970). Second-order isomorphism of internal representations: Shapes of states. Cognitive Psychology, 1, 1–17.
Shepard, R. N., & Kannappan, S. (1993). Connectionist implementation of a theory of generalization. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 665–672). San Mateo, CA: Morgan Kaufmann.
Sugihara, T., Edelman, S., & Tanaka, K. (1996). Representation of objective similarity among 3D shapes in the monkey. Invest. Ophthalm. Vis. Sci. Suppl. (Proc. ARVO).
Suppes, P., Pavel, M., & Falmagne, J. (1994). Representations and models in psychology. Ann. Rev. Psychol., 45, 517–544.
Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision, 7, 11–32.
Tanaka, K. (1992). Inferotemporal cortex and higher visual functions. Current Opinion in Neurobiology, 2, 502–505.
Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262, 685–688.
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.
Ullman, S., & Basri, R. (1991). Recognition by linear combinations of models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 992–1005.
Väisälä, J. (1971). Lectures on n-dimensional quasiconformal mappings. Berlin: Springer-Verlag.
Väisälä, J. (1992). Domains and maps. In M. Vuorinen (Ed.), Quasiconformal Space Mappings (pp. 119–131). Berlin: Springer-Verlag.
Zorich, V. A. (1992). The global homeomorphism theorem for space quasiconformal mappings. In M. Vuorinen (Ed.), Quasiconformal Space Mappings (pp. 132–148). Berlin: Springer-Verlag.
Received October 6, 1995; accepted June 18, 1996.
ARTICLE
Communicated by David Mumford
Dynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual Cortex

Rajesh P. N. Rao
Dana H. Ballard
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226, USA
The responses of visual cortical neurons during fixation tasks can be significantly modulated by stimuli from beyond the classical receptive field. Modulatory effects in neural responses have also been recently reported in a task where a monkey freely views a natural scene. In this article, we describe a hierarchical network model of visual recognition that explains these experimental observations by using a form of the extended Kalman filter as given by the minimum description length (MDL) principle. The model dynamically combines input-driven bottom-up signals with expectation-driven top-down signals to predict current recognition state. Synaptic weights in the model are adapted in a Hebbian manner according to a learning rule also derived from the MDL principle. The resulting prediction-learning scheme can be viewed as implementing a form of the expectation-maximization (EM) algorithm. The architecture of the model posits an active computational role for the reciprocal connections between adjoining visual cortical areas in determining neural response properties. In particular, the model demonstrates the possible role of feedback from higher cortical areas in mediating neurophysiological effects due to stimuli from beyond the classical receptive field. Simulations of the model are provided that help explain the experimental observations regarding neural responses in both free viewing and fixating conditions.

1 Introduction

Much of the information regarding the response properties of cells in the primate visual cortex has been obtained by displaying visual stimuli within a cell's receptive field while a monkey fixates on a blank screen and observing, via microelectrode recordings, the neuronal responses elicited by the displayed stimuli (Hubel & Wiesel, 1968; Zeki, 1976; Baizer, Robinson, & Dow, 1977; Andersen, Essick, & Siegel, 1985; Maunsell & Newsome, 1987).
Note: A preliminary account of this work appeared as Rao and Ballard (1995b).

Neural Computation 9, 721–763 (1997)
© 1997 Massachusetts Institute of Technology
The initial assumption was that the responses thus recorded are not substantially different from those generated in a more natural setting, but later experiments showed that the responses of many visual cortical neurons can be significantly modulated by stimuli from beyond the classical receptive field (Nelson & Frost, 1978; Zeki, 1983; Allman, Miezin, & McGuinness, 1985; Desimone & Schein, 1987; Guylas, Orban, Duysens, & Maes, 1987; Muller, Singer, Krauskopf, & Lennie, 1996). Modulatory effects in neural responses have also been recently reported by Gallant, Connor, and Van Essen (1994), Gallant, Connor, Drury, and Van Essen (1995), and Gallant (1996) in a task where a monkey freely views a natural scene. In these experiments, responses from the visual areas V1, V2, and V4 were recorded while a monkey freely viewed natural images. The image subregions that fell within a cell's receptive field during free viewing were then extracted from the images and flashed within the cell's receptive field while the monkey fixated on a blank screen. Surprisingly, the responses obtained when the stimuli were flashed during the fixation task were found to be generally much larger than when the same stimuli entered the receptive field during free viewing.

In this article, we describe a hierarchical network model of dynamics and synaptic learning in the visual cortex that provides explanations for these experimental observations by exploiting two fundamental properties of the cortex: (1) the reciprocity of connections between any two connected areas (Rockland & Pandya, 1979; Van Essen, 1985; Felleman & Van Essen, 1991), and (2) the convergence of inputs from lower-area neurons to a higher-area neuron, resulting in larger receptive field sizes for neurons in the higher area (Van Essen & Maunsell, 1983; Desimone & Ungerleider, 1989). The reciprocity of connections allows feedback pathways to convey top-down reconstructed signals that serve as predictions for lower-level modules, while the feedforward pathways convey the residuals between current recognition state and the top-down predictions from the higher level. The larger receptive field size of higher-level neurons allows them to integrate information over a larger spatial extent, which in turn permits them to influence, via feedback pathways, lower-level units operating at a smaller spatial scale.

The model acknowledges the fact that vision is an active, dynamic process subserving the larger cognitive goals of the organism that the visual system is embedded in. In particular, the dynamics of a cortical module at a given hierarchical level is shown, using the minimum description length (MDL) principle (Rissanen 1989; Zemel, 1994), to assume the form of an extended Kalman filter (Kalman, 1960; Kalman & Bucy, 1961; Maybeck, 1979), which optimally estimates current recognition state by combining information from input-driven bottom-up signals and expectation-driven top-down signals. The dynamics also allows a nonlinear prediction step that facilitates the processing of time-varying stimuli and helps to counteract the signal propagation delays introduced by the different hierarchical levels of the network. Prediction is mediated by a set of units whose weights can be adapted in order to minimize the prediction error. The transformation
of signals from lower, less abstract areas to higher, more abstract areas is achieved via synaptic weights in the feedforward pathway, while the transformation back is achieved via the feedback weights, both of which are learned during exposure to natural stimuli. This learning is mediated by a Hebbian learning rule also derived from the MDL principle. The weights are shown to approach the transposes of each other, which allows the fast dynamics of the Kalman filter to be efficiently implemented. The combined prediction-learning scheme can be viewed as implementing a form of the expectation-maximization (EM) algorithm (Baum, Petrie, Soules, & Weiss, 1970; Dempster, Laird, & Rubin, 1977).

We first verify the hypothesis that reciprocal connections between different hierarchical levels can mediate robust visual recognition by assessing the model's performance on a small training set of realistic objects. The model is shown to recognize objects from its training set even in the presence of partial occlusions and to maintain a fine discrimination ability between highly similar visual stimuli. By exposing the network to natural images, we show that the receptive fields developed at the lowest two hierarchical levels are comparable to those of cells at the corresponding levels in the primate visual cortex. We then describe simulations of the model in both free viewing and fixating conditions and show that the simulated neural responses are qualitatively similar to those observed in the neurophysiological experiments of Gallant et al. (1994, 1995) and Gallant (1996). We conclude by describing related work on cortical modeling, Kalman filters, and learning-parameter estimation techniques (including the relationship to the EM algorithm) and discuss the use of the model as a computational substrate for understanding a number of well-known psychophysical and physiological phenomena such as illusory contours, bistable percepts, amodal completions, visual imagery, end stopping, and other effects from beyond the classical receptive field.

2 A Simple Input Encoding Example

We begin by examining a simple input encoding problem. In particular, consider the problem of encoding a collection of inputs $I_1, I_2, \ldots$ using a single-layer recurrent network (see Figure 1) with linear activation units, where the goal is to minimize the reconstruction error. In other words, for a given n × 1 input vector I, we would like to minimize the sum of componentwise squared errors as given by

$J(U, r) = (I - Ur)^T (I - Ur)$,   (2.1)
where U denotes the n × k feedback matrix, the rows of which correspond to the synaptic weights of units in the feedback pathway, r denotes the current k × 1 activation (or internal state) vector, and T denotes the transpose operator. The vector Ur corresponds to the network's reconstruction $I'$ of the input I.
[Figure 1 appears here, showing the input image I, feedforward weights W, state vector r, and feedback weights U.]

Figure 1: A simple input encoder network. The goal is to optimally encode inputs I by finding best estimates $\hat{U}$, $\hat{W}$, and $\hat{r}$ for the network parameters U, W, and r. Optimality is defined in terms of least squared reconstruction error, where the reconstruction of the input is given by $I' = Ur$.
Note that the optimization function J is modulated by two sets of parameters, U and r. We can thus minimize J by performing gradient descent with respect to both of these parameters. For a given value of U, we can obtain the optimal estimate $\hat{r}$ of the state vector r according to

$\dot{\hat{r}} = -\frac{k_1}{2} \frac{\partial J}{\partial \hat{r}} = k_1 U^T (I - U\hat{r})$,   (2.2)
where $k_1 > 0$ determines the adaptation rate for $\hat{r}$. Equation 2.2 can be equivalently expressed in its discrete form as

$\hat{r}(t+1) = \hat{r}(t) + k_1 U^T (I - U\hat{r}(t))$.   (2.3)
Similarly, for a given value of r, we can obtain the optimal estimate $\hat{U}$ of the generative weight matrix U according to

$\dot{\hat{U}} = -\frac{k_2}{2} \frac{\partial J}{\partial U} = k_2 (I - \hat{U}r)\,r^T$,   (2.4)
where $k_2 > 0$ determines the learning rate. Equivalently,

$\hat{U}(t+1) = \hat{U}(t) + k_2 (I - \hat{U}(t)r)\,r^T$.   (2.5)
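As a concrete illustration, here is a minimal NumPy sketch of the alternating scheme in equations 2.3 and 2.5: for each input, the state estimate is relaxed to a stable value, and the generative weights are then updated. The dimensions, learning rates, and random data are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# Minimal sketch of the input encoder of section 2 (equations 2.3 and 2.5).
n, k = 64, 10          # input size, number of state units (illustrative)
k1, k2 = 0.1, 0.01     # adaptation rate for r, learning rate for U (illustrative)
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n, k))   # feedback (generative) weights

def encode(I, U, n_steps=50):
    """Relax r-hat to minimize ||I - U r||^2 for fixed U (equation 2.3)."""
    r = np.zeros(k)
    for _ in range(n_steps):
        r = r + k1 * U.T @ (I - U @ r)
    return r

for I in (rng.standard_normal(n) for _ in range(100)):
    r = encode(I, U)                     # let r-hat stabilize first
    U = U + k2 * np.outer(I - U @ r, r)  # then update U (equation 2.5)
```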
It is interesting to note that update rule 2.5 can be regarded as a form of Hebbian learning, with presynaptic activity r and postsynaptic activity $(I - \hat{U}r)$. The difference in the latter term can be computed, for instance, via inhibitory feedback. Typically, for a given input I during learning, $r = \hat{r}$ is allowed to converge to a stabilized value before the weights $\hat{U}$ are updated. These new values for $\hat{U}$ are subsequently used for computing the state $\hat{r}$ for the next input. The resulting scheme can thus be regarded as implementing an online form of the well-known EM algorithm (Dempster, Laird, & Rubin, 1977) (see section 9.5 for a discussion). Note that for appropriately small positive values of $k_1$, the quadratic form of J ensures convergence of $\hat{r}$ to the global minimum of J for a given input I.

An important feature of the above estimation scheme is that the dynamics in equation 2.3 can be readily implemented in the network of Figure 1 if we make the feedforward weight matrix $W = U^T$. By using a learning rule for U and $W^T$ identical to equation 2.5 but with a linear weight decay term, it can be shown that $\hat{W}$ converges approximately to $\hat{U}^T$ even when initialized to arbitrary random values (see, for example, Williams, 1985).

A number of variants of the estimation scheme sketched above have previously appeared in the literature in various guises. For example, Daugman (1988) proposes a gradient descent rule for estimating the coefficients of a fixed basis set of nonorthogonal Gabor filters (see also Pece, 1992). Daugman's scheme can be seen to be essentially identical to equation 2.2, where the rows of $U^T$ comprise the Gabor filter set. On the other hand, rather than fixing the basis set of filters to be Gabor functions, one can attempt to learn the basis set for some fixed state vector r. In particular, if we let $r = U^T I$ (the feedforward response assuming $W = U^T$), equation 2.4 reduces to Oja's (1989) subspace network learning algorithm and Williams's (1985) symmetric error-correction learning rule, both of which were shown to generate orthogonal weight vectors that span the principal subspace (or dominant eigenvector subspace) of the input distribution. Thus, the purely feedforward form of the network performs an operation equivalent to principal component analysis (PCA) (Chatfield & Collins, 1980). More recently, learning schemes similar to the one we presented above have been proposed independently by Olshausen and Field (1996) and Harpur and Prager (1996). As above, these schemes combine the dynamics of the state vector r with the learning of the weights U. By adding an activity penalty (derived from a "sparseness function") to the dynamics of the state vector r, Olshausen and Field (1996) demonstrate the development of nonorthogonal and localized basis filters from natural images.

In the following sections, we show that the above framework can be regarded as a special case of a more general estimation and learning scheme that has its roots in stochastic modeling and Kalman filter theory. In particular, using a hierarchical stochastic model of the image generation process, we formulate an optimization function based on the MDL principle. In the MDL formulation, the activity penalty of Olshausen and Field (1996) arises as a direct consequence of the assumed model prior. As we show in
the following sections, the resulting learning-estimation scheme allows the modeling of the visual cortex as a hierarchical Kalman predictor.

3 The Architecture of the Model

The model is a hierarchically organized neural network wherein each intermediate level of the hierarchy receives bottom-up information from the preceding level and top-down information from a higher level (see Figure 2). At each level, the output signals of spatially adjacent modules, each processing its own local image patch, are combined and fed as input to the next higher level, whose outputs are in turn combined with those of its neighbors and fed into yet another higher level, until the entire visual field has been accounted for. As a result, the receptive fields of units become progressively larger as one ascends the hierarchical network, in a manner similar to that observed in the occipitotemporal pathway (V1 → V2 → V4 → IT) of the primate visual system (Van Essen & Maunsell, 1983; Desimone & Ungerleider, 1989). At the same time, feedback pathways allow top-down influences to modulate the output of noisy lower-level modules. From a computational perspective, such an arrangement allows a hierarchy of abstract internal representations to be learned while simultaneously endowing the system with properties essential for dynamic recognition, such as the ability to form pattern completions during occlusions. From a neuroanatomical perspective, such an architecture falls well within the known complexity of cortico-cortical and intracortical connections found in the primate visual cortex (Rockland & Pandya, 1979; Gilbert & Wiesel, 1981; Lund, 1981; Jones, 1981; Douglas, Martin, & Whitteridge, 1989; Felleman & Van Essen, 1991).

At the lowest level, the input image is processed in parallel by a large number of identical network modules that tessellate the visual field. For a given module, the local image patch can be considered an n-dimensional vector I, which is linearly filtered by a set of k modifiable filters as represented by the rows of a k × n feedforward matrix W of the synaptic weights of units. The synaptic weights comprising W are initialized to small random values and adapted in response to input stimuli according to the synaptic learning rules derived below. The image representation I in our model roughly corresponds to the representation at the lateral geniculate nucleus (LGN), while the filters defined by the rows of weight matrix W perform an operation similar to that carried out by simple cells in layer IV of V1. The filtered image is defined by a response vector y given by

$y = WI$.   (3.1)
The response vector y is fed into k (possibly nonlinear) "state" units, which compute running estimates $\hat{r}$ of the current internal state and predict the state $\bar{r}$ for the next time instant using their synaptic weights V (see Figure 2). There also exist n backprojection or feedback units with an associated n × k matrix
[Figure 2 appears here, showing levels 0, 1, and 2 of the hierarchy with feedforward weights W, feedback weights U, prediction weights V, and the bottom-up and top-down residual signals; see the caption below.]
Figure 2: The architecture of the model. The feedforward pathways for the first two levels are shown in the top half of the figure, while the bottom half represents the feedback pathways. Each feedforward weight matrix W is approximately the transpose of its corresponding feedback matrix U. The feedback pathways carry top-down reconstructed signals $r^{td}$ that serve as predictions for the lower-level modules. The feedforward pathways carry residuals between the current state and the top-down predictions, and these residuals are filtered through the feedforward weight matrix W at the next level. The current state prediction $\bar{r}(t)$ is computed from the previous state estimate $\hat{r}(t-1)$ using the prediction weights V and the activation function f. The current state estimate $\hat{r}(t)$ at each level is continually updated by combining the top-down and bottom-up residuals with the current state prediction according to the extended Kalman filter update equations derived in the text. The figure illustrates the architecture for the simple case of four level 1 modules feeding into a level 2 module. However, this arrangement can be easily generalized in a recursive manner to the case of an arbitrary number of lower-level modules feeding into a higher-level module, whose outputs are in turn combined with those of its neighbors and fed into yet another higher-level module, and so on, until the entire visual field has been covered.
of "generative" weights U, which allow reconstruction of the input at the lower, less abstract level from the current internal state at the higher, more abstract level. Their output is given by

$I^{td} = Ur$.   (3.2)
In addition to bottom-up input I, each level in the hierarchical network (except the highest level) also receives a top-down signal $r^{td}$ from a higher-level module. The signal $r^{td}$ acts as a prediction of the lower-level module's current state vector r, as estimated by a higher-level module after taking into account, for example, information from a larger spatial neighborhood in conjunction with the past history of inputs. Although the above equations were defined for the first hierarchical level, where the bottom-up input consisted of the image I(t), the same analysis can be applied to all levels of the network, where the bottom-up input is simply
the vector obtained by combining the output vectors of several neighboring lower-level modules, as illustrated in Figure 2 for the second level.

Given the above architectural setting, we are now left with the problem of prescribing the dynamics for the internal state vectors $\hat{r}$ and $\bar{r}$, as well as appropriate learning rules for the synaptic weights U, V, and W. In the previous section, we achieved our objective by simply minimizing the reconstruction error. As we shall see, a more general statistical approach is to model the inputs to the system as being stochastically generated by a hierarchical image generation process that can be characterized as follows. First, we assume that images are being generated by a linear imaging model of the form

$I(t) = U(t)\,r(t) + n_{bu}(t)$,   (3.3)
where U is a generative matrix (corresponding to the generative feedback weights in the network) and r is a hidden state vector. We assume the mean of the bottom-up noise process is $E[n_{bu}(t)] = 0$. The corresponding noise covariance matrix $\Sigma_{bu}$ at time t is given by

$E[n_{bu}(t)\,n_{bu}(s)^T] = \Sigma_{bu}(t)\,\delta(t, s)$,   (3.4)
where δ is the Kronecker delta function, equaling 1 if t = s and 0 otherwise. We model the difference between the top-down prediction $r^{td}$ from a higher level and the actual state vector r at the current level by another stochastic noise process $n_{td}$:

$r(t) = r^{td}(t) + n_{td}(t)$,   (3.5)
with $E[n_{td}(t)] = 0$. The top-down noise covariance matrix $\Sigma_{td}$ at time t is given by

$E[n_{td}(t)\,n_{td}(s)^T] = \Sigma_{td}(t)\,\delta(t, s)$.   (3.6)
Given the current state r(t), the transition to the state r(t + 1) at the next time instant is modeled as

$r(t+1) = f(V(t)\,r(t)) + n(t)$,   (3.7)
where f is a (possibly nonlinear) vector-valued activation function, V is the state transition matrix (corresponding to the "prediction" weights in the network), and n is a stochastic noise process with mean and covariance

$E[n(t)] = \bar{n}(t)$,   (3.8)

$E[(n(t) - \bar{n}(t))(n(s) - \bar{n}(s))^T] = \Sigma(t)\,\delta(t, s)$.   (3.9)
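For intuition, the generation process of equations 3.3 through 3.9 can be run forward to produce synthetic data. The sketch below assumes zero-mean gaussian noise, a tanh activation, and illustrative sizes; none of these specific choices is taken from the article.

```python
import numpy as np

# Sketch of the hierarchical image generation process (equations 3.3-3.9),
# run forward to produce synthetic data. All sizes and noise scales are
# illustrative assumptions.
n, k = 64, 10
rng = np.random.default_rng(1)
U = 0.1 * rng.standard_normal((n, k))   # generative weights (equation 3.3)
V = 0.9 * np.eye(k)                     # state transition weights (equation 3.7)
f = np.tanh                             # activation function

r = rng.standard_normal(k)
for t in range(100):
    I = U @ r + 0.05 * rng.standard_normal(n)     # equation 3.3: image + n_bu
    r = f(V @ r) + 0.05 * rng.standard_normal(k)  # equation 3.7: next state + n
```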
For the derivations below, we assume that the three stochastic noise processes introduced above are uncorrelated over time with each other and with the initial state vector. It is convenient to view the n × k generative weight matrix U and the k × k prediction weight matrix V as an nk × 1 vector u and a $k^2 \times 1$ vector v, respectively, where the vectors are obtained by collapsing the rows of the matrix into a single vector; for example,

$u = \begin{pmatrix} U_1^T \\ U_2^T \\ \vdots \\ U_n^T \end{pmatrix}$,   (3.10)

where $U_i$ denotes the ith row of U. We characterize the evolution of the different unknown weights in the hierarchical imaging model above by the following dynamic equations:

$u(t+1) = u(t) + n_u(t)$,   (3.11)
$v(t+1) = v(t) + n_v(t)$,   (3.12)
where $n_u$ and $n_v$ are stochastic noise processes with means and covariances given by

$E[n_u(t)] = \bar{n}_u(t)$,   (3.13)

$E[n_v(t)] = \bar{n}_v(t)$,   (3.14)

$E[(n_u(t) - \bar{n}_u(t))(n_u(s) - \bar{n}_u(s))^T] = \Sigma_u(t)\,\delta(t, s)$,   (3.15)

$E[(n_v(t) - \bar{n}_v(t))(n_v(s) - \bar{n}_v(s))^T] = \Sigma_v(t)\,\delta(t, s)$.   (3.16)
In summary, equations 3.3 through 3.16 reflect our assumed hierarchical model of how visual inputs are being stochastically generated. The goal of a given cortical module is thus to estimate, as closely as possible, its counterpart in the input generation model as characterized by the system of equations above.¹ It does so by minimizing a cost function based on the MDL principle.

Footnote 1: By convention, each estimate that the cortical module maintains will be differentiated from its counterpart in the input generation model by the occurrence of the hat symbol. For example, the mean of the current cortical estimate of the actual state vector r will be denoted by $\hat{r}$.

4 The MDL-Based Optimization Function

In order to prescribe the dynamics of the network and derive the learning rules for modifying the synaptic weight matrices, we make use of the
MDL principle (Rissanen, 1989; Zemel, 1994), a formal information-theoretic version of the well-known Occam's Razor principle commonly attributed to William of Ockham (thirteenth–fourteenth century A.D.) (Li & Vitanyi, 1993). Briefly, the goal is to encode input data in a way that balances the cost of encoding the data given the use of a model with the cost of specifying the model itself. The underlying motivation behind such an approach is to fit the model accurately to the input data while at the same time avoiding overfitting and allowing generalization.

Suppose that we have already computed a prediction $\bar{r}$ of the current state r based on prior data. In particular, let $\bar{r}$ be the mean of the current state vector before measurement of the input data at the current time instant. The corresponding covariance matrix is given by

$E[(r - \bar{r})(r - \bar{r})^T] = M$.   (4.1)
Similarly, let $\bar{u}$ and $\bar{v}$ be estimates of the current weights u and v calculated from prior data, with

$E[(u - \bar{u})(u - \bar{u})^T] = S$,   (4.2)

$E[(v - \bar{v})(v - \bar{v})^T] = T$.   (4.3)
Given a description language L, data D, and model parameters M, the MDL principle advocates minimizing the following cost function (Zemel, 1994):

$J(M, D) = |L(M, D)| = |L(D|M)| + |L(M)|$.   (4.4)
Here $|\cdot|$ denotes the length of the description. In our case, D consists of the current input image I(t) and top-down input $r^{td}(t)$, and M consists of the parameters U, V, and r that can be modulated.² Thus,

$|L(D|M)| = |L(I - Ur)| + |L(r - r^{td})|$.   (4.5)

Footnote 2: The feedforward weight matrix W does not appear among the model terms, since this matrix can be constrained to be the transpose of the feedback weight matrix U as a result of the synaptic learning rules described in section 6.
The two terms on the right-hand side of equation 4.5 are simply the bottom-up and top-down reconstruction errors. The cost of the model is given by

$|L(M)| = |L(r)| + |L(u)| + |L(v)|$.   (4.6)
In a Bayesian framework, equation 4.6 can be regarded as taking into account the prior model parameter distributions, while equation 4.5 represents the
negative log likelihood of the data given the model parameters. Minimizing the complete cost function J is thus equivalent to maximizing the posterior probability of the model M given data D. Given the true probability distribution (over discrete events) of the various terms in the above equations, the expected length of the optimal code for each term is given by Shannon's optimal coding theorem (Shannon, 1948):

$|L(x)| = -\log P(X = x)$,   (4.7)
where P(X = x) denotes the probability of the discrete event x. Since the true distributions are unknown, we appeal to the central limit theorem (Feller, 1968) and use multivariate gaussians as the prior distributions for coding the various terms above. However, encoding from a continuous distribution requires the calculation of the probability mass of a particular small interval of values around a given value for use in equation 4.7 (Nowlan & Hinton, 1992; Zemel, 1994). Using a trapezoidal approximation, we may estimate the mass under a continuous (in our case, gaussian) density p in an interval of width w around a value x to be $P(X = x) \cong p(x)w$. For encoding the errors (see equation 4.5), we assume w to be a constant infinitesimal width, which yields (using equation 4.7 and ignoring the constant terms due to the coefficients of the multivariate gaussians)

$|L(D|M)| = (I - Ur)^T \Sigma_{bu}^{-1} (I - Ur) + (r - r^{td})^T \Sigma_{td}^{-1} (r - r^{td})$.   (4.8)
For encoding the model terms (see equation 4.6), a constant infinitesimal width w may be inappropriate, since some values of the parameters may need to be encoded more accurately than others. For example, the network may be more sensitive to small changes in some parameter values than in others (Nowlan & Hinton, 1992). One solution, suggested by Nowlan and Hinton (1992), is to make w inversely proportional to the curvature of the cost function surface, but this may be computationally very expensive. A simpler alternative, which favors small values for the parameters given the choice between small and large values, is to let w be a zero-mean radially symmetric gaussian for each of the model terms u, v, and r, with variances inversely proportional to α, β, and γ, respectively. Using equation 4.7, the model cost then reduces to (ignoring the constant terms)

$|L(M)| = (r - \bar{r})^T M^{-1} (r - \bar{r}) + (u - \bar{u})^T S^{-1} (u - \bar{u}) + (v - \bar{v})^T T^{-1} (v - \bar{v}) + \gamma r^T r + \alpha u^T u + \beta v^T v$.   (4.9)
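To make the cost function concrete, the sketch below evaluates $J = |L(D|M)| + |L(M)|$ from equations 4.4, 4.8, and 4.9, assuming all quantities are supplied as NumPy arrays; the function name and argument layout are illustrative, not from the article.

```python
import numpy as np

# Sketch of the combined MDL cost (equations 4.4, 4.8, 4.9).
def mdl_cost(I, r, r_td, U, u, v, r_bar, u_bar, v_bar,
             Sigma_bu_inv, Sigma_td_inv, M_inv, S_inv, T_inv,
             alpha, beta, gamma):
    e_bu = I - U @ r                  # bottom-up reconstruction error
    e_td = r - r_td                   # top-down prediction error
    data_cost = e_bu @ Sigma_bu_inv @ e_bu + e_td @ Sigma_td_inv @ e_td  # eq 4.8
    model_cost = ((r - r_bar) @ M_inv @ (r - r_bar)
                  + (u - u_bar) @ S_inv @ (u - u_bar)
                  + (v - v_bar) @ T_inv @ (v - v_bar)
                  + gamma * r @ r + alpha * u @ u + beta * v @ v)        # eq 4.9
    return data_cost + model_cost                                        # eq 4.4
```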
The first three terms in equation 4.9 arise from the prior gaussian densities for r, u, and v, as given by $G(\bar{r}, M)$, $G(\bar{u}, S)$, and $G(\bar{v}, T)$, while the latter three terms are derived from the zero-mean gaussians associated with w. Note that equation 4.9 implements an intuitively appealing trade-off between achieving a good fit with respect to prior predictions ($\bar{r}$, $\bar{u}$, and $\bar{v}$) and
simultaneously penalizing large values for the parameters (activities r and weights u and v). One may also view the latter three terms as regularizers (Poggio, Torre, & Koch, 1985) that help prevent overfitting of data, thereby increasing the potential for generalization.³

Footnote 3: The variance-related regularization parameters α, β, and γ may be chosen according to a variety of techniques (see Girosi, Jones, & Poggio, 1995, for references), but for the experiments in this article, we used appropriately small but arbitrary fixed values for these parameters.

5 Network Dynamics and Optimal Estimation of Recognition State

The derivation of the estimation algorithm for recognition state is analogous to that for the discrete Kalman filter as given in standard texts such as Bryson and Ho (1975) (the continuous case follows in a straightforward fashion by applying a few limits to the difference equations in the discrete case; Bryson & Ho, 1975). Given new input I and top-down prediction $r^{td}$, the optimal estimate of the current state r after measurement of the new data is a stochastic code whose mean $\hat{r}$ and covariance P can be obtained by setting

$\frac{\partial J(M, D)}{\partial r} = 0$,   (5.1)
which implies

$M^{-1}(\hat{r} - \bar{r}) - U^T \Sigma_{bu}^{-1}(I - U\hat{r}) - \Sigma_{td}^{-1}(r^{td} - \hat{r}) + \gamma\hat{r} = 0$.   (5.2)
After some algebraic manipulation, equation 5.2 yields the following update rule for the mean of the optimal stochastic code at time t:

$\hat{r}(t) = \bar{r}(t) + P\,U^T \Sigma_{bu}^{-1} (I - U\bar{r}(t)) + P\,\Sigma_{td}^{-1} (r^{td} - \bar{r}(t)) - \gamma P\,\bar{r}(t)$,   (5.3)
where the prediction $\bar{r}$ is given by

$\bar{r}(t+1) = f(V(t)\,\hat{r}(t)) + \bar{n}(t)$.   (5.4)
It follows that the covariance matrix P at time t is given by the update rule

$P(t) = (M^{-1}(t) + U^T \Sigma_{bu}^{-1}(t)\,U + \Sigma_{td}^{-1}(t) + \gamma I)^{-1}$,   (5.5)
where I is the k × k identity matrix. The prediction error covariance matrix M is updated according to

$M(t+1) = \left(\frac{\partial f}{\partial r}\right) P(t) \left(\frac{\partial f}{\partial r}\right)^T + \Sigma(t)$,   (5.6)
with the partial derivative being evaluated at $r = \hat{r}(t)$. The partial derivatives arise from a first-order Taylor series approximation to the activation function f around $\hat{r}(t)$. The above set of equations can be regarded as implementing a form of the extended Kalman filter (Maybeck, 1979).

During implementation of the algorithm, the covariance matrices $\Sigma$, $\Sigma_{bu}$, and $\Sigma_{td}$ can be approximated by matrices $\hat{\Sigma}$, $\hat{\Sigma}_{bu}$, and $\hat{\Sigma}_{td}$ estimated from sample data (see Rao (1996) for further details). Further, by appealing to a version of the EM algorithm (see section 9.5), we may use $U = \bar{U}$ and $V = \bar{V}$ in the equations above, where $\bar{U}$ and $\bar{V}$ are learned estimates whose update rules are described in section 6 and the appendix, respectively. For neural implementation, the update equations above can be simplified considerably by noting that the basis filters that form the rows of the feedforward matrix W ($\cong \bar{U}^T$; see the learning rule below) also approximately decorrelate their input, thereby effectively diagonalizing the noise covariance matrices (see also Pentland, 1992, for related ideas). This allows the recursive update rules to be implemented locally in an efficient manner.

Equation 5.3 forms the heart of the dynamic state estimation algorithm. Note that equation 2.3, which was derived via gradient descent in section 2, is actually a special case of equation 5.3. More importantly, equation 5.3 implements an efficient trade-off between information from three different sources: the system prediction $\bar{r}(t)$, the top-down prediction $r^{td}$, and the bottom-up data I. Intuitively, the bottom-up and top-down gain matrices $P\,U^T \hat{\Sigma}_{bu}^{-1}$ and $P\,\hat{\Sigma}_{td}^{-1}$ can be interpreted as signal-to-noise ratios. Thus, when the bottom-up noise variance $\hat{\Sigma}_{bu}$ is high (for instance, due to occlusions), the bottom-up term $(I - U\bar{r}(t))$ is given less weight in the state estimation step (due to a lower gain matrix), and the estimate $\hat{r}(t)$ relies more on the top-down term $(r^{td} - \bar{r}(t))$ and the system prediction $\bar{r}(t)$. On the other hand, when the top-down noise variance $\hat{\Sigma}_{td}$ is high (for instance, due to ambiguity in interpretation by the higher-level modules), the estimate $\hat{r}(t)$ relies more heavily on the bottom-up term $(I - U\bar{r}(t))$ and the system prediction $\bar{r}(t)$. Finally, if the system prediction $\bar{r}(t)$ has a large noise variance, the matrix P assumes larger values, which implies that the system relies more heavily on the top-down and bottom-up input data rather than on its noisy prediction $\bar{r}(t)$. The dynamics of the network thus strives to achieve a delicate balance between the current prediction and the inputs from various cortical sources by exploiting the signal-to-noise characteristics of the corresponding input channels. Note that by keeping track of the degree of correlations between units at any given level, the covariance matrices also dictate the degree of lateral interactions between the units, as determined by equation 5.3.

In summary, the optimal estimate of the current state before measurement of current data is a stochastic code with mean $\bar{r}$ and covariance M. After measurement and update as given by equations 5.3 and 5.5, the mean and covariance become $\hat{r}$ and P, respectively. The mean $\hat{r}$ is updated by filtering the current residuals or "innovations" (Maybeck, 1979) $(I - U\bar{r})$ and
$(r^{td} - \bar{r})$ using $P\,U^T \hat{\Sigma}_{bu}^{-1}$ and $P\,\hat{\Sigma}_{td}^{-1}$, respectively. Note that this requires the existence of the matrix $U^T$ for filtering the bottom-up residual $(I - U\bar{r})$ that is being communicated from the lower level. This computation can be readily implemented if the feedforward weights W happen to be symmetric with the feedback weights U, that is, $W = U^T$. Therefore, we need to address the question of how such feedforward and feedback weights can be developed.
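The update equations 5.3 through 5.6 can be collected into a single estimation step. The sketch below assumes a linear activation function f(x) = Vx (so the Jacobian in equation 5.6 is simply V) and treats all covariances as given; it is a schematic rendering of the filter, not the authors' implementation.

```python
import numpy as np

# Sketch of one step of the state estimator (equations 5.3-5.6) for a single
# module, under the simplifying assumption of a linear activation function.
def kalman_step(I, r_td, r_bar, M, U, V,
                Sigma_bu_inv, Sigma_td_inv, Sigma, gamma):
    k = r_bar.size
    # Equation 5.5: posterior covariance after measuring I and r_td.
    P = np.linalg.inv(np.linalg.inv(M) + U.T @ Sigma_bu_inv @ U
                      + Sigma_td_inv + gamma * np.eye(k))
    # Equation 5.3: combine the prediction with bottom-up and top-down residuals.
    r_hat = (r_bar + P @ U.T @ Sigma_bu_inv @ (I - U @ r_bar)
             + P @ Sigma_td_inv @ (r_td - r_bar) - gamma * P @ r_bar)
    # Equation 5.4 (linear f): predict the next state.
    r_bar_next = V @ r_hat
    # Equation 5.6 (linear f, so the Jacobian df/dr is just V).
    M_next = V @ P @ V.T + Sigma
    return r_hat, P, r_bar_next, M_next
```

Iterating `kalman_step` over a stream of inputs alternates measurement updates (equations 5.3 and 5.5) with prediction updates (equations 5.4 and 5.6), the standard predict-correct cycle of a Kalman filter.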
6 Learning the Feedforward and Feedback Synaptic Weights

The previous section described one method of minimizing the optimization function J(M, D): updating the estimate of the current state r, as given by $\hat{r}$ and P, in response to the input data. However, J can be further minimized by additionally adapting the weights u and v, and the feedforward weights w as well. Here, we derive the learning rules for u and w. A possible learning procedure for the prediction weights v is described in the appendix. Note that in order to achieve stability in the estimation of the state r, the weights need to be adapted at a much slower rate than the dynamic process that is estimating r.

We first derive the weight update rule for estimating the backprojection (generative) weights u. Notice that

$(I - Ur) = (I - Ru)$,   (6.1)
where R is the n × nk matrix given by

$R = \begin{pmatrix} r^T & 0 & \cdots & 0 \\ 0 & r^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & r^T \end{pmatrix}$.   (6.2)
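In NumPy, R can be constructed with a Kronecker product, and the identity of equation 6.1 checked directly; the sizes below are arbitrary.

```python
import numpy as np

# The block-diagonal matrix R of equation 6.2, built as kron(I_n, r^T).
n, k = 4, 3                       # illustrative sizes
r = np.arange(1.0, k + 1)         # example state vector
R = np.kron(np.eye(n), r)         # shape (n, n*k): r^T blocks on the diagonal

U = np.random.default_rng(2).standard_normal((n, k))
u = U.ravel()                     # rows of U stacked, as in equation 3.10
assert np.allclose(U @ r, R @ u)  # the identity Ur = Ru behind equation 6.1
```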
Differentiating J with respect to the backprojection weights u and setting the result to 0,

$\frac{\partial J(M, D)}{\partial u} = 0$,   (6.3)
we obtain, after some algebraic manipulation, the following update rule for the mean of the optimal stochastic weight vector at time t (note that we use $r = \bar{r}(t)$ as prescribed by a version of the EM algorithm; see section 9.5):

$\hat{u}(t) = \bar{u}(t) + P_u\,\bar{R}(t)^T \Sigma_{bu}^{-1} (I - \bar{R}(t)\bar{u}(t)) - \alpha P_u\,\bar{u}(t)$,   (6.4)
where $\bar{u}(t+1) = \hat{u}(t) + \bar{n}_u(t)$ and $\bar{R}(t)$ is the matrix obtained by replacing r with $\bar{r}(t)$ in the definition of R above. The covariance matrices are updated
according to

$P_u(t) = (S^{-1}(t) + \bar{R}(t)^T \Sigma_{bu}^{-1}(t)\,\bar{R}(t) + \alpha I)^{-1}$,   (6.5)

$S(t+1) = P_u(t) + \Sigma_u(t)$,   (6.6)
where I is the nk × nk identity matrix. Note that equation 6.4 is simply a Hebbian learning rule with decay, the presynaptic activity being the current responses $\bar{R}(t)$ and the postsynaptic activity being the residual $(I - \bar{R}(t)\bar{u}(t))$ (see the feedback pathway in Figure 2). It is relatively easy to see that equation 2.5, which we derived via gradient descent in section 2, is in fact a special case of equation 6.4 above. Thus, as we mentioned in section 2, for the case of feedforward processing, where $r = WI$ and $U = W^T$, this learning rule (without the covariance matrices) becomes identical to Williams's (1985) symmetric error-correction learning rule and Oja's (1989) subspace network learning algorithm, both of which perform an operation equivalent to PCA (Chatfield & Collins, 1980). The more general form above, which additionally incorporates dynamics for the state vector, allows the development of nonorthogonal and local representations as well.

Another interesting feature of the above learning rule is the transition from $\hat{u}(t)$ to $\bar{u}(t+1)$. This step allows the modeling of intrinsic neuronal noise via the term $\bar{n}_u(t)$. By assuming a distribution for this noise term (for example, a zero-mean gaussian distribution) and sampling from this distribution, the transition from $\hat{u}(t)$ to $\bar{u}(t+1)$ can be made stochastic. This helps the learning algorithm avoid local minima and facilitates the search for a global minimum. The weight decay term $-\alpha P_u \bar{u}(t)$ (due to the model term |L(u)| in the MDL optimization function) penalizes overfitting of the data and helps increase the potential for generalization.

Having specified a learning rule for the feedback weights, we need to formulate one for the feedforward weights such that the dynamics expressed by equation 5.3 is satisfied. In other words, the feedforward weight matrix W should be updated so as to approximate the transpose of the feedback matrix U. One solution is to assume that the feedforward weights are initialized to the same random values as the transpose of the feedback weights and then to apply the rule given in equation 6.4. However, it is highly unlikely that biological neurons can have their synapses initialized in such a convenient manner. A more reasonable assumption is that the feedforward weights are initialized independently of the feedback weights. Let w represent the vectorized form of the matrix $W^T$, analogous to equation 3.10. Consider the following weight modification rule for the feedforward weights at time t:

$\hat{w}(t) = \bar{w}(t) + P_u\,\bar{R}(t)^T \Sigma_{bu}^{-1} (I - \bar{R}(t)\bar{u}(t)) - \alpha P_u\,\bar{w}(t)$,   (6.7)
where we use $\bar{w}(t+1) = \hat{w}(t) + \bar{n}_w(t)$, with $\bar{n}_w(t)$ a noise term analogous to $\bar{n}_u(t)$. This is again a form of Hebbian learning (with decay), the presynaptic activity this time being
the residual image and the postsynaptic activity being the current $\bar{r}$ (see the feedforward pathway in Figure 2). It is relatively easy to show that the above rule causes the feedforward matrix to approach the transpose of the feedback matrix, that is, $W \cong U^T$, even when they start from different initial conditions. Define a difference vector d as

$d = \hat{u}(t) - \hat{w}(t)$.   (6.8)
Then, from equations 6.4 and 6.7,

$d = (\bar{u}(t) - \bar{w}(t)) - \alpha P_u (\bar{u}(t) - \bar{w}(t))$.   (6.9)
Asymptotically, as $\bar{u}(t) \to \hat{u}(t)$ and $\bar{w}(t) \to \hat{w}(t)$, $(\bar{u}(t) - \bar{w}(t)) \to d$, and equation 6.9 reduces to

$-\alpha P_u\,d \cong 0$.   (6.10)
Thus, given that α > 0 and $P_u$ is positive definite, the expected value of d approaches zero asymptotically, which implies $u \cong w$. In other words (since w is $W^T$ by definition), $W \cong U^T$, which satisfies the condition needed for the dynamics of the network (see equation 5.3) to converge to the optimal response vector for a given input.

7 Experimental Results

Before presenting simulation results for the free viewing and fixation experiments, we verify the viability of the model by describing results from two related experiments.

7.1 Experiment 1: Learning and Recognizing Simple Objects. The first experiment was designed to test the model's performance in a simple recognition task. A hierarchical network as shown in Figure 2 was used with four level 1 modules, each processing its local image patch, feeding into a single level 2 module. The four level 1 feedforward modules each contained 10 units (rows of W), and the second-level module contained 25. The constants α and γ that determine the variances of the MDL model terms were set at 0.02 and 0.1, respectively. The feedforward and feedback matrices were trained by exposing the network to gray-scale images of five objects as shown in Figure 3A. Each input image was of size 128 × 128 and was split into four equal subimages of size 64 × 64 that were fed to the four level 1 modules. The average gray-level values were subtracted from each image and the resulting image normalized to unit length before being input to the network. For each input training image, the entire network was allowed to stabilize before the weights were updated. Since only static images were
used, we used $\bar{r}(t+1) = \hat{r}(t)$. Note that even without a nonlinear prediction step, for an arbitrary set of possibly nonorthogonal weight vectors, the optimal response of the network at a given hierarchical level remains a highly nonlinear function of the input image that cannot be computed by a single feedforward pass but rather must be recursively estimated by using equation 5.3 until stabilization is achieved (cf. Daugman, 1988). Figures 3B through E illustrate the response of the trained network to various input images.

7.2 Experiment 2: Learning Receptive Fields from Natural Images. The previous section demonstrated the model's ability to learn internal representations for various man-made objects and subsequently use these representations for robust recognition. In this section, we ask: What internal representations does the network develop if it is exposed to a sequence of arbitrary natural images? This question becomes especially relevant if the network is to be used as a model of visual cortical processing during, for example, free viewing of natural images; the internal representations (such as the weights W) developed by the model will then need to be at least qualitatively similar to corresponding representations in the cortex that have been reported in the literature.

We once again employed a hierarchical network as shown in Figure 2, with four level 1 modules feeding into a single level 2 module. The four level 1 feedforward modules each contained 20 units (rows of W), and the level 2 module contained 25 rows. Eight gray-scale images of natural outdoor scenes were used for training the network. However, unlike the previous experiment, the receptive fields of the four level 1 modules were allowed to overlap, as shown at the right of Figure 4, in order to model qualitatively the overlapping organization of receptive fields of neighboring cells in the cortex (Hubel, 1988). Thus, for each natural image, a 20 × 20 image patch was chosen, and four overlapping 16 × 16 subimages, offset by four pixels horizontally and/or vertically from each other, were fed as input to the lower-level modules. Each of the subimages was preprocessed by subtracting the average gray-level value from each pixel to approximate the removal of DC bias at the level of the LGN. The resulting subimage was windowed by a 16 × 16 gaussian in order to simulate a gaussian input dendritic field. A total of 12,114 image patches were used for training the hierarchical network. For each image patch, the network was allowed to converge to the different optimal response vectors $\hat{r}$ for the different modules at each level
before the corresponding weights ($W = \bar{U}^T$) were updated. The variance-dependent constants α and γ were set to 0.02 and 0.1, respectively, and the temporal prediction step was simply $\bar{r}(t+1) = \hat{r}(t)$, as in the previous section. Figure 4 shows the feedforward synaptic weights W for the level 1 (equivalent to V1) and level 2 (V2) units that were learned using equation 6.7.
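The patch-extraction and windowing procedure just described can be sketched as follows. The gaussian width is an assumption (the article specifies only a 16 × 16 gaussian window), as are the function names.

```python
import numpy as np

# Sketch of the preprocessing for experiment 2: four overlapping 16x16
# subimages from a 20x20 patch, mean-subtracted and gaussian-windowed.
def gaussian_window(size=16, sigma=4.0):   # sigma is an assumed value
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return np.outer(g, g)

def preprocess(patch20):
    """Split a 20x20 patch into four overlapping 16x16 inputs (offset 4 px)."""
    window = gaussian_window()
    inputs = []
    for dy in (0, 4):
        for dx in (0, 4):
            sub = patch20[dy:dy + 16, dx:dx + 16].astype(float)
            sub = sub - sub.mean()         # approximate DC removal at the LGN
            inputs.append((sub * window).ravel())
    return inputs                           # four vectors for the level 1 modules
```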
[Figure 3 appears here. For each of panels A through E, the columns show the input to the network, the top-down prediction, and the residual.]
Figure 3: Results from experiment 1: Recognition of simple objects. (A) The five objects used for training the hierarchical network. The first image additionally shows how a given image was partitioned into four local subimages that were fed to the four corresponding level 1 modules. (B) through (E) illustrate the response of the trained network to various input images. (B) When a training image is input, the network predicts an almost perfect reconstructed image, resulting in a residual that is almost everywhere zero, which indicates correct recognition. (C) If a partially occluded object from the training set is input, the unoccluded portions of the image together contribute, via the level 2 module, to predict and fill in the missing portions of the input. (D) When the network is presented with an object that is highly similar to a trained object, the prediction is that of the closest resembling object (the car in the training set). However, the large residual allows the network to judge the input as a new object rather than classifying it as the training object and incurring a false-positive error, a common occurrence in most purely feedforward systems. (E) shows that a completely novel object results in a prediction image that is an arbitrary mixture of the training images and as such, generates large residuals. The system can then choose to learn the new object using equations 6.4 and 6.7, or ignore the object if it happens to be behaviorally irrelevant.
Figure 4: Results from experiment 2: Learning receptive fields from natural images. (A) Three of the eight images used in training the weights. The boxes on the right show the size of the first- and second-level receptive fields, respectively, on the same scale as the images. Four overlapping gaussian windowed image patches extracted from the natural image were fed to the level 1 modules, shown at the extreme left in Figure 2. The responses from the four level 1 modules were in turn combined in a single vector and fed to the level 2 module. The effective level 2 receptive field thus encompasses the image region spanned by the four overlapping level 1 receptive fields as shown on the right in (C). (B) Ten of the 20 receptive fields (or spatial filters) as given by the rows of W in a level 1 module. These resemble classical oriented edge-bar detectors, which have been previously modeled as difference of offset gaussians or Gabor functions. (C) Six of 20 receptive fields as given by the rows of W at level 2.
The level 1 receptive fields resemble nonorthogonal wavelet-like edge-bar detectors at different orientations, similar to the receptive fields of simple cells in V1 (Hubel & Wiesel, 1962, 1968; Palmer, Jones, & Stepnoski, 1991; see also Olshausen & Field, 1996; Harpur & Prager, 1996; Bell & Sejnowski, in press). These have previously been modeled as two-dimensional Gabor functions (Daugman, 1980; Marcelja, 1980) or difference-of-gaussian operators (Young, 1985). The level 2 receptive fields were constructed by propagating a unit impulse response for each row of the weight matrix W at the second level down to level 0. These receptive fields appear to be tuned toward more complex shapes such as corners, curves, and localized edges. The existence of cells coding for features more complex than simple edges and bars has also been reported in several electrophysiological studies of V2 (Hubel & Wiesel, 1965;
Baizer et al., 1977; Zeki, 1978). Thus, both the first- and second-level units in the model develop internal representations that are comparable to those of their counterparts in the primate visual cortex.

8 Simulations of Free Viewing and Fixation Experiments

The experimental results from the previous section suggest that the model may provide a useful platform for studying cortical function, by demonstrating the model's efficacy in a simple visual recognition task and showing that the learning rules employed by the model are capable of developing receptive fields that resemble cortical cell receptive field weighting profiles. In this section, we return to the question posed in the introduction: Why do the responses of some visual cortical neurons differ so drastically when the same image regions enter the neuron's receptive field in free viewing and fixation conditions (Gallant et al., 1994, 1995; Gallant, 1996)? A possible answer is suggested by the recognition results in Figure 3C. In particular, the figure shows that the responses of neurons depend not only on the image in the cell's receptive field but also on the images in the receptive fields of neighboring cells. This suggests that a cell's response may be modulated to a considerable extent by the surrounding context in which the cell's stimulus appears, due to the presence of top-down feedback in conjunction with lateral interactions mediated by the covariance matrices P, $\hat{\Sigma}_{bu}^{-1}$, and $\hat{\Sigma}_{td}^{-1}$ (see equation 5.3).

For comparison with Gallant et al.'s (1995; Gallant, 1996) electrophysiological recording results, we assumed that the cell responses recorded correspond to the residual $(r - r^{td})$ in the model (see Figure 2).⁴ Qualitatively similar results are obtained by measuring the filtered bottom-up residual.

Footnote 4: The primate visual cortex is a laminar structure that can be divided into six layers (Felleman & Van Essen, 1991). The supragranular pyramidal cells (e.g., layer III) typically project to layer IV of the next higher visual area, while infragranular pyramidals (e.g., layer V/VI) send axons back to layers I, V, and/or VI of the preceding lower area (Van Essen, 1985; Felleman & Van Essen, 1991). The architecture of the model (see Figure 2) suggests that the supragranular cells may carry the residual $(r - r^{td})$ to the next level, while the infragranular cell axons could convey the current top-down prediction Ur (= $I^{td}$ or $r^{td}$) to the lower level. The bottom-up residual $(I - Ur)$ would then be filtered by layer IV cells (whose synapses encode W) with the help of interneurons, which implement the lateral interactions due to the covariances. The estimate $\hat{r}(t)$ presumably would be computed by infragranular cells (whose synapses could encode V) during the course of predicting $\bar{r}(t+1)$; the presence of recurrent axon collaterals (Lund, 1981) among these cells is especially suggestive of such a recursive computation. These functional interpretations of cortical circuitry are, however, speculative and require rigorous experimental confirmation.

For the simulation experiments, we modeled cells in V1 and V2 as shown in Figure 2 (with four level 1 modules feeding into a single level 2 module), but
the model can be readily extended to more abstract higher visual areas with a higher degree of convergence at each level. The parameters were set to the same values as in section 7.2. The feedforward and feedback weights were set equal to the values learned during prior exposure to natural images as described in section 7.2. The free viewing experiments were simulated by making random eye movements to image patches within a given natural image and allowing the entire network to converge until all cell responses stabilized; the response values (as given by the residual) were then recorded. For determining the response of a cell during the fixation task, only the image region that fell in the cell's receptive field during free viewing was displayed and the cell's response noted. The cell responses recorded were numerical quantities that could be used to generate spikes in a more detailed neuron model. Since the eye movements were assumed to be random, the temporal prediction step involving V was not used in these simulations, but we nevertheless expect it to play an important role in future simulations that employ temporal as well as spatial context in predicting stimuli.

Figure 5 shows the simulation results. As noted above, the response of a given cell at any level is determined not just by the image falling within its receptive field but also by the images in the receptive fields of neighboring cells, since the neighboring modules feed into the higher level, which conveys the appropriate top-down prediction back to the lower level: the better the top-down prediction $r^{td}$, the smaller the residual $(r - r^{td})$ and, hence, the smaller the response. In the case of free viewing an image, the neighboring modules play an active role in contributing to the top-down prediction for any cell, and the existence of this larger spatial context allows for smaller responses in general. This is shown in Figure 5A, where the given cell's response initially increases until appropriate top-down signals cause the response to diminish to a stabilized level. On the other hand, in the fixation task, only the image subregion that fell in a cell's receptive field during free viewing is displayed on a blank screen. The absence of any contextual information results in a much less accurate top-down prediction due to the lack of contributions from neighboring modules, and, hence, a relatively larger response is elicited, as illustrated in Figure 5B.

The histograms of responses in the simulations of fixation and free viewing conditions are shown in Figure 5C. The results show that in most, but not all, cases, the responses in free viewing are substantially reduced. This is consistent with the free viewing data (Gallant, 1996) and the data from accompanying control experiments (see below). Perhaps the most interesting feature of the free viewing versus fixation histograms is that they overlap, suggesting that the reduction in responses is not a binary all-or-none phenomenon but rather one that is dependent on the specific ability of the higher-level units to predict the current state at the lower level.
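The free viewing versus fixation protocol can be summarized in schematic form. Here `network.settle` is a hypothetical interface standing in for the relaxation of the full hierarchical model to equilibrium; it is not an API from the article.

```python
import numpy as np

# Sketch of the section 8 comparison: the recorded "response" is the
# magnitude of the residual (r - r_td) after the network settles.
def response(network, image_patch, context_patches):
    """Settle the network and return the level 1 residual magnitude."""
    r, r_td = network.settle(image_patch, context_patches)  # assumed API
    return np.linalg.norm(r - r_td)

def compare(network, image_patch, neighbors):
    free_viewing = response(network, image_patch, neighbors)  # full context
    blank = [np.zeros_like(p) for p in neighbors]
    fixation = response(network, image_patch, blank)          # patch alone
    return free_viewing, fixation   # fixation is typically the larger one
```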
[Figure 5 appears here. Panels A and B plot the response of a simulated neuron (0 to 0.4) against time (0 to 120 iterations) under free viewing and fixation, respectively; panel C plots response frequency against response magnitude (0 to 0.45) for both conditions.]
Figure 5: Free viewing and fixation simulation results. (A) An example of the dynamic response of a cell to an image patch in the simulated case of free viewing. (B) The response of the same cell to the same image patch but under conditions of fixation. (C) The histograms of responses accumulated over 20 cells in a first-level module for 500 image presentations, showing the fixation responses (light gray) compared to the free viewing responses (black). Responses obtained during free viewing were found to be generally much diminished compared with those obtained during the fixation task. As the graph depicts, more than 80 percent of the free viewing responses lie within the first bin in the histogram; the fixation distribution, on the other hand, has a much longer tail than the free viewing distribution.
9 Discussion

In related experiments (Gallant et al., 1995), it was observed that responses were sometimes also suppressed by enlarging the stimulus beyond the cell's receptive field in the fixation task. More recently, Gallant has shown that in many cases, the response attenuation observed during free viewing can be duplicated in a static fixation task in which a review "movie" recreating the spatiotemporal sequence of image patches that entered the receptive field during free viewing is presented to the monkey, the review stimuli being three times the size of the classical receptive field (Gallant, 1996). The results from both of these experiments, which do not involve any eye movements, lend support to our explanation of the suppression of responses as occurring largely due to contributions from neighboring modules in response to the global input stimuli, rather than being the result of oculomotor predictive mechanisms (such as the "shifter circuit" of Anderson & Van Essen, 1987). However, this does not preclude the possibility of suppression due to oculomotor predictions occurring in other visual cortical areas, especially those in the parietal cortex, which are connected to oculomotor structures and where such predictive remapping effects have been observed (Duhamel, Colby, & Goldberg, 1992).

A number of neurophysiological effects due to stimuli from beyond the classical receptive field can be explained within the context of the present model as occurring due to the influence of top-down predictions. Consider, for example, the classical phenomenon of end stopping, or suppression of responses when the stimulus (e.g., a line) extends beyond the cell's receptive field (Hubel & Wiesel, 1965, 1968). This is clearly a highly "nonlinear" property of the "hypercomplex" cell (Hubel & Wiesel, 1965) if one considers the cell in isolation; however, the model suggests that such "nonlinearities" may in fact arise due to cortico-cortical, geniculo-cortical, and intracortical interactions between linear cells as well. An example of this phenomenon is Figure 3C, where a neural unit participates in the completion of an occluded image: The unit's response appears to be a highly nonlinear function of the input image patch even though its activation function is linear. In the case of end stopping, the present model predicts that the ends are "stopped" via inhibitory feedback whenever the surrounding and encompassing receptive fields can predict the cell's local input. The behavior of the cell appears nonlinear only if the cell is viewed in isolation without regard to possible interactions between its neighbors within a level and across levels (as suggested by the covariance terms and top-down feedback terms, respectively, in equation 5.3).⁵ Indeed, experiments by Murphy and Sillito (1987) confirm
744
Rajesh P. N. Rao and Dana H. Ballard
the contribution of cortical feedback in mediating end inhibition in cells in the dorsal LGN. In addition, Bolz and Gilbert (1986) have shown that supragranular cells in the cat striate cortex (V1) lose the property of end-stopping when infragranular cells (in particular, cells in layer VI) are inactivated by using the inhibitory transmitter gamma-aminobutyric acid (GABA). Since infragranular cells in V1 receive feedback from higher extrastriate visual cortical areas (Van Essen, 1985; Felleman & Van Essen, 1991), it is reasonable to assume that inactivation of these cells prevents top-down feedback from influencing the supragranular cells, thereby causing them to lose their property of end stopping. Note that other properties of the supragranular cells such as orientation selectivity were found to remain intact, which is in agreement with our model. Further evidence for the fundamental role of top-down feedback in influencing neural activity at lower cortical levels is provided by Mignard and Malpeli (1991), who show that cells in the upper layers of V1 can be driven in the absence of direct bottom-up input from the LGN. However, additionally destroying V2 profoundly reduced the upper layer neural activity, which led them to conclude that the cause of the upper layer activity was in fact top-down feedback from V2. Given the above observations, it may be possible to attribute other types of “nonspecific suppression” in V1 cells, such as cross-orientation inhibition (DeAngelis, Robson, Ohzawa, & Freeman, 1992) to bottom-up–top-down cortico-cortical and lateral interactions such as those occurring in equation 5.3 rather than to purely normalization-based suppression mechanisms (Heeger, Simoncelli, & Movshon, 1996). Also, in the light of the above discussion, the model predicts that much less response suppression should be observed in Gallant et al.’s free-viewing experiments if top-down feedback is precluded, for instance, by intracortical pharmacological means as in the experiments of Bolz and Gilbert (1986) or via destruction of higher cortical areas as in the experiments of Mignard and Malpeli (1991). Suppression of neural responses has also been observed in higher visual cortical areas such as MT (Allman et al., 1985), V4 (Desimone & Schein, 1987; Muller et al., 1996), and IT (Miller, Li, & Desimone, 1991). Many of these experiments employ time-varying stimuli such as drifting gratings, and we would therefore expect the responses to be modulated with respect to not just spatial context (as in this article) but also recent temporal context. This would necessitate learning the prediction matrix V (a possible method is suggested in the appendix) and using it in conjunction with an appropriate activation function f for predicting spatiotemporal inputs as given by equation 5.4. Simulation of such spatiotemporal cortical models constitutes an active direction of future research.
causing the appearance of a nonlinearity; (2) the model allows for neurons with nonlinear activation functions in the prediction step (equation 5.4) at each hierarchical level.
9.1 Free Viewing of Natural Scenes by Humans. An interesting question is whether the model conforms to behavioral data obtained from human subjects during free viewing of natural images. Experiments by McConkie and others (McConkie, 1991; Grimes & McConkie, 1995) appear to be especially relevant in answering this question. In these experiments, subjects were given the primary task of remembering an image, but they were also informed that changes might be introduced in the image as they examined it. They were asked to press a button when they detected a change. During a given trial, parts of the image were changed during certain saccades. Surprisingly, the subjects were remarkably unaware of such prominent changes as the addition or deletion of objects and changes in color.

The remarkable insensitivity of human subjects to changes in the image during "free viewing" might appear to contradict the present model since, after all, a change in the image from a previous fixation should introduce a large residual and hence direct the subject's attention to it.6 Recent experiments in our laboratory (Ballard, Hayhoe, Pook, & Rao, in press) and in McConkie's laboratory (Irwin, McConkie, Carlson-Radvansky, & Currie, 1994) in fact do corroborate this expectation: Subjects do indeed become aware of the changes if the object that was subject to a change was relevant to the task at hand. For example, Ballard et al. (in press) report an increase in fixation time of subjects whenever task-relevant changes are introduced during the course of a block copying task. In addition, neurophysiological experiments by Miller et al. (1991) indicate that responses in IT neurons in the monkey are attenuated according to whether the presented stimulus matches a previously memorized object, yielding large responses (residuals) during mismatches and response suppression (small residuals) during matches. Indeed, the theoretical framework of Crick and Koch (see, for example, Koch, 1996) proposes that visual awareness may be correlated with responses in higher visual cortical areas such as V4 and IT, but probably not lower-level areas such as V1. This supports our argument that high residuals in lower-level areas, especially when in image regions not related to the task at hand, need not always imply awareness of the corresponding changes as in McConkie's experiments.

9.2 Illusory Contours, Amodal Completions, Visual Imagery, and Bistable Percepts. We have shown that the ability of adjacent modules to contribute to the output of a neighbor (via feedback from the next level) endows the model with important properties, such as the ability to form completions in the presence of occlusions (for example, see Fig. 3). The same property also allows alternate interpretations of phenomena such as subjective-illusory contours and amodal completions (Kanizsa, 1990). For
6 This observation applies to models employing temporal context to predict postsaccadic stimuli, whereas the simulations described in this article used spatial context only.
example, Grosof, Shapley, and Hawken (1992) and von der Heydt, Peterhans, and Baumgartner (1984) report the existence of neurons responding to illusory contours in V1 and V2, respectively. The present model suggests that these responses may be the result of expectation-based top-down feedback, as illustrated by Figure 3C. In this regard, the existence of overlap between adjacent receptive fields helps considerably in providing more accurate completions as compared to nonoverlapping receptive fields.

The feedback pathways in the model also provide a convenient substrate for realizing visual imagery as illustrated by the top-down reconstructed images in Figure 3. Indeed, recent positron emission tomography (PET) studies have shown that imagery shares some of the same neural substrates as perception and can activate even the lowest visual areas (Kosslyn, Thompson, Kim, & Alpert, 1995).

Equation 5.4 allows the effects of intrinsic neuronal noise to be modeled via the additive noise term n(t). By assuming a probability distribution for this term and sampling from this distribution in equation 5.4, the computation of the recognition state estimates can be made stochastic. An alternate but equally plausible method to make the model stochastic is to sample directly the current recognition state from a gaussian distribution with mean r̄ and covariance M. Either of these settings could conceivably allow modeling of bistable percepts and Gestalt inversions such as Rubin's vase or Necker's cube. Such phenomena could be attributed to a lack of convergence at the highest level to a stable object hypothesis. The stochastic nature of the model in turn would allow periodic transitions between two alternate interpretations of the same visual image (local minima in recognition state space). Evaluating the efficacy of these ideas remains the subject of ongoing simulations.
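The following minimal sketch illustrates the second option; all dimensions and values are hypothetical stand-ins, and in the model r̄ and M would come from the estimation step rather than being set by hand:

```python
import numpy as np

# Minimal sketch of the stochastic variant described above: draw the
# recognition state from a gaussian with mean r_bar and covariance M.
# All dimensions and values here are hypothetical stand-ins.
rng = np.random.default_rng(0)

k = 10                                  # hypothetical state dimension
r_bar = rng.standard_normal(k)          # stand-in for the current mean estimate
M = np.eye(k) * 0.05                    # stand-in for its covariance

# Repeated draws could settle near different local minima in recognition
# state space, one possible account of bistable percepts.
r_sample = rng.multivariate_normal(r_bar, M)
```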
9.3 Comparison with Related Work on Cortical Modeling. There has been considerable work in recent years on determining the computational nature of the cortex and the brain (Churchland & Sejnowski, 1992). This work includes models ranging from feedforward networks, such as the hierarchical "perceptual network" of Linsker (1988) and the "HBF" networks of Poggio and collaborators (Poggio, 1990; Poggio, Fahle, & Edelman, 1992), to bidirectional "flow-control"-based models, such as Ullman's (1994) Counterstreams model, Deacon's (1989) "counter-current diffusion" model, and Van Essen, Anderson, and Olshausen's (1994) Dynamic Routing Circuit model. Although an exhaustive survey is beyond the scope of this article, we briefly outline comparisons of the present model with some closely related models that employ both feedforward and feedback connections for perception and learning.

Perhaps the earliest use of feedback mechanisms as integral components of perception can be found in the classic article by Pitts and McCulloch (1947) in which a negative feedback model of oculomotor control is
proposed. Another early model in which feedback plays a central role is that of MacKay (1956), who describes an automaton that generates "imitative" internal responses to match incoming signals adaptively. The present model can be regarded as a concrete stochastic solution to the "epistemological problem" that is the subject of MacKay's work. Feedback also plays an important role in several more recent theories, such as Carpenter and Grossberg's (1987) adaptive resonance theory, Edelman's (1978) "reentrant" signaling theory, Fukushima's (1988) extended Neocognitron model, Harth, Unnikrishnan, and Pandya's (1987) "Alopex" optimization model, Mumford's (1994) model of the cortex based on Grenander's (1976–1981) Pattern Theory, and Rolls' (1989) theories on the hippocampus and cortical backprojections, all of which use feedback connections to instantiate forward models (Jordan & Rumelhart, 1992) of the input data but differ in the way top-down feedback is compared and integrated with bottom-up input.

Of these, Mumford's model is closest in spirit to the ideas presented in this article. However, iterative relaxation schemes such as those proposed by Mumford and others have been previously dismissed as models of perception due to the large number of iterations that may be required to ensure convergence, a fact that contradicts the rapidity of recognition of static stimuli in humans and primates (Thorpe & Imbert, 1989; Oram & Perrett, 1992; Tovee, Rolls, Treves, & Bellis, 1993). The model proposed in this article avoids the pitfalls of steepest descent relaxation algorithms by employing prediction weights V and approximate inverse models (as given by W) in the feedforward pathway that allow rapid (though perhaps crude and not always correct) one-shot estimates of static stimuli that are further refined by the state estimation algorithm if necessary.

Inverse models also play a crucial role in the forward-inverse optics model of Kawato, Hayakawa, and Inui (1993) and in the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995), both of which share considerable similarities with the model proposed in this article and both of which inspired many of the choices made in the present model. The Kawato et al. (1993) model minimizes the sum of the reconstruction error and a regularizer term incorporating a priori knowledge about the visual world, such as smoothness of representations. This is similar to the MDL optimization function that we use, but without the stochastic perspective. The Helmholtz machine (Dayan et al., 1995) is a hierarchical network that builds a stochastic generative model by using an inverse ("recognition") model to approximate the true posterior distribution of the input data. For modifying the feedback (generative) and feedforward (recognition) weights of a given stochastic Helmholtz machine, Hinton et al. (1995) propose an elegant learning method known as the "wake-sleep" algorithm, which is derived using the MDL principle. While the derivation of the algorithm necessitates some approximations and simplifications, which may sometimes prevent it from converging to the correct posterior
distribution, experimental results using the algorithm have been positive (Hinton et al., 1995). In this regard, the nonlinear form of the model presented in this article also uses an approximation: a first-order Taylor series approximation to the nonlinear activation function f. A possible weakness of the Helmholtz machine, as stated by Dayan et al. (1995), is that recognition in the hierarchical network is a purely bottom-up feedforward process; the generative weights are superfluous and play no role in mediating perception. In addition, there are no lateral interactions between units at a given level.7 The model presented in this article allows for relatively complex dynamic interactions to occur between top-down and bottom-up signals, in addition to some lateral interactions, as given by equation 5.3, among the units at each level. Such top-down–bottom-up interactions have been shown to occur in a wide variety of visual tasks in humans (see Churchland, Ramachandran, & Sejnowski, 1994, for a review). Another issue not directly addressed by many of the models above is that of modeling the dynamic nature of vision and the role of prediction in visual processing (see Softky, 1996, for some arguments regarding the necessity of prediction during perception). The model presented here, on the other hand, is essentially a hierarchical predictor-estimator that continually refines its recognition estimates at each level, allowing the visual system to anticipate incoming stimuli.

7 However, see Dayan and Hinton (1996) for several interesting variants of the Helmholtz machine that address some of these issues.

9.4 Comparison with Related Work on Kalman Filters. During the past decade, Kalman filters have been used in computer vision and robotics for tackling a wide variety of problems ranging from motion estimation to contour tracking (Hallam, 1983; Broida & Chellappa, 1986; Ayache & Faugeras, 1986; Matthies, Kanade, & Szeliski, 1989; Blake & Yuille, 1992; Dickmanns & Mysliwetz, 1992; Pentland, 1992). Much of this work crucially hinges on the ability to formulate accurate dynamic-physical models of the object properties being estimated. The formulation of such hand-coded models, however, becomes increasingly difficult in more complex dynamic environments. A crucial difference between the present approach and the approach predominant in much of the Kalman filter literature is that rather than being concerned solely with the estimation of state for a predefined model, we suggest that Kalman filter–based estimation algorithms may also be used for simultaneously learning the feedforward, feedback, and prediction models ("measurement" and "system" models in Kalman filter terminology) as given by their respective weight matrices. It remains to be seen whether these learned models perform favorably in environments where hand-coded models have proved to be unsatisfactory. Another important difference is the use of a hierarchical form of the Kalman filter. Our formulation in this regard is in many ways similar to the recent proposals of Chou,
Willsky, and Benveniste (1994), Chou, Willsky, and Nikoukhah (1994), and Luettgen and Willsky (1995), who describe an optimal estimation algorithm for a class of multiscale dynamic models. These dynamic models are similar in spirit to the one we described in section 3, albeit without the temporal prediction step. In particular, Chou et al. (1994) exploit the analogy between time and scale, and define scale-recursive linear dynamic models evolving on dyadic trees,

$$r(s) = A(s)\,r(s-1) + B(s)\,w(s), \tag{9.1}$$

$$I(s) = C(s)\,r(s) + v(s), \tag{9.2}$$
where s denotes the current scale on the dyadic tree (highest values denote the leaves), and w and v are zero-mean white noise processes. This formulation leads to an elegant estimation algorithm recursive in scale, consisting of an upward sweep in which information in a subtree is fused in a fine-to-coarse fashion, followed by a downward sweep in which information is spread back through the tree. For the simple linear case without the temporal prediction step, the iterative algorithm presented in this article can be seen to converge to approximately the same solution as that found by the upward-downward sweep algorithm proposed by Chou et al. (1994).8 The equations used by Chou et al. (1994) above can be seen to be closely related to the following two equations (when defined on a dyadic tree) that formed part of the more general stochastic model described in section 3:

$$r(s, t) = r_{td}(s, t) + n_{td}(s, t), \tag{9.3}$$

$$I(s, t) = U(s, t)\,r(s, t) + n_{bu}(s, t), \tag{9.4}$$

8 We thank Peter Dayan for pointing out this equivalence.
where we may use r_td(s, t) = U_i(s − 1, t) r(s − 1, t) if r_td happens to be information from the next higher level and U_i is the part of the higher-level weight matrix U that generates inputs for the given lower-level module (in Fig. 2, for example, the matrix U_i for the first-level module corresponds to the top one-fourth rows of the second-level matrix U as depicted in the figure). The correspondence between the various terms in the two sets of equations (9.1 and 9.3, 9.2 and 9.4) is quite striking. However, a few important differences are also evident. While the linear dynamic system model in Chou et al.'s (1994) framework is used only to relate transitions from one scale to the next, the model used in this article can be seen to be recursive in both scale and time. The recursion in scale is, however, implicit since information from another scale is handled in a modular fashion simply as input from a separate source (as in equations 3.5 and 9.3). The corresponding estimation algorithm is thus recursive in time as in the standard Kalman filter and allows for prediction in the temporal domain at each hierarchical
level. As stressed previously, the ability to predict is crucial for perception and action. Being strictly recursive in scale, the Chou et al. (1994) algorithm requires that covariance matrix information be propagated across hierarchical levels. In contrast, the model presented here maintains covariance information locally at each level. In a biological setting, the computation of covariance information may be a limiting step and is perhaps best kept local. A further advantage of our formulation is that it lends itself naturally to arbitrary varieties of modularization by allowing estimates of a given state variable from disparate sources to be integrated according to their signal-to-noise ratios (this article illustrates the example of top-down and bottom-up sources). A direct consequence of such a formulation is that one can model arbitrary connections among different (but related) cortical areas that do not necessarily have to be related in a strictly multiscale, hierarchical fashion as required by the framework of Chou et al. (1994). A final but important difference is the prominent role accorded to learning in our approach, a subject not addressed by Chou et al. (1994).
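A minimal sketch of this kind of integration for two gaussian estimates of the same state, weighted by their inverse covariances (all shapes and values are hypothetical stand-ins, not quantities from the article):

```python
import numpy as np

# Minimal sketch, not from the article: fusing a top-down estimate r_td
# (covariance S_td) and a bottom-up estimate r_bu (covariance S_bu) of the
# same state by inverse-covariance (precision) weighting, i.e., according
# to their signal-to-noise ratios.
k = 3
r_td = np.array([1.0, 0.5, -0.2])        # stand-in top-down estimate
S_td = np.eye(k) * 0.5                   # noisier source
r_bu = np.array([0.8, 0.7, 0.0])         # stand-in bottom-up estimate
S_bu = np.eye(k) * 0.1                   # more reliable source

P = np.linalg.inv(np.linalg.inv(S_td) + np.linalg.inv(S_bu))        # fused covariance
r = P @ (np.linalg.inv(S_td) @ r_td + np.linalg.inv(S_bu) @ r_bu)   # fused mean
```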
9.5 Comparison with Related Work on Learning and Parameter Estimation. The subject of learning synaptic weights that resemble receptive field weighting profiles in the visual cortex has received considerable attention within the neural modeling community. Early work showed that receptive fields resembling those of simple cells in the striate cortex could be learned from natural images using algorithms ranging from competitive learning (Barrow, 1987) to principal component analysis (Hancock, Baddeley, & Smith, 1992). However, these fell short of providing a complete characterization due to limitations such as a proclivity toward producing global and/or orthogonal receptive fields. More recently, algorithms that produce better characterizations of cortical receptive fields have been proposed independently by Olshausen and Field (1996) and Harpur and Prager (1996) (see also Bell & Sejnowski, 1996), building on previous work by Daugman (1988) and Pece (1992). As alluded to in section 2, these algorithms are closely related to the learning rule expressed in equation 6.4. In particular, the "sparseness" function of Olshausen and Field (1996) corresponds to the MDL model prior term |L(r)| in equation 4.6. Using different encoding distributions gives rise to different degrees of sparseness and localization in the internal representations. For example, Olshausen and Field (1996) show how a variety of nonlinear sparseness functions can be used to generate localized receptive fields.9

9 However, Baddeley (1996) presents some arguments against sparseness maximization.

In our case, the gaussian distribution assumed when evaluating |L(r)| leads to a linear activity penalty that favors distributed codes; localization in our model is provided by the already existing cortical topology of hierarchical, overlapping receptive fields, whose size at
each level is predetermined by the local dendritic field. Since the learning algorithms of Harpur and Prager (1996) and Olshausen and Field (1996) are based on gradient descent (as described in section 2), they require the specification of a schedule for adapting the learning rate parameter, whereas this adaptation occurs naturally in equation 6.4 due to the gradual reduction in the noise covariance as more inputs become available. More important, the learning rule expressed by equation 6.4, as well as those proposed for the other weights in the model, takes into account not only bottom-up information but also top-down expectations during weight modification. This allows for the possibility of developing top-down guided task-relevant internal representations.10

10 A hierarchical learning method for vector quantization in the presence of noise is presented in Luttrell (1992). This method allows single-cause models, whereas the present model can encode multiple-cause models (Dayan & Zemel, 1995; Saund, 1995; see also Luttrell, 1995).

The learning method presented in this article is closely related to mean field methods for Bayesian belief networks (Saul, Jaakkola, & Jordan, 1996; Jaakkola, Saul, & Jordan, 1996), which use mean field equations for computing lower bounds on the log likelihood–based cost function. The network parameters (the weights and biases) are subsequently learned by performing gradient ascent on the mean field derived bound for the log likelihood. In comparison, the learning scheme presented in this article can be regarded essentially as employing an on-line form of the EM algorithm (Dempster et al., 1977) in which the M-step performs maximum a posteriori (MAP) estimation rather than maximum likelihood (ML) estimation. Note that ML estimation is simply a special case of MAP estimation in which there is no prior information (this corresponds to the case where no predictions are computed from prior data as, for example, in section 2). In particular, the on-line MAP form of the EM algorithm is invoked in three different ways in the present framework (a schematic sketch of this alternation appears at the end of this section):

1. Estimation of the state r. In this case, the "hidden" variables (or "missing" data) consist of the weights U and V while we treat the state vector r as the parameter to be estimated. Thus, the E-step involves computing the gaussian distributions given by (ū, S) and (v̄, T). The M-step then uses these distributions from the E-step to compute r̂ and P, thereby maximizing the posterior probability of the parameter vector r (by minimizing the MDL-based cost function of section 4). Note, however, that in the case where the activation function f is nonlinear, the maximization in the M-step is not exact due to the Taylor series approximation to f used in the various update equations (the same applies to the other M-steps below).

2. Estimation of the generative weights u. Here, the hidden variables consist of the state vector r and the weights V, while the parameter to
be estimated is U. The E-step thus involves computing the distributions given by (r̄, M) (or, when feasible, (r̂(t|N), P(t|N)); see below) and (v̄, T), which allow the calculation of û and P_u in the M-step, as discussed in section 6.

3. Estimation of the prediction weights v. In this case, the hidden variables comprise the state vector r and the weights U, and the parameter to be estimated is the state transition matrix V. Since V implements a Markov chain relating the states r in time (see equation 3.7), the E-step in this case becomes slightly more complicated than in the two cases above since the optimal estimate of the state r is no longer causal but depends temporally on future as well as past and current data values. In the case of hidden Markov models (HMMs) (Rabiner & Juang, 1986) (which are analogous to Kalman filters with discrete hidden states), this impasse is solved by using the familiar forward-backward procedure within the context of the Baum-Welch algorithm (Baum et al., 1970) (the Baum-Welch algorithm is, incidentally, a special case of the EM algorithm). In our case, we may use a form of optimal smoothing (Bryson & Ho, 1975), which, when given the batch input data for time t = 1, . . . , N, optimally assimilates past, current, and future data into temporally smoothed state estimates r̂(t|N) (with covariance P(t|N)) for each of the time instants t = 1, . . . , N. This, along with the computation of (ū, S), comprises the E-step. The M-step then involves computing v̂ and P_v as described in the appendix.

It may be desirable in many cases to implement the various E- and M-steps above for each set of parameters in an asynchronous manner rather than in strict alternation. For example, the cortex may choose to update its estimates for the state r much more frequently than those for U and V in order to act in real time. The updates for U and V could then be carried out either online but at a slower rate than r, or alternately when the organism is in a rest phase. The latter suggestion is especially attractive in the case of learning V given that the batch processing of data required by the optimal smoothing step suggested above may not be appropriate in many circumstances that demand real-time responses from the organism. In such cases, it might be beneficial for the organism to act based on its current online state estimates while simultaneously storing behaviorally relevant input sequences in memory as batch data. Such stored data may then be replayed later, when the organism is in a more quiescent phase, in order to learn or fine-tune the prediction weights V. This suggests a possible computational role for dream sleep in facilitating learning and memory.
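The promised schematic of this asynchronous alternation follows; every class name, method name, and update rate below is a hypothetical placeholder, not the article's implementation:

```python
import numpy as np

# Schematic sketch of the asynchronous on-line MAP-EM alternation described
# above: the state r is updated at every time step, the generative weights U
# more slowly, and the prediction weights V from replayed batch data during
# a "rest phase". All names and rates are hypothetical placeholders.

class DummyModel:
    def update_state(self, I): ...                      # fast E/M steps for r
    def update_generative_weights(self): ...            # slower E/M steps for U
    def smooth_states(self, batch): return batch        # stand-in for r_hat(t|N)
    def update_prediction_weights(self, smoothed): ...  # M-step for V

def run(inputs, model, weight_period=100):
    replay = []                                # behaviorally relevant sequences
    for t, I in enumerate(inputs):
        model.update_state(I)                  # every step, to act in real time
        replay.append(I)
        if t % weight_period == 0:
            model.update_generative_weights()  # slower rate than r
    # "Rest phase": batch-smooth stored data, then fit the prediction weights.
    smoothed = model.smooth_states(replay)
    model.update_prediction_weights(smoothed)

run([np.zeros(4) for _ in range(500)], DummyModel())
```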
10 Summary and Conclusion

Vision is fundamentally a dynamic process. Any putative model of the visual cortex must therefore acknowledge this fact and allow the modeling of dynamic processes. This article suggests that the hierarchical structure of the visual cortex is well suited to implement a multiscale estimation algorithm that can be regarded as a hierarchical form of the extended Kalman filter (Maybeck, 1979). At each hierarchical level, the algorithm recursively estimates the current recognition state by combining "measurement" information from top-down and bottom-up cortical modules with the prediction that was made by using a "system" model. The "system" matrix (the prediction weights) as well as the "measurement" matrices (feedforward and feedback weights) can be learned from visual experiences in a dynamic environment. Thus, the adaptive processes in the cortex are modeled at two different time scales: a fast dynamic state estimation process that allows the visual system to anticipate incoming stimuli and a slower Hebbian synaptic weight change mechanism. Both processes are characterized by their respective mean and covariance estimates, which are continually updated.

Some of the salient features of the model as a recognition system were illustrated by training it on a small number of man-made objects and testing recognition performance. The model was shown to be robust to partial occlusions and confusing stimuli. An ongoing effort involves a more quantitative evaluation of the model's recognition performance using realistic objects in natural scenes and the validation of the model in the domain of human visual search. Preliminary results along these two directions have been promising (Rao & Ballard, 1995a; Rao, Zelinsky, Hayhoe, & Ballard, 1996). When trained on natural images, the model developed feedforward receptive fields qualitatively resembling those found at the early hierarchical levels of the primate visual cortex. Current simulations involve applying the learning algorithms to successively higher levels of the network and comparing the receptive field properties that develop to those found at higher cortical levels such as IT.

The hierarchical model also provides a computational platform for understanding a number of neurophysiological-psychophysical phenomena, such as illusory contours, amodal completions, visual imagery, bistable percepts, end stopping, and other effects that occur due to stimuli from beyond the classical receptive field. In particular, we provided simulations of the model suggesting that some of the recent results of Gallant et al. (1994, 1995; Gallant, 1996) regarding neural response suppression during free viewing of natural scenes may be interpreted as occurring due to predictions from higher levels of the cortex.

While the present framework illustrates how information from two different sources, one top down and the other bottom up, may be integrated in a given cortical area, it is not hard to envision extensions of the model where more than two areas feed into a single area, as is evident in the visual cortex (Felleman & Van Essen, 1991). This would entail the relatively straightforward exercise of adding additional error/model terms to the optimization equations 4.5 and 4.6 to account for inputs from the additional areas before deriving the estimation algorithms.
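A minimal one-level, linear sketch of this predict-correct cycle is given below; all matrices, dimensions, and noise values are hypothetical stand-ins rather than the trained model's quantities:

```python
import numpy as np

# Minimal one-level, linear predict/correct sketch of the estimation cycle
# described above. All matrices and dimensions are hypothetical stand-ins.
rng = np.random.default_rng(1)
k, n = 4, 16
U = rng.standard_normal((n, k)) * 0.1    # "measurement" (generative) weights
V = np.eye(k) * 0.95                     # "system" (prediction) weights
Sigma = np.eye(k) * 0.01                 # system noise covariance
N_cov = np.eye(n) * 0.1                  # measurement noise covariance

r_hat = np.zeros(k)
P = np.eye(k)

for I in rng.standard_normal((50, n)):   # stand-in input stream
    # Predict the next state from the system model.
    r_bar = V @ r_hat
    M = V @ P @ V.T + Sigma
    # Correct with the bottom-up measurement via the Kalman gain.
    K = M @ U.T @ np.linalg.inv(U @ M @ U.T + N_cov)
    r_hat = r_bar + K @ (I - U @ r_bar)
    P = (np.eye(k) - K @ U) @ M
```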
A reassuring feature of the model is that it does not explicitly depend on the input signals being visual. Thus, given that the neocortex is organized along similar laminar input-output principles regardless of cortical area or input modality (Creutzfeldt, 1977), it is reasonable to assume that the general framework proposed here may be uniformly applicable to other cortical areas, such as the parietal, auditory, or motor cortex, as well.11 We expect the results obtained using the current model to play a crucial role in guiding our future cortical modeling efforts.

11 We have recently shown (Rao & Ballard, 1996) how the model presented in this article can be extended to handle transformations in the image plane. The resulting estimation scheme parallels the functional dichotomy between the dorsal (occipitoparietal) and ventral (occipitotemporal) pathways in the primate visual cortex (Felleman & Van Essen, 1991). Support for the possibility of modeling the motor cortex using Kalman filter-like mechanisms comes from the work of Wolpert, Ghahramani, and Jordan (1995), who provide strong psychophysical evidence for the existence of such mechanisms in the context of sensorimotor integration.

Appendix: Learning the Prediction Weights

While the simulations in the article used static stimuli and therefore did not make use of the prediction step, it is almost certain that such predictions play an important role in the processing of time-varying stimuli. Therefore, we suggest a possible learning procedure for modifying the synaptic weights v of the (possibly nonlinear) prediction units during exposure to time-varying stimuli. The derivation is similar to that for the feedback weights u. Given the current vector of responses r̂ at time t, define the k × k² matrix R̂ to be

$$\hat{R} = \begin{pmatrix} \hat{r}^T & 0 & \cdots & 0 \\ 0 & \hat{r}^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{r}^T \end{pmatrix}. \tag{A.1}$$
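As a quick concreteness check (hypothetical dimensions; this is not code from the article), R̂ can be built as a Kronecker product and satisfies the identity R̂v = Vr̂ underlying equation A.2:

```python
import numpy as np

# Minimal sketch (hypothetical dimensions, not from the article): build the
# k x k^2 matrix R_hat of equation A.1 and check R_hat @ v == V @ r_hat,
# where v is the row-wise flattening of the k x k prediction matrix V.
k = 4
r_hat = np.arange(1.0, k + 1.0)          # stand-in response vector
R_hat = np.kron(np.eye(k), r_hat)        # row i holds r_hat^T in block i

V = np.random.default_rng(0).standard_normal((k, k))
v = V.reshape(-1)                        # flattened prediction weights
assert np.allclose(R_hat @ v, V @ r_hat)
```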
Notice that the prediction step can be stated as

$$r(t+1) = f(V(t)\,\hat{r}(t)) + n(t) = f(\hat{R}(t)\,v(t)) + n(t). \tag{A.2}$$
Differentiating the cost function J with respect to the vector of prediction weights v and setting the result to zero, we obtain, using equation A.2 for r(t + 1),

$$-\left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - f(\hat{R}(t)\hat{v}(t)) - \bar{n}(t)\right) + T^{-1}\left(\hat{v}(t) - \bar{v}(t)\right) + \beta\,\hat{v}(t) = 0, \tag{A.3}$$
which yields the update rule (for time t),

$$\hat{v}(t) = \bar{v}(t) + P_v \left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - f(\hat{R}(t)\bar{v}(t)) - \bar{n}(t)\right) - \beta P_v\,\bar{v}(t), \tag{A.4}$$
or more simply,

$$\hat{v}(t) = \bar{v}(t) + P_v \left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - \bar{r}(t+1)\right) - \beta P_v\,\bar{v}(t), \tag{A.5}$$
where v̄(t + 1) = v̂(t) + n̄_v(t) and r̄(t + 1) = f(R̂(t)v̄(t)) + n̄(t). The corresponding covariance matrix is updated at time t as follows,

$$P_v(t) = \left(T(t)^{-1} + \left(\frac{\partial f}{\partial v}\right)^{T} M(t)^{-1}\,\frac{\partial f}{\partial v} + \beta I\right)^{-1}, \tag{A.6}$$
where I is the k² × k² identity matrix. The partial derivatives are evaluated at v(t) = v̄(t). Note that in order to obtain a closed-form expression for updating v̂ and P_v, we once again applied a first-order Taylor series approximation to f around the current estimate v̄(t). The covariance matrix T gets updated according to

$$T(t+1) = P_v(t) + \Sigma_v(t). \tag{A.7}$$
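The following minimal sketch implements equations A.5 through A.7 for the linear case f(x) = x, where ∂f/∂v reduces to R̂; all dimensions, priors, and noise values are hypothetical stand-ins, not the article's implementation:

```python
import numpy as np

# Minimal sketch of update rules A.5-A.7 for the linear case f(x) = x,
# where df/dv reduces to R_hat. All values are hypothetical stand-ins.
rng = np.random.default_rng(2)
k = 4
beta = 0.01
M = np.eye(k) * 0.1                       # state noise covariance
Sigma_v = np.eye(k * k) * 1e-4            # random-walk covariance for v

v_bar = np.zeros(k * k)                   # prior mean of flattened V
T = np.eye(k * k)                         # its covariance
r_hat = rng.standard_normal(k)            # current state estimate
r_next = rng.standard_normal(k)           # stand-in for r(t+1) or r_hat(t+1|N)

R_hat = np.kron(np.eye(k), r_hat)         # equation A.1
r_bar_next = R_hat @ v_bar                # predicted next state (n_bar = 0 here)

# Equation A.6: posterior covariance of v.
P_v = np.linalg.inv(np.linalg.inv(T) + R_hat.T @ np.linalg.inv(M) @ R_hat
                    + beta * np.eye(k * k))
# Equation A.5: posterior mean of v (the prediction-error-driven update).
v_hat = (v_bar + P_v @ R_hat.T @ np.linalg.inv(M) @ (r_next - r_bar_next)
         - beta * P_v @ v_bar)
# Equation A.7: propagate covariance to the next time step.
T = P_v + Sigma_v
```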
By appealing to a version of the EM algorithm (see section 9.5), we may use r(t + 1) = r̂(t + 1|N) in equation A.5 above, where r̂(t + 1|N) is the optimal temporally smoothed state estimate (Bryson & Ho, 1975) for time t + 1 (≤ N), given input data for each of the time instants 1, . . . , N (see the discussion in section 9.5). In certain circumstances, it might be possible to use the online estimate r̂(t + 1) in place of its computationally more expensive smoothed counterpart r̂(t + 1|N), though such an approximation has been known to yield negative results in other related network architectures (Peter Dayan, pers. comm.). Simulations are currently underway to evaluate the effectiveness of the above learning rule and the role of temporally smoothing the state estimates. Intuitively, equation A.5 may be regarded as a form of quasi-Hebbian adaptation involving the presynaptic activity ∂f/∂v, which is simply the
response R̂(t) at time t for linear neurons, and the postsynaptic activity (r̂(t + 1|N) − r̄(t + 1)), which indicates the error in prediction. Computing the difference in the prediction error term requires the presence of some form of inhibitory interneurons. Inhibitory interneurons are also required by some of the other estimation algorithms derived in this article. Interestingly, a wide variety of such interneurons have been discovered in the primate visual cortex (Jones, 1981).

Acknowledgments

We are grateful to Peter Dayan and the two anonymous reviewers for their constructive comments and suggestions that immensely helped in improving the quality of the article. We are particularly indebted to Peter Dayan for extensive criticism of an earlier version, which inspired the more general formulation of the approach as described here. Thanks are also due to Jack Gallant for many interesting conversations, for reading drafts of the article, and for the large number of suggestions and clarifications he provided during the course of this work. We also thank Helder Araujo, Jessica Bayliss, Chris Brown, Avis Cohen, Mary Hayhoe, Kiriakos Kutulakos, Peter Lennie, Bill Merigan, Robbie Jacobs, Gerhard Sagerer, David Williams, and Gregory Zelinsky for pointers, comments, and discussions regarding the article. Some of the images used in the simulations were taken from the Columbia object database, courtesy of Shree Nayar. This research was supported by NIH/PHS research grants 1-P41-RR09283 and 1-R24-RR06853-02, and by NSF research grants IRI-9406481 and IRI-8903582.

References

Allman, J., Miezin, F., & McGuinness, E. (1985). Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Annual Review of Neuroscience, 8, 407–429. Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456–458. Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences, 84, 6297–6301. Ayache, N., & Faugeras, O. D. (1986). HYPER: A new approach for the recognition and positioning of two-dimensional objects. IEEE Trans. Pattern Analysis and Machine Intelligence, 8(1), 44–54. Baddeley, R. (1996). Searching for filters with "interesting" output distributions: An uninteresting direction to explore? Network, 7(2), 409–421. Baizer, J. S., Robinson, D. L., & Dow, B. M. (1977). Visual responses of area 18 neurons in awake, behaving monkey. Journal of Neurophysiology, 40, 1024–1037.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (in press). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences. Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE Int. Conf. on Neural Networks, pp. 115–121. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171. Bell, A. J., & Sejnowski, T. J. (in press). The “independent components” of natural scenes are edge filters. Submitted to Vision Research, 1996. Blake, A., & Yuille, A. (Eds.). (1992). Active vision. Cambridge, MA: MIT Press. Bolz, J., & Gilbert, C. D. (1986). Generation of end-inhibition in the visual cortex via interlaminar connections. Nature, 320, 362–365. Broida, T. J., & Chellappa, R. (1986). Estimation of object motion parameters from noisy images. IEEE Trans. Pattern Analysis and Machine Intelligence, 8(1), 90–99. Bryson, A. E., & Ho, Y.-C. (1975). Applied optimal control. New York: Wiley. Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115. Chatfield, C., & Collins, A. J. (1980). Introduction to multivariate analysis. New York: Chapman and Hall. Chou, K. C., Willsky, A. S., & Benveniste, A. (1994). Multiscale recursive estimation, data fusion, and regularization. IEEE Transactions on Automatic Control, 39(3), 464–478. Chou, K. C., Willsky, A. S., & Nikoukhah, R. (1994). Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39(3), 479–492. Churchland, P., & Sejnowski, T. (1992). The computational brain. Cambridge, MA: MIT Press. Churchland, P. S., Ramachandran, V. S., & Sejnowski, T. J. (1994). A critique of pure vision. In C. Koch and J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 23–60). Cambridge, MA: MIT Press. Creutzfeldt, O. D. (1977). Generality of the functional structure of the neocortex. Naturwissenschaften, 64, 507–517. Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20, 847–856. Daugman, J. G. (1988). Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. Acoustics, Speech, and Signal Proc., 36(7), 1169–1179. Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403. Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579. Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Deacon, T. (1989). Holism and associationism in neuropsychology: An anatomical synthesis. In E. Perecman (Ed.), Integrating Theory and Practice in Clinical Neuropsychology (pp. 1–47). Hillsdale, NJ: Erlbaum. DeAngelis, G. C., Robson, J. G., Ohzawa, I., & Freeman, R. D. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. Journal of Neurophysiology, 68(1), 144–163. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38. Desimone, R., & Schein, S. J. (1987). Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. Journal of Neurophysiology, 57, 835–868. Desimone, R., & Ungerleider, L. G. (1989). Neural mechanisms of visual processing in monkeys. In F. Boller & J. Grafman (Eds.), Handbook of Neuropsychology (Vol. 2, pp. 267–299). New York: Elsevier. Dickmanns, E. D., & Mysliwetz, B. D. (1992). Recursive 3D road and relative ego-state recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 14(2), 199–213. Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural Computation, 1, 480–488. Duhamel, J., Colby, L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90–92. Edelman, G. M. (1978). Group selection and phasic re-entrant signaling: A theory of higher brain function. In V. B. Mountcastle & G. M. Edelman (Eds.), The Mindful Brain (pp. 51–100). Cambridge, MA: MIT Press. Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47. Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1). New York: Wiley. Fukushima, K. (1988). A neural network for visual pattern recognition. Computer, 21(3), 65–75. Gallant, J. L. (1996). Cortical responses to natural scenes under controlled and free viewing conditions. Invest. Ophthalmol. Vis. Sci., 37(3), 674. Gallant, J. L., Connor, C. E., & Van Essen, D. C. (1994). Responses of visual cortical neurons in a monkey freely viewing natural scenes. Soc. Neuro. Abstr., 20, 838. Gallant, J. L., Connor, C. E., Drury, W., & Van Essen, D. C. (1995). Neural responses in monkey visual cortex during free viewing of natural scenes: Mechanisms of response suppression. Invest. Ophthalmol. Vis. Sci., 36, 1052. Gilbert, C. D., & Wiesel, T. N. (1981). Laminar specialization and intracortical connections in cat primary visual cortex. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 163–191). Cambridge, MA: MIT Press. Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2), 219–269. Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Grimes, J., & McConkie, G. (1995). On the insensitivity of the human visual system to image changes made during saccades. In K. Akins (Ed.), Problems in Perception. Oxford: Oxford University Press. Grosof, D. H., Shapley, R. M., & Hawken, M. J. (1992). Macaque striate responses to anomalous contours? Invest. Ophthalmol. Vis. Sci., 33, 1257. Gulyas, B., Orban, G. A., Duysens, J., & Maes, H. (1987). The suppressive influence of moving textured backgrounds on responses of cat striate neurons to moving bars. Journal of Neurophysiology, 57(6), 1767–1791. Hallam, J. (1983). Resolving observer motion by object tracking. In Proc. of 8th International Joint Conf. on Artificial Intelligence (Vol. 2, pp. 792–798). Hancock, P. J. B., Baddeley, R. J., & Smith, L. S. (1992). The principal components of natural images. Network, 3, 61–70. Harpur, G. F., & Prager, R. W. (1996). Development of low-entropy coding in a recurrent network. Network, 7(2), 277–284. Harth, E., Unnikrishnan, K. P., & Pandya, A. S. (1987). The inversion of sensory processing by feedback pathways: A model of visual cognitive functions. Science, 237, 184–187. Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proc. National Acad. Sciences, 93, 623–627. Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161. Hubel, D. H. (1988). Eye, brain, and vision. New York: Scientific American Library. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160, 106–154. Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. Journal of Neurophysiology, 28, 229–289. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195, 215–243. Irwin, D. E., McConkie, G. W., Carlson-Radvansky, L., & Currie, C. (1994). A localist evaluation solution for visual stability across saccades. Behavioral and Brain Sciences, 17, 265–266. Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 528–534). Cambridge, MA: MIT Press. Jones, E. G. (1981). Anatomy of cerebral cortex: Columnar input-output organization. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 199–235). Cambridge, MA: MIT Press. Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354. Kalman, R. E. (1960). A new approach to linear filtering and prediction theory. Trans. ASME J. Basic Eng., 82, 35–45. Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. ASME J. Basic Eng., 83, 95–108.
Kanizsa, G. (1990). Subjective contours. In I. Rock (Ed.), The Perceptual World: Readings from Scientific American Magazine (pp. 155–163). San Francisco: Freeman. Kawato, M., Hayakawa, H., & Inui, T. (1993). A forward-inverse optics model of reciprocal connections between visual cortical areas. Network, 4, 415–422. Koch, C. (1996). Towards the neuronal substrate of visual consciousness. In S. R. Hameroff, A. W. Kaszniak, & A. C. Scott (Eds.), Towards a Science of Consciousness: The First Tucson Discussions and Debates. Cambridge, MA: MIT Press. Kosslyn, S. M., Thompson, W. I., Kim, I. J., & Alpert, N. M. (1995). Topographical representations of mental images in primary visual cortex. Nature, 378, 496– 498. Li, M., & Vitanyi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117. Luettgen, M. R., & Willsky, A. S. (1995). Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination. IEEE Transactions on Image Processing, 4(2), 194–207. Lund, J. S. (1981). Intrinsic organization of the primate visual cortex, area 17, as seen in Golgi preparations. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 105–124). Cambridge, MA: MIT Press. Luttrell, S. P. (1992). Self-supervised adaptive networks. IEE Proceedings Part F, 139, 371–377. Luttrell, S. P. (1995). A componential self-organising neural network (Tech. Rep. No. DRA/CIS(SE1)/651/11/RP/3.1). Malvern, UK: Defence Research Agency. MacKay, D. M. (1956). The epistemological problem for automata. In C. E. Shannon & J. McCarthy (Eds.), Automata Studies (pp. 235–251). Princeton, NJ: Princeton University Press. Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America, 70, 1297–1300. Matthies, L., Kanade, T., & Szeliski, R. (1989). Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3, 209–236. Maunsell, J. H. R., & Newsome, W. T. (1987). Visual processing in monkey extrastriate cortex. Annual Review of Neuroscience, 10, 363–401. Maybeck, P. S. (1979). Stochastic models, estimation, and control (Vols. 1, 2). New York: Academic Press. McConkie, G. W. (1991). Perceiving a stable visual world. In Proceedings of the Sixth European Conference on Eye Movements (pp. 5–7). Leuven, Belgium: Laboratory of Experimental Psychology. Mignard, M., & Malpeli, J. G. (1991). Paths of information flow through visual cortex. Science, 251, 1249–1251. Miller, E. K., Li, L., & Desimone, R. (1991). A neural mechanism for working and recognition memory in inferior temporal cortex. Science, 254, 1377–1379.
Muller, J. R., Singer, B., Krauskopf, J., & Lennie, P. (1996). Center-surround contrast effects in macaque cortex. Invest. Ophthalmol. Vis. Sci., 37(3), 904. Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 125–152). Cambridge, MA: MIT Press. Murphy, P. C., & Sillito, A. M. (1987). Corticofugal feedback influences the generation of length tuning in the visual pathway. Nature, 329, 727–729. Nelson, J. I., & Frost, B. J. (1978). Orientation-selective inhibition from beyond the classic visual receptive field. Brain Research, 139, 359–365. Nowlan, S. J., & Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473–493. Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Oram, M. W., & Perrett, D. I. (1992). Time course of neural responses discriminating different views of the face and head. Journal of Neurophysiology, 68(1), 70–84. Palmer, L. A., Jones, J. P., & Stepnoski, R. A. (1991). Striate receptive fields as linear filters: Characterization in two dimensions of space. In A. G. Leventhal (Ed.), The Neural Basis of Visual Function (pp. 246–265). Boca Raton: CRC Press. Pece, A. E. C. (1992). Redundancy reduction of a Gabor representation: A possible computational role for feedback from primary visual cortex to lateral geniculate nucleus. In I. Aleksander & J. Taylor (Eds.), Artificial Neural Networks 2 (pp. 865–868). Amsterdam: Elsevier Science. Pentland, A. P. (1992). Dynamic vision. In G. A. Carpenter & S. Grossberg (Eds.), Neural Networks for Vision and Image Processing (pp. 133–159). Cambridge, MA: MIT Press. Pitts, W., & McCulloch, W. S. (1947). How we know universals: The perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9, 127–147. Poggio, T. (1990). A theory of how the brain might work. In Cold Spring Harbor Symposia on Quantitative Biology (pp. 899–910). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Poggio, T., Fahle, M., & Edelman, S. (1992). Fast perceptual learning in visual hyperacuity. Science, 256, 1018–1021. Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319. Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3, 4–16. Rao, R. P. N. (1996). Robust Kalman filters for prediction, recognition, and learning (Tech. Rep. No. 645). Rochester, NY: Dept. of Computer Science, University of Rochester. Rao, R. P. N., & Ballard, D. H. (1995a). An active vision architecture based on iconic representations. Artificial Intelligence (Special Issue on Vision), 78, 461–505. Rao, R. P. N., & Ballard, D. H. (1995b). Dynamic model of visual memory predicts neural response properties in the visual cortex (Tech. Rep. No. 95.4). Rochester,
NY: National Resource Laboratory for the Study of Brain and Behavior, Department of Computer Science, University of Rochester. Rao, R. P. N., & Ballard, D. H. (1996). A class of stochastic models for invariant recognition, motion, and stereo (Tech. Rep. No. 96.1). Rochester, NY: National Resource Laboratory for the Study of Brain and Behavior, Department of Computer Science, University of Rochester. Rao, R. P. N., Zelinsky, G. J., Hayhoe, M. M., & Ballard, D. W. (1996). Modeling saccadic targeting in visual search. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 830–836). Cambridge, MA: MIT Press. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific. Rockland, R. S., & Pandya, D. N. (1979). Laminar origins and terminations of cortical connections of the occipital lobe in the rhesus monkey. Brain Research, 179, 3–20. Rolls, E. (1989). The representation and storage of information in neuronal networks in the primate cerebral cortex and hippocampus. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The Computing Neuron (pp. 125–159). Reading, MA: Addison-Wesley. Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76. Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656. Softky, W. R. (1996). Unsupervised pixel-prediction. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 809–815). Cambridge, MA: MIT Press. Thorpe, S. J., & Imbert, M. (1989). Biological constraints on connectionist modelling. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, & L. Steels (Eds.), Connectionism in Perspective (pp. 63–92). Amsterdam: Elsevier. Tovee, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654. Ullman, S. (1994). Sequence seeking and counterstreams: A model for bidirectional information flow in the cortex. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 257–270). Cambridge, MA: MIT Press. Van Essen, D. C. (1985). Functional organization of primate visual cortex. In A. Peters & E. G. Jones (Eds.), Cerebral Cortex (Vol. 3, pp. 259–329). New York: Plenum. Van Essen, D. C., Anderson, C. H., & Olshausen, B. A. (1994). Dynamic routing strategies in sensory, motor, and cognitive processing. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 271–299). Cambridge, MA: MIT Press. Van Essen, D. C., & Maunsell, J. H. R. (1983). Hierarchical organization and functional streams in the visual cortex. Trends in Neuroscience, 6, 370–375.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262. Williams, R. J. (1985). Feature discovery through error-correction learning (Tech. Rep. No. 8501). Institute for Cognitive Science, University of California at San Diego. Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269, 1880–1882. Young, R. A. (1985). The gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles (Research Publication GMR4920). Warren, MI: General Motors. Zeki, S. M. (1976). The functional organization of projections from striate to prestriate visual cortex in the rhesus monkey. Cold Spring Harbor Symp. Quant. Biology, 40, 591–600. Zeki, S. M. (1978). Uniformity and diversity of structure and function in rhesus monkey prestriate visual cortex. Journal of Physiology (London), 277, 273–290. Zeki, S. M. (1983). Colour coding in the cerebral cortex: The reaction of cells in monkey visual cortex to wavelengths and colours. Neuroscience, 9, 741–765. Zemel, R. S. (1994). A minimum description length framework for unsupervised learning. Unpublished doctoral dissertation, Department of Computer Science, University of Toronto.
Received January 4, 1996; accepted August 6, 1996.
NOTE
Communicated by Jonathan Baxter
Correction to “Lower Bounds on VC-Dimension of Smoothly Parameterized Function Classes”¹

Wee Sun Lee
Peter L. Bartlett
Department of Systems Engineering, RSISE, Australian National University, Canberra, ACT 0200, Australia
Robert C. Williamson Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
The earlier article gives lower bounds on the VC-dimension of various smoothly parameterized function classes. The results were proved by showing a relationship between the uniqueness of decision boundaries and the VC-dimension of smoothly parameterized function classes. The proof is incorrect; there is no such relationship under the conditions stated in the article. For the case of neural networks with tanh activation functions, we give an alternative proof of a lower bound for the VC-dimension proportional to the number of parameters, which holds even when the magnitude of the parameters is restricted to be arbitrarily small.
Theorem 10 from the earlier article, which was used to prove lower bounds on various neural network classes, is incorrect:

Theorem 10. Let A be an open subset of $\mathbb{R}^m$, X be an open subset of $\mathbb{R}^n$, and f: A × X → ℝ be a continuously differentiable function (in all of its arguments). Let $F_\theta := \{\theta(f(a, \cdot)): a \in A\}$, where θ(y) = 1 for y > 0 and θ(y) = 0 otherwise. If there exists a k-dimensional manifold M ⊂ A such that for all $a_1, a_2 \in M$, $a_1 \neq a_2$ implies $\{x \in X: f(a_1, x) = 0\} \neq \{x \in X: f(a_2, x) = 0\}$, then VCdim($F_\theta$) ≥ k.

As counterexamples, observe that $y = (a - x)^2$ has unique decision boundaries but VC-dimension zero. Similarly, $y = (x - a)(x - b)^2$ has unique decision boundaries but VC-dimension 1.
¹ W. S. Lee, P. L. Bartlett, & R. C. Williamson, Neural Computation, 7(5), 1040–1053 (1995).
Neural Computation 9, 765–769 (1997) © 1997 Massachusetts Institute of Technology
Theorem 10 was used to prove a lower bound for the VC-dimension of neural networks with tanh activation functions. We can prove this lower bound more directly using Lemma 9 from the same article:

Lemma 9. Let A be an open subset of $\mathbb{R}^m$, X be an open subset of $\mathbb{R}^n$, and f: A × X → ℝ be a continuously differentiable function (in all of its arguments). Let $F_\theta := \{\theta(f(a, \cdot)): a \in A\}$, where θ(y) = 1 for y > 0 and θ(y) = 0 otherwise. Let $g: A \times X^m \mapsto \mathbb{R}^m$ be defined by $g(a, x^1, \ldots, x^m) = (f(a, x^1), \ldots, f(a, x^m))^T$. Let $g_x(a) = g(a, x)$. Let ϕ be any diffeomorphism from an open subset of A to a subset of $\mathbb{R}^m$, and let $\psi_x = g_x \circ \varphi^{-1}$. If VCdim($F_\theta$) < k, then for every a ∈ A and every $x \in X^m$, $g_x(a) = 0 \Rightarrow \mathrm{rank}(D\psi_x(b)) < k$, where b = ϕ(a).

Definition 1. A two-layer feedforward network with an n-dimensional input $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, k hidden units with activation function σ, and weights $w = (v_{10}, \ldots, v_{kn}, w_0, \ldots, w_k) \in \mathbb{R}^W$ (where W = kn + 2k + 1) is a function $f: \mathbb{R}^W \times \mathbb{R}^n \to \mathbb{R}$ given by

$$f(w, x) = w_0 + \sum_{i=1}^{k} w_i \, \sigma(v_i \cdot x + v_{i0}),$$

where $v_i = (v_{i1}, \ldots, v_{in})$ and $v_i \cdot x = \sum_{j=1}^{n} v_{ij} x_j$. The weights $w_0$ and $v_{i0}$, i = 1, ..., k, are called the offsets.
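For concreteness, the network of Definition 1 is only a few lines of code. The following is an illustrative numpy sketch (the function and variable names are ours, not from the article):

```python
import numpy as np

def two_layer_net(x, w0, w, V, v0, sigma=np.tanh):
    """f(w, x) = w0 + sum_{i=1}^k w_i * sigma(v_i . x + v_i0)  (Definition 1).

    x  : input vector, shape (n,)
    w0 : output offset (scalar)
    w  : second-layer weights w_1, ..., w_k, shape (k,)
    V  : first-layer weights v_ij, shape (k, n)
    v0 : first-layer offsets v_10, ..., v_k0, shape (k,)
    """
    return w0 + w @ sigma(V @ x + v0)

# Example with k = 3 hidden units and n = 2 inputs:
rng = np.random.default_rng(0)
k, n = 3, 2
y = two_layer_net(rng.normal(size=n), 0.1, rng.normal(size=k),
                  rng.normal(size=(k, n)), rng.normal(size=k))
```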
First we give some conditions on the activation function σ under which our result below holds. These conditions are similar to those used in Albertini, Sontag, and Maillot (1993) for proving the independence property. The function tanh satisfies these conditions.

Definition 2. The function σ is admissible if it satisfies the following conditions: it is a real-analytic function, and it extends to a function σ: ℂ → ℂ analytic on a subset D ⊂ ℂ of the form

$$D = \{z \in \mathbb{C}: |\mathrm{Im}\, z| \le \lambda\} \setminus \{z_0, \bar{z}_0\},$$

where Im $z_0$ = λ and ∞ > λ > 0. Here $\bar{z}_0$ is the complex conjugate of $z_0$, and $z_0$ and $\bar{z}_0$ are poles of σ.

We want to use Lemma 9 to find a lower bound for the VC-dimension of a two-layer feedforward network with admissible activation functions. First we show that the following matrix of derivatives with respect to first-layer weights is nonsingular for almost all sets of kn inputs $S = \{x^1, \ldots, x^{kn}\}$. (The
technique used is similar to that in Bartlett and Dasgupta (1995).) Define

$$D(S) = \begin{pmatrix}
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_1 & \sigma'(v_1 \cdot x^2 + v_{10})\, w_1 x^2_1 & \cdots & \sigma'(v_1 \cdot x^{kn} + v_{10})\, w_1 x^{kn}_1 \\
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_2 & & & \vdots \\
\vdots & & & \\
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_n & & & \\
\vdots & & & \\
\sigma'(v_k \cdot x^1 + v_{k0})\, w_k x^1_1 & & & \\
\vdots & & & \\
\sigma'(v_k \cdot x^1 + v_{k0})\, w_k x^1_n & \cdots & & \sigma'(v_k \cdot x^{kn} + v_{k0})\, w_k x^{kn}_n
\end{pmatrix} \qquad (1)$$

Let $\omega_{i,j} = v_i \cdot x^j + v_{i0}$. Define a diagonal matrix

$$Z(S) = \mathrm{diag}\big((\omega_{1,1} - p_{1,1})^m, \ldots, (\omega_{1,n} - p_{1,n})^m, (\omega_{2,n+1} - p_{2,n+1})^m, \ldots, (\omega_{k,kn} - p_{k,kn})^m\big),$$

where m is the multiplicity of the pole of σ′(·) at $p_{i,j}$ and $p_{i,j} \in \{z_0, \bar{z}_0\}$ for each element of Z(S). We will show that there is a value of S such that D(S)Z(S) is nonsingular, which implies that D(S) is nonsingular. Each component of D(S) and Z(S) is an analytic function of S on the set $D^{kn}$; hence the determinant is also an analytic function of S. This means that the determinant is nonzero for almost all values of S in $D^{kn}$, and hence also for almost all values of S when each member of S is drawn from ℝⁿ. Pick an S such that $v_i \cdot x^j + v_{i0} = p_{i,j}$ for each $p_{i,j}$ that appears in a component of Z(S), and such that for all other i, j, $v_i \cdot x^j + v_{i0} \in D$ and is not a pole of σ′(·). This can be done as long as no two hidden units are identical. (For each hidden unit i, we can choose its n corresponding $x^j$'s so that the activations are at the same pole and avoid the hyperplanes defining the poles of the other hidden units. If we choose the components of each $x^j$ small enough, the activations will remain in D.) All the singularities in D(S)Z(S) are removable, and the determinant can be made an analytic function of S. With the singularities removed, the matrix D(S)Z(S) assumes a block diagonal form, as the entries where $v_i \cdot x^j + v_{i0}$ is not a pole become zero. To complete the proof, we need each block in the block diagonal matrix to have full rank. For that, we need only choose the $x^j$'s within each block to be linearly independent while avoiding the finite number of hyperplanes that define the poles for the other hidden units. This is always possible since the input dimension is n and each block is of size n × n. We can use this result to prove the following modification of Theorem 13 in the earlier article. The bound obtained is (k − 1)(n − 1) instead of (k − 1)(n + 1) + 1.
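As an informal numerical illustration of this genericity argument (not part of the proof; the function and variable names are ours), one can draw random tanh-unit weights and kn random inputs and observe that the matrix of equation 1 is virtually always of full rank:

```python
import numpy as np

def derivative_matrix(V, v0, w, X):
    """Matrix of equation 1: rows indexed by first-layer weight (i, j),
    columns by the kn inputs x^1, ..., x^kn; the entry in row (i, j) and
    column l is sigma'(v_i . x^l + v_i0) * w_i * x^l_j, with sigma = tanh,
    so sigma'(z) = 1 - tanh(z)^2."""
    k, n = V.shape
    sp = 1.0 - np.tanh(X @ V.T + v0) ** 2   # (kn, k): sigma' at each activation
    D = np.empty((k * n, k * n))
    for i in range(k):
        for j in range(n):
            D[i * n + j, :] = sp[:, i] * w[i] * X[:, j]
    return D

rng = np.random.default_rng(1)
k, n = 3, 4
D = derivative_matrix(rng.normal(size=(k, n)), rng.normal(size=k),
                      rng.normal(size=k), rng.normal(size=(k * n, n)))
print(np.linalg.matrix_rank(D))   # prints 12 = k*n for almost every draw
```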
Figure 1: Network with weights deleted.
Theorem 3. Let $F_\theta$ be the class of two-layer feedforward networks (thresholded at zero at the output) with k hidden units with tanh activation, input space $X := \{(x_1, \ldots, x_n) \in \mathbb{R}^n: |x_i| < C\}$ (where C is a constant greater than zero), and k(n + 2) + 1 weights restricted to an open set that includes the origin. Then VCdim($F_\theta$) ≥ (k − 1)(n − 1).

Proof. Delete all the weights from the nth input and all the weights connected to the kth hidden unit, except the weight connecting the two of them, as shown in Figure 1. Let the smaller network (with weights deleted) be

$$f(w, x) = w_0 + \sum_{i=1}^{k-1} w_i \tanh(v'_i \cdot x' + v_{i0}) + w_k \tanh(v_{kn} x_n) = g(w', x') + w_k \tanh(v_{kn} x_n),$$

where $w' = (v_{10}, \ldots, v_{k-1,n-1}, w_0, \ldots, w_{k-1})$, $v'_i = (v_{i1}, \ldots, v_{i,n-1})$, and $x' = (x_1, \ldots, x_{n-1})$. We will also fix $w_k$ and $v_{kn}$ to be nonzero constants. Now g(w′, x′) is a network with n − 1 inputs and k − 1 hidden units (the network within the box in Fig. 1). We have shown that for almost all sets of (k − 1)(n − 1) real inputs, the rank of the matrix in equation 1 is at least (k − 1)(n − 1), as long as no two hidden units have the same set of weights. If we choose the outputs for g(w′, ·) in the range of $h: x_n \mapsto -w_k \tanh(v_{kn} x_n)$ (this can be
done since we can choose x′ as small as we like), the inputs to g(w′, ·) can be made to lie on the decision boundary by choosing $x_n$ appropriately. Hence inputs for which the matrix in Lemma 9 has rank (k − 1)(n − 1) exist: almost all sets of (n − 1)-dimensional inputs of size (k − 1)(n − 1) that map g(w′, ·) into the range of $h: x_n \mapsto -w_k \tanh(v_{kn} x_n)$, together with the corresponding $x_n$'s, have this property. Lemma 9 then gives the desired result.

Acknowledgments

We thank Adam Kowalczyk, Yossi Erlich, and Dan Chazan for pointing out the error.

References

Albertini, F., Sontag, E. D., & Maillot, V. (1993). Uniqueness of weights for neural networks. In R. Mammone (Ed.), Artificial Neural Networks for Speech and Vision (pp. 115–125). London: Chapman and Hall.

Bartlett, P. L., & Dasgupta, S. (1995). Exponential convergence of a gradient descent algorithm for a class of recurrent neural networks. In Proceedings of the 38th Midwest Symposium on Circuits and Systems.
Received January 15, 1996; accepted July 12, 1996.
NOTE
Communicated by Ronald Meir
Lower Bound on VC-Dimension by Local Shattering

Yossi Erlich∗
Engineering Department, Technion-Israel Institute of Technology, Haifa, Israel
Dan Chazan IBM Science and Technology Center, Haifa, Israel
Scott Petrack Vocaltec Ltd., Herzeliya, Israel
Avraham Levy Mathematics Department, Technion-Israel Institute of Technology, Haifa, Israel
We show that the VC-dimension of a smoothly parameterized function class is not less than the dimension of any manifold in the parameter space, as long as distinct parameter values induce distinct decision boundaries. A similar theorem was published recently and used to derive lower bounds on the VC-dimension for several cases (Lee, Bartlett, & Williamson, 1995). That theorem is not correct, but our theorem can replace it for those cases and many other practical ones.

1 Introduction

Many classification schemes use some kind of smoothly parameterized function class from which one classifier is chosen according to some learning scheme. The so-called learnability of a classifier class is often described by its VC-dimension, so lower and upper bounds on such VC-dimensions are of great interest. Recently a theorem for a lower bound on the VC-dimension of a smoothly parameterized function class was used to find lower bounds for some interesting cases (Theorem 10 of Lee et al., 1995). The theorem states that the VC-dimension of a parametric family of classifiers is bounded below by the dimension of any manifold in the classifier parameter space on which any two parameter values result in different decision boundaries. Koiran and Sontag (1996) showed that for the case of neural networks, the VC-dimension is lower-bounded by a quantity of the order of the square of the number of weights. However, Theorem 10 of Lee et al.
∗ Current address: T. J. Watson Center, IBM, Yorktown Heights, NY.
Neural Computation 9, 771–776 (1997) © 1997 Massachusetts Institute of Technology
(1995) is of great interest since it is more global and is not an asymptotic bound.

In order to obtain a corrected theorem, it was necessary to modify the definition of the allowable parameterizations of the decision sets. In Lee et al.'s article, any function h(a, x) defines a classification of x into the two classes $s_1 = \{x: h(a, x) > 0\}$ and $s_2 = \{x: h(a, x) < 0\}$. In addition, in the corrected theorem the function h(a, x) is assumed to have the form g(a, y) + z, where x = (y, z). Furthermore, the differentiability requirement with respect to the classified space can be reduced to a convexity assumption. On the other hand, differentiability with respect to the parameter space A is critical; we cannot demand continuity only, since a continuous map does not preserve dimension. For instance, it is possible to define a continuous invertible scalar function $h: \mathbb{R}^k \to \mathbb{R}$ that is used to classify x ∈ ℝ by the sign of x − h(a), where $a \in \mathbb{R}^k$ is the classifier parameter. In this case, the VC-dimension equals 1, regardless of the dimension k of the parameter space $\mathbb{R}^k$, although boundary uniqueness holds.

In additional work, Lee et al. (1997) propose a correction to their original article that bypasses the difficulty with their original theorem. They do this by a direct proof of special cases of the general theorem within the context of a specific application.

Definition 1 (Shattering, VC-dimension). Let F be a set of classifiers over a set X. A finite subset of X is said to be shattered by F if for every dichotomy on it, there exists a classifier in F that realizes it. The VC-dimension of F is the size of the maximal shattered set.

Following the idea of Lee et al. (1995), the theorem is proved by showing that there is a set of k points that is locally shattered. By local shattering we mean that there exists an arbitrarily small open set in the parameter space whose classifiers still shatter the set. The theorem for the case of a single classifier decision boundary is given in section 2. A more general theorem for the case of countably many decision boundaries is stated in Erlich (1996).

2 The Theorem

Theorem. Let A be an open subset of $\mathbb{R}^m$ (the parameter space). Let X be a subset of $\mathbb{R}^n$ (the feature space). If there exist a subset X̂ of X and a k-dimensional manifold M of A such that the following assumptions hold, then the VC-dimension is at least k:

• There exists a parametric representation Y × Z for X̂ such that $Y \subseteq \mathbb{R}^{n-1}$, Z ⊆ ℝ, and X̂ ⊆ Y × Z. (We will represent x ∈ X̂ by x = (y, z), where y ∈ Y, z ∈ Z.)
• X̂ is convex in z for every y ∈ Y: for all $(y, z_1), (y, z_2) \in \hat{X}$ and all 0 ≤ α ≤ 1, $(y, \alpha z_1 + (1 - \alpha) z_2) \in \hat{X}$.

• Let g: M × Y → ℝ be a continuously differentiable function in M that represents a boundary in X̂ in the following sense: for every a ∈ M, the classification of x = (y, z) ∈ X̂ is determined by the sign of z − g(a, y). For a zero argument, the value of the sign function can be chosen arbitrarily; this has no effect on the proof of the theorem.

• The boundaries are unique in M: for all $a_1 \neq a_2 \in M$, there exists y ∈ Y such that $(y, g(a_1, y)), (y, g(a_2, y)) \in \hat{X}$ and $g(a_1, y) \neq g(a_2, y)$.

Proof Outline. Following Lee et al. (1995), the gist of the proof is to show that there exist a parameter value a and k points $y_1, \ldots, y_k$ on the boundary, defined by $g(a, y_i) = z_i$, such that the gradients of g at $(a, y_i)$ with respect to a, $\{\nabla_a g(a, y_i)\}$, are linearly independent. All dichotomies can then be obtained by parameter values reached by infinitesimal variations of the parameter a (following the ideas of Lee et al., 1995).

The proof proceeds by induction. At each step of the induction there exist a point $a_l$ and l points $y_1, \ldots, y_l$ such that the gradients $\{\nabla_a g(a_l, y_i)\}$ span an l-dimensional space. It follows that within the submanifold $M_l$ defined by $g(a, y_i) = z_i$, i = 1, ..., l, there exists a neighborhood of $a_l$ in which the gradients $\{\nabla_a g(a, y_i)\}$ also span an l-dimensional space. It is then shown that a new point $y_{l+1}$ may be found, together with a point $a_{l+1}$ within this neighborhood on the manifold, for which the gradient $\nabla_a g(a_{l+1}, y_{l+1})$ is not perpendicular to the submanifold $M_l$. Because the submanifold is orthogonal to the gradients $\{\nabla_a g(a, y_i)\}$ by definition (since $g(a, y_i)$ is constant within this manifold), it follows that $\nabla_a g(a_{l+1}, y_{l+1})$ is not in the space spanned by $\{\nabla_a g(a_{l+1}, y_i)\}$. For this reason, $y_{l+1}$ may be adjoined to $y_1, \ldots, y_l$ to reestablish the induction hypothesis. In this way, a point $a_k$ is found at the end of the induction for which there exist $y_i$, i = 1, ..., k, such that the gradients $\{\nabla_a g(a_k, y_i)\}$ are all independent.

Proof. Choose two classifiers whose parameters $a_1$ and $a_2$ lie in the parametric k-dimensional manifold M. By the uniqueness condition, we can find $y_1 \in Y$ such that $(y_1, g(a_1, y_1)), (y_1, g(a_2, y_1)) \in \hat{X}$ and $g(a_1, y_1) \neq g(a_2, y_1)$. Without loss of generality, let $g(a_1, y_1) < g(a_2, y_1)$. We choose a smooth (continuously differentiable) curve in M between these two classifiers, h: ℝ → M, such that $h(0) = a_1$ and $h(1) = a_2$. Define the function f: ℝ → ℝ by $f(t) = g(h(t), y_1)$. f is continuously differentiable on [0, 1] and satisfies f(0) < f(1). Therefore, there exists $0 < t_0 < 1$ such that $f'(t_0) \neq 0$ and $f(0) < f(t_0) < f(1)$. Define $a_0 = h(t_0)$ and $z_1 = g(a_0, y_1)$. By the convexity of X̂ for constant y, we conclude that $(y_1, z_1) \in \hat{X}$. By the chain rule,

$$f'(t_0) = (\nabla_a g(a_0, y_1))^T h'(t_0), \qquad (2.1)$$
where $\nabla_a g(a, y_1)$ is the gradient of g with respect to a. Since $f'(t_0) \neq 0$, we conclude that $\nabla_a g(a_0, y_1)$ is not perpendicular to the k-dimensional manifold M. Therefore, there exists an open set $A_1 \subseteq A$ containing $a_0$ such that the set

$$M_1 = \{a \in A_1 \cap M \mid g(a, y_1) = z_1\} \qquad (2.2)$$

is diffeomorphic to a (k − 1)-dimensional ball (this follows directly from the implicit function theorem).

Let us summarize what we have done so far. We have found $(y_1, z_1) \in \hat{X}$ and an open set $A_1$ in the k-dimensional manifold M, in which $z_1 = g(a, y_1)$ defines a (k − 1)-dimensional manifold in M, named $M_1$.

We can continue in the same way and find $(y_2, z_2) \in \hat{X}$, a new point $a_0 \in M_1$, and an open set $A_2$ containing $a_0$ in $M_1$, in which $z_2 = g(a, y_2)$ defines a set $M_2$ that is diffeomorphic to a (k − 2)-dimensional ball. Specifically, $M_2$ is defined by

$$M_2 = \{a \in A_2 \cap M_1 \mid g(a, y_2) = z_2\}. \qquad (2.3)$$
We further limit the open set $A_2$ to one in which $\nabla_a g(a, y_2)$ is not spanned by $\nabla_a g(a, y_1)$. This is possible since the condition is satisfied at the new $a_0$ ($\nabla_a g(a_0, y_1)$ is orthogonal to $M_1$) and since $\nabla_a g(a, \cdot)$ is continuous. Continuing in this way, we get a set of k points

$$(y_i, z_i) \in \hat{X}, \quad i = 1, \ldots, k, \qquad (2.4)$$

and an open set $A_k \subseteq A$, such that the set

$$M_k = \{a \in A_k \cap M_{k-1} \mid g(a, y_k) = z_k\} \qquad (2.5)$$
is a zero-dimensional ball containing a single point (call this point â), where for every i = 2, ..., k, $\nabla_a g(\hat{a}, y_i)$ is not spanned by $\{\nabla_a g(\hat{a}, y_1), \ldots, \nabla_a g(\hat{a}, y_{i-1})\}$. Define B as the k × m matrix whose rows are the $\nabla_a g(\hat{a}, y_i)$. Therefore, B is of full rank. An almost identical proof to that of Lemma 9 in Lee et al. (1995) shows that the VC-dimension is at least k.

The classifier decision boundary is defined differently here than in the original theorem. The motivation for the changes is given by Erlich (1996). The theorem can easily be applied to many cases—for instance, the cases considered by Lee et al. (1995): two-layer neural networks with activation functions satisfying the property IP, of which tanh is one possibility (Theorem 17 of Lee et al., 1995), or radial basis function networks with gaussian basis functions (Theorem 19 of Lee et al., 1995). In
all those cases, boundary uniqueness is proved by writing the boundary in an explicit form, in which one coordinate of X (which can be taken as Z) is given as a function (on a k-dimensional manifold) of the others (which can be taken as Y). For all those cases, therefore, the theorem is easily applied. The theorem can also be used to find a lower bound on the VC-dimension of hidden Markov model (HMM)–based classifiers (Erlich, Merhav, & Chazan, 1996).

Yet there are cases where this theorem is not easily applied. For instance, if X = ℝ, the theorem can be applied only to cases where the boundary is no more than a single point, and consequently k is no more than 1. Another example is the classification of $x = (x_1, x_2) \in \mathbb{R}^2$ by the sign of $(x_1^2 + x_2^2 - a_1)(x_1^2 + x_2^2 - a_2)$, where $a = (a_1, a_2) \in \mathbb{R}^2$ is the parameter space. In this case, simple utilization of the theorem yields only 1 as a lower bound on the VC-dimension, whereas it is desirable to obtain easily the tight bound 2, which is attainable in the spirit of the theorem. These issues are resolved by a more general theorem concerning multiple decision boundaries (Erlich, 1996), in which a countable set of boundaries is considered. We do not give a detailed proof of this generalization, since it is a simple extension of the previous one. For many cases, this lower bound is not tight, typically (although not necessarily) when the shattering is not local. (For instance, it is easy to construct classes of classifiers, defined by a single parameter, whose VC-dimension is infinite.)¹

¹ For an example of such a feedforward neural net, see Sontag (1992).

Acknowledgments

We thank Wee Sun Lee, Peter Bartlett, and Bob Williamson for private correspondence concerning their work and Ronny Meir for his helpful comments.

References

Erlich, Y. (1996). On HMM based speech recognition using the MCE approach. M.Sc. thesis, Technion-Israel Institute of Technology. (Written in Hebrew; abstract in English.)

Erlich, Y., Merhav, N., & Chazan, D. (1996). On the MCE approach for training HMM based speech recognition systems. Unpublished manuscript.

Koiran, P., & Sontag, E. (1996). Neural networks with quadratic VC dimension. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (vol. 8, pp. 197–203). Cambridge, MA: MIT Press.

Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1995). Lower bounds on the VC dimension of smoothly parameterized function classes. Neural Computation, 7, 1040–1053.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1997). Correction to "Lower bounds on the VC dimension of smoothly parameterized function classes." Neural Computation, 9, 765–769.

Sontag, E. (1992). Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45, 20–48.
Received January 10, 1996; accepted September 5, 1996.
Communicated by Shimon Edelman
SEEMORE: Combining Color, Shape, and Texture Histogramming in a Neurally Inspired Approach to Visual Object Recognition

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, California 90089 USA
Severe architectural and timing constraints within the primate visual system support the conjecture that the early phase of object recognition in the brain is based on a feedforward feature-extraction hierarchy. To assess the plausibility of this conjecture in an engineering context, a difficult three-dimensional object recognition domain was developed to challenge a pure feedforward, receptive-field-based recognition model called SEEMORE. SEEMORE is based on 102 viewpoint-invariant nonlinear filters that as a group are sensitive to contour, texture, and color cues. The visual domain consists of 100 real objects of many different types, including rigid (shovel), nonrigid (telephone cord), and statistical (maple leaf cluster) objects and photographs of complex scenes. Objects were individually presented in color video images under normal room lighting conditions. Based on 12 to 36 training views, SEEMORE was required to recognize unnormalized test views of objects that could vary in position, orientation in the image plane and in depth, and scale (factor of 2); for nonrigid objects, recognition was also tested under gross shape deformations. Correct classification performance on a test set consisting of 600 novel object views was 97 percent (chance was 1 percent) and was comparable for the subset of 15 nonrigid objects. Performance was also measured under a variety of image degradation conditions, including partial occlusion, limited clutter, color shift, and additive noise. Generalization behavior and classification errors illustrate the emergence of several striking natural shape categories that are not explicitly encoded in the dimensions of the feature space. It is concluded that in the light of the vast hardware resources available in the ventral stream of the primate visual system relative to those exercised here, the appealingly simple feature-space conjecture remains worthy of serious consideration as a neurobiological model.

Neural Computation 9, 777–804 (1997) © 1997 Massachusetts Institute of Technology

1 Introduction

The human visual system can recognize unprimed views of common objects at sustained rates in excess of 10 per second (Potter, 1976; Subramaniam,
Biederman, Kalocsai, & Madigan, 1995). In neurophysiological terms, the implication of this astonishing fact is that an object's identity can be computed by a 100-ms-wide time slice of neural activity propagating through the recognition "pipe" from the retina to the higher reaches of the visual system. At typical firing rates of visual neurons, we may surmise that under these special conditions, each neuron in the visual pathway devotes fewer than a dozen action potentials to the processing of any single object's image. While these figures bear primarily on the throughput of the recognition pipe (objects per second), a recent electroencephalogram study reveals that the length of the pipe from retina to recognition is probably no longer than 155 ms (DiGirolamo & Kanwisher, 1995). These results are also roughly consistent with known response latencies of neurons in the object recognition areas of inferotemporal cortex, which range around 100 ms (Oram & Perrett, 1992; Gochin, Colombo, Dorfman, Gerstein, & Gross, 1994). Beyond such feats of sheer speed, the human visual system easily recognizes a wide variety of stimuli, including rigid, articulated, and entirely nonrigid two-dimensional (2D) and three-dimensional (3D) objects, faces, statistical or fractal objects, surface textures, and views of complex natural scenes, while displaying remarkable insensitivity to changes in viewpoint, partial occlusion, and clutter.

How can a visual system work so well and so fast? The circumstances of biological vision suggest a solution of low time complexity (few steps) but high space complexity (many parallel processes)—synaptic connections among neurons in the cortical object recognition pathway probably number between 10 and 100 trillion.¹ One appealingly simple notion is that the visual system is organized as a feedforward feature-extracting hierarchy that builds progressively more complex and viewpoint-invariant features useful for identifying objects, where invariance over a group of transformations is achieved by summing over viewpoint-specific elemental representations (Pitts & McCulloch, 1947). This class of algorithm has been brought to bear with great success, for example, in the realm of optical character recognition (Fukushima, Miyake, & Ito, 1983; LeCun et al., 1990). According to this view, and in correspondence with the axioms of statistical pattern recognition, visual object recognition in the brain is the process of mapping retinal pixels into a feature space that is better suited (than pixels) to the viewpoint-invariant classification and identification problems faced by visual animals. Within this feature space, represented by the activity of neurons at the top of the hierarchy, the similarity of one object view to another is given by a simple (e.g., Euclidean) distance calculation; recognition of an input is achieved by finding the identity or class of the most similar training view previously
¹ This assumes 10¹⁴ to 10¹⁵ synapses in the brain, 90 percent of these due to the cerebral cortex, one-third of these due to the visual system, and one-third of these accounting for the "what" pathway.
stored in memory. Feature dimensions are chosen such that large changes in object pose produce relatively small excursions in feature space, while small changes in object "quality" (shape, texture, color) produce relatively large excursions in feature space. Where 3D objects or scenes are to be recognized over large regions of the viewing sphere, a small set of reference views is stored during learning, leading to a "view-based" approach (Edelman & Bülthoff, 1992; Logothetis, Pauls, Bülthoff, & Poggio, 1994; Murase & Nayar, 1995).

In evaluating the plausibility of this feedforward feature-extraction scenario as a model for biological vision, it is useful to consider several desirable properties of a feature-space representation as dictated by the "computational ecology" of natural vision. In particular, we might expect the brain to construct visual features that are:

1. Large in number, relating to the fact that sparse, high-dimensional feature representations provide fundamental advantages for fast, inexpensive recognition, especially large signal-to-noise ratios that allow objects to be recognized prior to explicit segmentation (Califano & Mohan, 1994; Kanerva, 1988).

2. Useful, that is, high-level features should be relatively sensitive to object quality—and hence identity—but relatively insensitive to an object's pose or configuration.

3. Dominated by spatially localized measures, relating to the need to cope with nonrigid object transformations, which often preserve local but not global structure; the need to cope with object textures, defined in large part by local relative-orientation structure; and the need to cope with occlusion and clutter, which are least disruptive to an object's internal code when derived from features with localized support.

4. Driven by multiple visual cues, relating to the need to maximize object discrimination power by using all available visual cues, the need to represent objects of many different types richly, and the need to buffer the visual representation of objects or scenes against a variety of forms of image degradation, to which different visual cues are by nature differentially sensitive.

In this context, recent neurophysiological results (Kobatake & Tanaka, 1994; Logothetis et al., 1994) that have extended classical results from other groups (Gross, Rocha-Miranda, & Bender, 1972; Perrett, Rolls, & Caan, 1982; Schwartz, Desimone, Albright, & Gross, 1983; Desimone, Albright, Gross, & Bruce, 1984) are intriguing, as they demonstrate a substantial population of neurons in the "object recognition areas" of the primate visual system (Mishkin, 1982) that respond best to specific complex minipatterns (e.g., localized conjunctions of contour, texture, and color elements). In many cases, these neurons exhibit considerable insensitivity to changes in viewpoint-
related parameters, such as stimulus position and scale (Ito, Tamura, Fujita, & Tanaka, 1995), while remaining selective for their preferred minipatterns. Such empirical data are thus on their face suggestive of a feature-extraction hierarchy designed to cope with known ecological pressures. However, serious theoretical concerns persist regarding the limitations of bottom-up feature-space approaches, cropping up under a variety of guises:

1. Feature-space methods are essentially template-matching methods and so require an intractable number of templates to cope with inputs that have been rotated, scaled, partly occluded, nonrigidly transformed, or presented under varying lighting conditions.

2. Feature-space approaches lack a top-down component essential for resolving featural ambiguities present in real images.

3. Feature-space methods do not scale well to high dimensions (i.e., large numbers of features), either because it is not practical to learn or otherwise assemble a sufficiently large number of sufficiently useful features, because too many dimensions of noise necessarily overwhelm too few dimensions of signal, or because high-dimensional methods are computationally intractable or require too much data, even for a brain.

Most of these concerns have been addressed in recent years, as feature-space approaches and their variants have been applied with success in a variety of object recognition domains (Fukushima et al., 1983; Swain & Ballard, 1991; Viola, 1993; Lades et al., 1993; Califano & Mohan, 1994; Murase & Nayar, 1995; Rao & Ballard, 1995; Amit, Geman, & Wilder, 1995; Schiele & Crowley, 1996). However, the conjecture that a feature-space approach based on feedforward receptive-field-style computations could account for the prodigious recognition capacities of the primate brain as yet lacks direct support in the modeling literature. Existing approaches have generally involved one or more assumptions that place them squarely outside the biological "paradigm" for general-purpose visual recognition, such as use of a small object corpus (typically no more than a few dozen objects), a limited range of object types (e.g., faces or rigid volumetric objects), feature computations not amenable to receptive-field-style implementation (e.g., use of sophisticated geometric invariants), or strong viewpoint assumptions or corresponding explicit image prenormalization operations (typical in optical character recognition and face recognition).

Against this backdrop, SEEMORE was developed to assess more directly the plausibility of the feature-space account as a biological model for rapid, general-purpose object recognition. Key design goals prescribed a visual representation that (1) relied exclusively on geometrically simple receptive-field-style computations, (2) operated directly on input images without shift, scale, or other explicit object prenormalization steps, (3) could cope with a large number of real 3D objects of many different types, (4) could recognize objects over 6 degrees of freedom of viewpoint, gross nonrigid shape dis-
tortions, and partial occlusion, and (5) exhibited a significant capacity for generalization across natural object categories. SEEMORE's visual representation is composed of 102 feature channels that emphasize spatially localized filter computations and are collectively sensitive to contour shape, color, and texture cues. SEEMORE's architecture is similar in spirit to the color histogramming approach of Swain and Ballard (1991) but includes spatially structured features that also provide for shape-based generalization. Experiments reveal good recognition performance in a viewpoint- and configuration-invariant 3D object recognition problem with 100 objects, including objects that are rigid, nonrigid, and "statistical" in nature, as well as photographs of complex scenes. Perhaps most interesting, SEEMORE's patterns of generalization reveal the emergence of several striking object categories that are not explicitly encoded in the 102 feature-space dimensions. (A short report describing this work appeared in Mel, 1996.)

2 Methods

2.1 SEEMORE's Visual World. SEEMORE's database contains 100 common 3D objects and photographs of scenes, each represented by a set of presegmented color video images (see Figure 1). The training set consisted of 12 to 36 views of each object, as follows. For rigid objects, 12 training views were chosen at roughly 60-degree intervals in depth around the viewing sphere (see Figure 2A), and each view was then scaled to yield a total of three images at 67 percent, 100 percent, and 150 percent. Image-plane orientation was allowed to vary arbitrarily. For nonrigid objects, 12 training views were chosen in random poses (see Figure 2B).

During a recognition trial, SEEMORE was required to identify novel test images of the database objects. For rigid objects, test images were drawn from the viewpoint interstices of the training set, excluding highly foreshortened views (e.g., the bottom of a can). Each test view could therefore be presumed to be correctly recognizable but was never closer than roughly 30 degrees in orientation in depth or 22 percent in scale to the nearest training view of the object, while position and orientation in the image plane could vary arbitrarily. For nonrigid objects, test images consisted of entirely novel random poses. Each test view depicted the object presented alone on a smooth background.

In some recognition trials, test views were systematically degraded. The five degradation conditions were (1) scrambled, in which the object view was cut up and the pieces rearranged and reflected (see Figure 3A), (2) occluded, in which 50 percent of the object view was blacked out (see Figure 3B), (3) cluttered, in which a second object—a randomly colored lowercase character—was superimposed onto the margin of the image to add limited pattern clutter while minimizing occlusion (see Figure 3C), (4) colorized, in which two of the three RGB channels were scaled either up or down by 30 per-
Figure 1: The database contains 100 objects of many different types, including rigid (soup can), nonrigid (necktie), and statistical (bunch of grapes) objects, and photographs of complex indoor and outdoor scenes.
cent while maintaining constant intensity (see Figure 3D), and (5) noisy, in which the images were subjected to uniform additive pixel noise with mean 30 percent of the maximum pixel value (see Figure 3E).

2.2 Feature Channels. SEEMORE's internal representation of a view of an object is encoded by a set of feature channels. The ith channel is based on an elemental nonlinear filter $f_i(x, y, \theta_1, \theta_2, \ldots)$, parameterized by position in the visual field and zero or more internal degrees of freedom (see Figure 4). Each channel is by design relatively sensitive to changes in the image that are strongly related to object identity, such as the object's shape, color, or
Figure 2: (A) Training views of rigid objects were sampled uniformly about the 3D viewing sphere. (B) Training views of nonrigid objects were drawn at random from the object’s configuration space.
texture, while remaining relatively insensitive to changes in the image that are unrelated to object identity, such as are caused by changes in the object's pose. In practice, this invariance is achieved in a straightforward way for each channel by subsampling and summing the output of the elemental channel filter over the entire visual field and one or more of its internal degrees of freedom, giving a channel output $F_i = \sum_{x, y, \theta_1, \ldots} f_i(\cdot)$. For example, a particular shape-sensitive channel might "look" for the image-plane projections of right-angle corners, over the entire visual field, 360 degrees of rotation in the image plane, 30 degrees of rotation in depth, one octave in scale, and tolerating partial occlusion or slight misorientation of the elemental contours that define the right angle. In general, then, $F_i$ may be viewed as a "cell" with a large receptive field whose firing rate is an estimate of the number of occurrences of distal feature i in the visual work space over a large range of viewing parameters.

SEEMORE's architecture consists of 102 feature channels, whose outputs form an input vector to a nearest-neighbor classifier (see Figure 5). Following the design of the individual channels, the channel vector $F = \{F_1, \ldots, F_{102}\}$ is (1) insensitive to changes in image-plane position and orientation of the object, (2) modestly sensitive to changes in object scale, orientation in depth, or nonrigid deformation, but (3) highly sensitive to object quality as it pertains to object identity. Within this representation, total memory storage for all views of an object ranged from 1224 to 3672 integers.

As shown in Figure 6, SEEMORE's channels fall into five groups: (1) 23 color channels, each of which responds to a small blob of color parameterized by "best" hue and saturation, (2) 11 coarse-scale intensity corner channels parameterized by open angle, (3) 12 "blob" features,
Figure 3: Sensitivity of recognition performance to various forms of image degradation was studied. Examples of the five degradation conditions are shown: (A) scrambled, (B) occluded, (C) cluttered, (D) colorized, and (E) noisy.
Figure 4: The ith feature channel is based on an elemental nonlinear filter $f_i(x, y, \theta_1, \ldots)$, which is subsampled across the image, at a range of image-plane orientations, and over any of the filter's additional internal degrees of freedom. The output of the ith channel, $F_i$, is the sum over all elemental filter outputs within that channel.
parameterized by the shape (round and elongated) and size (small, medium, and large) of bright and dark intensity blobs, (4) 24 contour-shape features, including straight angles, curve segments of varying radius, and parallel and oblique line combinations, and (5) 16 shape- and texture-related features based on the outputs of Gabor functions at five scales and eight orientations. The implementations of the channel groups were crude, in the interest of achieving a working, multiple-cue system with minimal development time. Images were grabbed using an off-the-shelf Sony S-Video camcorder and a SunVideo digitizing board that provided 11 color bits per pixel (YUV); images were
Figure 5: SEEMORE’s architecture consists of a set of channels that provides input to a nearest-neighbor classifier. Since each channel is by itself insensitive to translation and rotation in the image plane and is slowly varying as an object rotates in depth, the output vector F for the complete set of channels is similarly invariant to these transformations. However, F remains highly sensitive to changes in object quality, and hence object identity.
converted to RGB format for all subsequent processing. Implementation details of the 102 feature channels are described below.

2.2.1 Twenty-three Color Channels. The image was first low-pass-filtered by two octaves using a five-element separable mask (1/16, 1/4, 3/8, 1/4, 1/16) and subsampled to a size of 120 × 120 color pixels. In the following, for R, G, B ∈ [0, 1], hue, saturation, and intensity (HSI) are defined as

$$H = \arctan\!\left(\frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)}\right), \qquad S = 1.0 - \frac{\min(R, G, B)}{I}, \qquad I = \frac{R + G + B}{3},$$

and sigmoids g and h with threshold θ and gain s are defined as $g(x, \theta, s) = 1/(1 + \exp((\theta - x) \cdot s))$ and h = 1 − g. Of the 23 color channels, 22 (11 hues at 2 levels of saturation) were derived as follows. Around the hue circle, 1D "receptive field" centers were placed at 11 evenly spaced hues, including 0 degrees. Response profiles $hue_i$, i ∈ {1, ..., 11}, were triangular and symmetrical, returning a peak score of 1.0 in response to the center hue, with a linear decay to 0 over a 45-degree radius. The 11 high- and low-saturation unit pairs shared the same hue centers. The high-saturation unit outputs $y_i^{hisat}$ were computed as a product of the hue score and two sigmoidal factors that (1) fell to 0 as the saturation crossed into the range of the low-saturation units and (2) fell to 0 at low-intensity values where chromaticity
Figure 6: SEEMORE's 102 channels fall into five groups: (1) 23 circular hue/saturation channels, (2) 11 coarse-scale intensity corner channels, (3) 12 circular and oriented-intensity blobs, (4) 24 contour-shape features, including curves, junctions, and parallel and oblique contour pairs, and (5) 16 oriented-energy and relative-orientation features based on the outputs of Gabor functions at several scales and orientations.
information is unreliable, giving $y_i^{hisat} = hue_i \cdot g(S, 0.4, 10) \cdot g(I, 0.2, 20)$. The low-saturation unit outputs $y_i^{lowsat}$ were computed as a product of the hue score and three sigmoidal factors—the additional sigmoid was needed to bound the saturation value both from above and below—giving $y_i^{lowsat} = hue_i \cdot h(S, 0.4, 10) \cdot g(S, 0.15, 20) \cdot g(I, 0.2, 20)$. A separate white unit selected for pixels of high intensity and low saturation, with an output given by $y^{white} = g(I, 0.6, 20) \cdot h(S, 0.15, 20)$. Black pixels were ignored. Each of the 23 color functions was evaluated at every pixel in the image; any response value exceeding 0.1 generated a count within a histogram containing 23 bins, one for each "color."

2.2.2 Eleven Intensity Corner Channels. The image was first low-pass-filtered and subsampled by two octaves to a size of 120 × 120 intensity pixels. A crude oriented intensity edge detector was run in 12 increments of 30 degrees around the circle centered at each pixel; the length of the edge was approximately 20 pixels when mapped back into the full-resolution image (480 × 480). The criterion for detection of an oriented edge was twofold: (1) an intensity gradient of a consistent sign should exist across the edge at multiple sites along the length of the edge, and (2) the mean deviation of intensity (from its average) should be small along the length of the edge on both sides. Both criteria were computed from pairs of pixels symmetrically straddling the edge along its length, as shown in Figure 7. The numerical score for an edge defined in terms of left and right pixels from three pixel pairs indexed by i was given by

$$y = \prod_i g(|I_i^R - I_i^L|, 0.08, 30) - \lambda \sum_i \left(|I_i^R - \bar{I}^R| + |I_i^L - \bar{I}^L|\right),$$

where $\bar{I}^R$ and $\bar{I}^L$ denote the mean intensities along the right and left sides of the edge, and λ = 0.8 controlled the relative importance of the gradient versus smoothness constraints. The result was passed through a hard binary threshold giving 1 for y > 0 and 0 otherwise. The 12 oriented binary edge maps were used to find intensity corners, which were sought at every pixel. Corners were parameterized only by open angle (from 30 to 330 degrees in 30-degree increments), yielding 11 corner types and 11 corresponding histogram bins. A corner was scored whenever two edge segments of the proper relative orientation and offset were found in the image at any absolute orientation, and the corresponding histogram bin was incremented.

2.2.3 Twelve Intensity Blob Channels. The image was first low-pass-filtered and subsampled by two, three, and four octaves to yield image sizes from 120 × 120 down to 30 × 30 intensity pixels. The operations described below were carried out at each of the three scales. An intensity blob consisted of a set of intensity gradients of the proper sign in an elliptical configuration about a central pixel. Blobs were thus elliptical versions of the straight edge segments of Figure 7, based on several sigmoidally modulated pixel-pair intensity differences. However, in lieu of three gradient terms straddling a straight edge combined multiplicatively, a sum of 12 gradient terms
Figure 7: Straight-oriented edge segments were detected in the intensity image based on (1) sign consistency of cross-edge intensity differences, measured at several sample pairs of pixels, and (2) along-edge intensity "smoothness," measured as the mean deviation in intensity on either side of the putative edge.
across an elliptical contour was computed. A smoothness term was again given by the mean deviation of intensity along the inside and outside of the circular contour; this term was subtracted from the across-contour term with λ = 1.0. If a threshold of 4.0 was exceeded, a blob was scored, and the corresponding histogram bin was incremented. The 12 bins corresponded to light or dark, round or elongated intensity blobs, at three scales.

2.2.4 Forty Generalized Contour Channels. The image was low-pass-filtered as before but maintained at full resolution (480 × 480). An oriented edge detector was run that differed in three ways from the one underlying the intensity corner channels. First, edges consisted of five pairs of across-edge pixels spanning a total length of approximately 18 original image pixels at full resolution and were computed at 7.5-degree intervals around the half-circle. Second, an edge was scored iff all five gradient terms exceeded a hard threshold (in lieu of a product of sigmoids), and no along-contour smoothness penalty term was used. However, the gradient threshold differed from
edge to edge, given by an estimate of the along-edge gradient (the average difference of end pixel values on both sides of the edge). Third, an edge was scored if it existed according to the above criteria in any of the RGB channels, but neither the RGB channel nor the polarity of the edge was stored. As such, the output of the edge-detecting stage was a set of 24 binary contour maps representing orientations from 0 to 172.5 degrees, in 7.5-degree intervals, at a resolution of 240 × 240 pixels. Every edge in the binary edge maps was "generalized" (copied) into a 2 × 2 block of neighboring pixels, in order to effectively relax the position tolerances in the subsequent compound-feature detection step. Forty compound contour features were then tested for at every pixel (240 × 240) and orientation (48 steps of 7.5 degrees). A compound feature consisted of six oriented contours with proper offsets and relative orientations; when at least five were present, the compound feature was scored, and its histogram bin was incremented. Various optimizations were used to speed computation, including heuristics for early rejection of both pixels and compound features. As shown in Figure 6, the features included seven long, simple contours (straight plus 6 degrees of curvature), nine corner features with angles from 30 to 150 degrees in 15-degree steps (analogous to intensity corners but with longer limbs, higher angular resolution, and tolerance of up to one missing link), and parallel and oblique line pairs (12 each, at a range of separations from 16 to 82 pixels in increments of 6 pixels).

2.2.5 Sixteen Gabor-Derived "Texture" Channels. The largest central square in the original image was reduced to 128 × 128 pixels. Forty Gabor filters were applied to the image at eight orientations and five scales using a fast Fourier transform (FFT) (scales varied from X pixels per cycle to Y pixels per cycle in powers of √2). Energy images were computed by squaring and summing corresponding pixels in the sine and cosine components of the Gabor output. The energy pixel values were then summed across all eight orientation images at each scale, giving the total oriented energy at each scale as the values of the first five texture channels. The next five texture channels consisted of the variances of the oriented energy at each scale, obtained by computing the mean squared deviation of the total energy at each orientation from the mean total energy over all orientations, for that scale. High variances signified images with energy distributed nonuniformly across the orientation channels, such as an image with one or two dominant orientations. The Gabor energy images were then passed through a fixed binary thresholding operation, chosen for each scale to emphasize localization of perceptually relevant oriented image structure. The remaining six texture channels measured the probability of relative orientation energy in the image, parameterized by orientation difference (0 degrees, 45 degrees, 90 degrees) and image distance (d < 30 pixels, d ≥ 30 pixels). All relevant suprathreshold pixel pairs were considered and used to generate counts in the six histogram bins.
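To make the Gabor-energy construction concrete, the following is a minimal numpy/scipy sketch of the first ten texture channels (total oriented energy per scale, plus its variance across the eight orientations). The kernel size, envelope width, and wavelength values are our own illustrative assumptions, since the text leaves the exact scale range unspecified:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(size, wavelength, theta, sigma):
    """Cosine/sine Gabor kernels at orientation theta (radians)."""
    ax = np.arange(size) - size // 2
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    u = xx * np.cos(theta) + yy * np.sin(theta)
    env = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    phase = 2.0 * np.pi * u / wavelength
    return env * np.cos(phase), env * np.sin(phase)

def texture_channels(img, wavelengths, n_orient=8):
    """Total oriented energy per scale and its variance across orientations."""
    totals, variances = [], []
    for wl in wavelengths:
        energies = []
        for o in range(n_orient):
            ck, sk = gabor_pair(33, wl, np.pi * o / n_orient, sigma=wl / 2.0)
            c = fftconvolve(img, ck, mode="same")      # cosine component
            s = fftconvolve(img, sk, mode="same")      # sine component
            energies.append(np.sum(c ** 2 + s ** 2))   # sin^2 + cos^2 energy
        energies = np.asarray(energies)
        totals.append(energies.sum())                              # channels 1-5
        variances.append(np.mean((energies - energies.mean()) ** 2))  # channels 6-10
    return totals, variances

img = np.random.default_rng(2).random((128, 128))
tot, var = texture_channels(img, wavelengths=[4, 4 * 2 ** 0.5, 8, 8 * 2 ** 0.5, 16])
```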
2.3 The Learning Rule. While nearest-neighbor classification techniques are remarkably powerful given their simplicity (Friedman, 1994), the problem of scaling feature dimensions in order to minimize classification error in high dimensions remains an experimental art. The need for such optimization is particularly acute in cases, including the present one, where feature dimensions are grossly inhomogeneous in terms of their variances, entropy, and individual classification power, and where strong, poorly characterized correlations exist among the feature dimensions. The notion that features should simply be normalized by their variances (i.e., "sphering" the data) takes account neither of the individual utility of the features for the classification problem at hand nor of the correlations between features. One approach to this problem has been to project a high-dimensional feature space onto the low-dimensional subspace in which the exemplars of each class are as tightly clustered as possible while the class means are as widely dispersed as possible (e.g., Fisher's linear discriminant; Duda & Hart, 1973). Where dimensionality reduction is not needed for computational or other reasons, classification in all available dimensions in principle allows for improved classification performance. Thus, related methods involve finding a "clustering transformation" of the input space that maximizes between-class distances while minimizing within-class distances (Fukunaga, 1990).

In the approach taken here, an objective function that predicts classification performance using all dimensions is maximized by gradient ascent. The formulation is based on the standard assumption that, on average, two views of the same object are more similar to each other than two views of different objects. Thus, measurements are taken centered at every view j in the database, comparing the average distance from j to views of the same object with the average distance from j to views of different objects, where the comparison is normalized by a local measure of the dispersion of interclass distances. In this way, the mean distance to views of the same object can be treated as a z-score with respect to the distribution of distances to different objects (see Figure 8). The normalization is local to each view since the dispersion of object views can vary substantially in different regions of the feature space. Classification performance is expected to increase as these z-scores increase, averaged over the entire database of views.

We begin by writing the weighted city-block distance between two views represented as N-dimensional vectors j, k as

$$D_{jk} = D(j, k, w) = \sum_{i=1}^{N} w_i D^i_{jk},$$

where i is the feature index, $w_i$ is a weight, and $D^i_{jk} = |j_i - k_i|$ is the distance along the single feature dimension i. The mean distance between view j and all views k of the same object, or of different objects, is written, respectively, as

$$D_{\equiv j} = \langle D_{jk} \rangle, \ \forall k \equiv j$$
Figure 8: Graphical representation of $D_{\equiv j}$, $D_{\not\equiv j}$, and $\sigma_{\not\equiv j}$. Measured distances from j to members of the same object class (shown as solid lines) give rise to distribution 1, whose mean is $D_{\equiv j}$. Similarly, measured distances from j to members of all other classes give rise to distribution 2, whose (larger) mean is $D_{\not\equiv j}$. The predicted "goodness" of the feature space in the neighborhood of view j, $G_j$, is then given by the z-score $(D_{\not\equiv j} - D_{\equiv j}) / \sigma_{\not\equiv j}$.
or

$$D_{\not\equiv j} = \langle D_{jk} \rangle, \ \forall k \not\equiv j,$$

where the identity sign is used to indicate all views k of the same (≡ j) or different (≢ j) object as j. Analogous mean distance quantities are defined for the single feature dimension i, that is,

$$D^i_{\equiv j} = \langle D^i_{jk} \rangle, \ \forall k \equiv j \qquad \text{and} \qquad D^i_{\not\equiv j} = \langle D^i_{jk} \rangle, \ \forall k \not\equiv j.$$

The dispersion about view j of distances to views k of different objects is given by the standard deviation, for both the one-dimensional and N-dimensional cases:

$$\sigma^i_{\not\equiv j} = \sqrt{\big\langle (D^i_{jk} - D^i_{\not\equiv j})^2 \big\rangle}, \ \forall k \not\equiv j \qquad \text{or} \qquad \sigma_{\not\equiv j} = \sqrt{\big\langle (D_{jk} - D_{\not\equiv j})^2 \big\rangle}, \ \forall k \not\equiv j.$$
Both one-dimensional and N-dimensional feature "goodness" measures may then be defined in the neighborhood of view j as

$$G^i_j = \frac{D^i_{\not\equiv j} - D^i_{\equiv j}}{\sigma^i_{\not\equiv j}} \qquad \text{and} \qquad G_j = \frac{D_{\not\equiv j} - D_{\equiv j}}{\sigma_{\not\equiv j}}.$$
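A minimal numpy sketch of these per-view z-scores, under the assumption that the stored views are rows of a matrix F and every object has at least two stored views (all names are our own illustrative choices):

```python
import numpy as np

def goodness_per_view(F, labels, w):
    """z-score G_j for every stored view j.

    F      : (V, N) array, one channel vector per stored view
    labels : (V,) array, object identity of each view
    w      : (N,) array of channel weights
    """
    V = F.shape[0]
    idx = np.arange(V)
    G = np.empty(V)
    for j in range(V):
        d = np.abs(F - F[j]) @ w           # weighted city-block D_jk for all k
        same = labels == labels[j]
        d_same = d[same & (idx != j)]      # distances to views of the same object
        d_diff = d[~same]                  # distances to views of other objects
        G[j] = (d_diff.mean() - d_same.mean()) / d_diff.std()
    # The global measure of the next paragraph is simply G.mean().
    return G
```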
The global N-dimensional feature goodness measure is then given by $G = \langle G_j \rangle$. G is maximized when, averaged over all views j in the database, the average distance to views of the same object is many standard deviations smaller than the average distance to views of different objects. G was optimized by gradient ascent, using the following weight-update rule:

$$\Delta w_i \propto \frac{\delta G}{\delta w_i} = \left\langle \left[\, G^i_j - G_j \cdot \mathrm{Cov}_{\not\equiv j}(D^i, D) \,\right] \frac{\sigma^i_{\not\equiv j}}{\sigma_{\not\equiv j}} \right\rangle, \ \forall j, \qquad (2.1)$$

where

$$\mathrm{Cov}_{\not\equiv j}(D^i, D) = \frac{\big\langle (D^i_{jk} - D^i_{\not\equiv j})(D_{jk} - D_{\not\equiv j}) \big\rangle, \ \forall k \not\equiv j}{\sigma^i_{\not\equiv j} \, \sigma_{\not\equiv j}}$$
is the covariance of the one-dimensional distance (for feature i) and the weighted N-dimensional city-block distance from view j to views of different objects. Loosely speaking, the square-bracketed term in equation 2.1 indicates that a feature's weight tends to increase until its individual goodness accounts for a specific fraction of the global feature goodness, where the fraction is given by the covariance term.

2.4 Assessing Recognition Performance. SEEMORE's recognition performance was assessed quantitatively as follows. A test set consisting of 600 novel views (100 objects × 6 views) was culled from the database, as previously described. In some cases, the raw images were preprocessed according to one of the five degradation conditions (scrambled, occluded, colorized, noisy, or cluttered). Then the intact or degraded images were presented to SEEMORE for identification.

In the course of this work, it was noted empirically that a compressive transform on the feature dimensions (histogram values) led to improved classification performance; prior to all learning and recognition operations, therefore, each feature value was replaced by its natural logarithm (0 values
were first replaced with a small positive constant to prevent the logarithm from blowing up). For each test view j, the distance $D_{jt}$ was computed for every training view t in the database, and the nearest neighbor was chosen as the best match.² The logarithmic transform of the feature dimensions thus tied D to the ratios of individual feature values in two images rather than to their arithmetic differences.

3 Results

3.1 Intact Images. Recognition time on a Sparc-20 was 1–2 minutes per view; the bulk of the time was devoted to shape processing, with under 2 seconds required for matching. Recognition results are summarized in Table 1, reported as the proportion of test views that were correctly classified. Performance using all 102 channels for the 600 novel object views in the intact test set was 96.7 percent; the chance rate of correct classification was 1 percent. Across recognition conditions, second-best matches usually accounted for approximately half the errors. Results were broken down in terms of the separate contributions to recognition performance of color-related versus shape-related feature channels. Performance using only the 23 color-related channels was 87.3 percent, and using only the 79 shape-related channels it was 79.7 percent. Remarkably, very similar performance figures were obtained for the subset of 90 test views of the nonrigid objects, which included several scarves, a bike chain, a necklace, a belt, a sock, a necktie, a maple leaf cluster, a bunch of grapes, a knit bag, and a telephone cord. Thus, a novel random configuration of a telephone cord was as easily recognized as a novel view of a shovel.

3.2 Degraded Images. Results for the scrambled, occluded, colorized, noisy, and cluttered test sets are also shown in Table 1. Under the scrambling manipulation, overall recognition performance was only modestly affected, due primarily to the stability of the color channels under this manipulation, while the shape-related channels showed a significant drop in performance, as expected (from 80 to 62 percent). When test views were half-occluded, recognition performance fell to 79 percent, due to the disruptive effect of occlusion on both color- and shape-related channels; the
² In informal experimentation with a number of variants of the nearest-neighbor classifier, including a variety of distance metrics, typically only small differences in performance were seen, and the reasons for these differences were obscure. This work emphasized the power of the visual representation rather than the benchmarking of standard classifiers and their variants; given that nearest-neighbor classifiers are fast and conceptually simple, and often perform well in high-dimensional spaces in comparison to more sophisticated classifiers (Friedman, 1994), this single method was used throughout the trials reported in this article.
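A minimal sketch of the matching stage described above (compressive log transform followed by weighted city-block nearest-neighbor lookup); the flooring constant is an assumption, since the text specifies only "a small positive constant."

```python
import numpy as np

def preprocess(hist, floor=1e-3):
    """Compressive transform described in the text: replace zeros with a
    small positive constant, then take the natural logarithm."""
    return np.log(np.maximum(hist, floor))

def classify(test_hist, train_feats, train_labels, w):
    """Nearest-neighbor match under the weighted city-block distance.
    train_feats is a (T, N) array of already-preprocessed training views;
    w is the length-N vector of learned feature weights."""
    D = np.abs(train_feats - preprocess(test_hist)) @ w   # D_jt for all t
    return train_labels[np.argmin(D)]
```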
Table 1: Summary of Results. Entries are the percentage of test views correctly classified.

                  Intact  Nonrigid  Scrambled  Occluded  Cluttered  Colorized  Noisy
Shape only          79.7      76.7       62.2      38.2       57.3       43.5   35.8
Color only          87.3      94.4       86.5      72.2       61.2        6.8   47.2
Color and shape     96.7      97.8       93.7      79.0       79.0       19.8   58.3
the relatively severe disruption of the shape representation was in part due to the more spatially heterogeneous distribution of distinctive (i.e., necessary) shape cues and in part to the introduction of spurious contours and other accidental features at the occluding boundary. The color shift manipulation cut recognition performance to under 20 percent, an indication of the complete lack of color constancy in SEEMORE's feature channels. Note that while shape-related channels did not explicitly indicate the color of origin of their respective features, they depended on both color and intensity and were thus affected by the colorization manipulation as well. The clutter condition, which simulated the presence of a distractor object, dropped the correct recognition rate to just below 80 percent. The additive noise condition disrupted recognition performance along both shape and color axes, cutting overall performance to 58 percent, indicative of poor high-frequency noise tolerance in the underlying feature channels.

3.3 Generalization Behavior and Recognition Errors. Numerical indexes of recognition performance are useful but do not explicitly convey the similarity structure of the underlying feature space. A more qualitative but extremely informative representation of system performance lies in the sequence of images in order of increasing distance from a test view. Records of this kind are shown in Figure 9 for several recognition trials. In each, a test view is shown at the far left highlighted in red (gray), and the ordered set of nearest neighbors is shown to the right. When a test view's nearest neighbor (second image from left) was not the correct match, the trial was classified as a recognition error in Table 1. In Figure 9A, generalization behavior can be seen when only color channels were used for recognition, illustrating the limits of a simple "color histogramming" system. Errors of this kind occurred for objects with very similar color content and were caused by changes in lighting conditions that tipped the balance in favor of an incorrect object match. In Figure 9B, generalization behavior can be seen when only shape-related channels were used for recognition. Thus, as shown in row 1, a view of a book is judged most similar to a series of other books (or the bottom of a rectangular cardboard box), each a view of a rectangular object with high-frequency surface markings.
Figure 9: Generalization and errors. In each row, a novel test view is shown at the left (outlined in red). The sequence of best matching training views (one per object) is shown to the right, in order of decreasing similarity. (A) Examples of color generalization and errors (i.e., excluding shape-related features). (B) Examples of shape and texture generalization and errors, excluding color features. (C) Examples of generalization and errors using all 102 color- and shape-related channels.
A similar sequence can be seen in subsequent rows for (2) a series of cans, each a right cylinder with detailed surface markings, (3) a series of smooth, not-quite-round objects, (4) a series of photographs of complex scenes, and (5) a series of dinosaurs (followed by a teddy bear). In certain cases, SEEMORE's shape-related similarity metric was more difficult to interpret visually or verbalize (last two rows), or was substantially different from that of a human observer. When all 102 color- and shape-related channels were used for recognition, the few remaining errors occurred among views of objects that shared common shape and color features (see Figure 9C). These errors plainly illustrate the limitations of SEEMORE's visual discrimination power, exemplified by chronic confusion among the Coke can, rubber cement can, and a Campbell's soup can. The confusion among these three similarly proportioned right cylinders with red and white surface markings accounted for 3 of the 20 errors for the intact test set. An additional 3 errors were explained by confusion between two similar photographs.

4 Discussion

4.1 Recognition Performance and Generalization Behavior. As can be seen in Table 1, color alone is a remarkably powerful cue for object recognition, even in the total absence of shape information. This result is consistent with the results of Swain and Ballard (1991), who successfully recognized objects using color histograms alone and first drew the tentative analogy between histogram bins and neural receptive fields. However, when the color signatures of objects mimic each other and lead to recognition errors, the errors are generally "inexcusable" to a human observer (see Figure 9A). Thus, despite the utility of color for identification of specific instances of objects, the preeminence of shape information in object vision is clear, both for the definition of object categories (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976) and in the process of rapid object naming (Biederman & Ju, 1988).

In this context, several aspects of SEEMORE's shape representations deserve closer examination. First, the ecology of natural object vision gives rise to an apparent contradiction: generalization in shape space must in some cases permit an object whose global shape has been grossly perturbed to be matched to itself, such as the various tangled forms of a telephone cord, but quasi-rigid basic-level shape categories (e.g., chair, shoe, tree) must be preserved as well and distinguished from each other. A partial resolution to this conundrum lies in the observation that locally computed shape statistics are in large part preserved under the global shape deformations that nonrigid common objects (e.g., scarf, bike chain) typically undergo. A feature-space representation with an emphasis on locally derived shape channels will therefore exhibit a significant degree of invariance to global nonrigid shape deformations.
The principal sources of change to the local image-plane shape statistics under nonrigid shape deformations include (1) actual bending, rending, or warping of the object surfaces; (2) visual foreshortening as object surfaces change orientation in depth; (3) the addition and subtraction of object features that move in and out of self-occlusion; and (4) the addition and subtraction of spurious local shape statistics across accidental self-occluding boundaries. (Each of the latter three categories of feature changes applies to rigid objects as well.) The representational challenge, then, is to accumulate a sufficiently rich set of local shape statistics, which individually "tolerate" some degree of bending, warping, and foreshortening (categories 1 and 2), so that as a group they overwhelm in importance the feature-space excursions due to "capricious" feature changes (categories 3 and 4). The situation for rigid objects is somewhat less problematic, in that shape statistics are stable over longer spatial scales, and the problems associated with self-occlusion are generally less severe.

The definition of shape similarity embodied in the present approach is that two objects are similar if they contain similar profiles (histograms) of their shape measures, which emphasize locality. One way of understanding the emergence of global shape categories, then, such as "book," "can," and "dinosaur," is to view each as a set of instances of a single canonical object whose local shape statistics remain quasi-stable as it is warped into various global forms. In many cases, particularly within rigid object categories, exemplars may share longer-range shape statistics as well.

It is useful to consider one further aspect of SEEMORE's shape representation, pertaining to an apparent mismatch between the simplicity of the shape-related feature channels and the complexity of the shape categories that can emerge from them. Specifically, the order of binding of spatial relations within SEEMORE's shape channels is relatively low, consisting of single simple open or closed curves, or conjunctions of two oriented contours or Gabor patches. The fact that shape categories, such as "photographs of rooms" or "smooth, lumpy objects," cluster together in a feature space of such low binding order would therefore at first seem surprising. This phenomenon relates closely to the notion of "wickelfeatures" (Wickelgren, 1969; see also Rumelhart & McClelland, 1986, Chap. 18), where features that bind spatial information only locally, before being globally pooled, are nonetheless able to represent global patterns (words) with little or no residual ambiguity. For example, features sensitive to individual letters, but not their positions, are not sufficient to distinguish words such as ngtcnrooiie and recognition. In contrast, features sensitive to letter pairs, analogous to SEEMORE's shape features, are typically sufficient to eliminate representational ambiguity (i.e., multiple inverse images), even in the absence of relative positional information among the detected pairs. Thus, not only does no other English word contain the same set of adjacent letter pairs as recognition (co, ec, gn, io, it, ni, og, on, re, ti), but no other possible word contains the same letter pairs (conjecture without proof!). One mechanism underlying this phenomenon is that the overlapping subcomponents of the various features provide a form of implicit glue or binding that probabilistically constrains the spatial relations between features, even though these are not given explicitly within the representation. Thus the e in ec is likely to be the same e as in re, producing a high a posteriori probability for the compound rec.
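The letter-pair argument above is easy to check mechanically; a minimal sketch:

```python
def letter_pairs(word):
    """Set of adjacent letter pairs, ignoring order and position: the
    analogue of SEEMORE's low-order conjunctive shape features."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

# The two words have identical letter histograms...
assert sorted("ngtcnrooiie") == sorted("recognition")
# ...so position-blind single-letter features cannot tell them apart,
# but their adjacent letter-pair sets differ:
assert letter_pairs("recognition") == {"re", "ec", "co", "og", "gn",
                                       "ni", "it", "ti", "io", "on"}
assert letter_pairs("ngtcnrooiie") != letter_pairs("recognition")
```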
It is important to recall, however, that in direct opposition to the pressure for precise representation of shape is the pressure to generalize across different shapes within the same shape category, or between views of the same object in different configurations. One strategy for coping with this trade-off may be to maintain separate representations at a range of specificities, coupled with a source of expertise to choose between them.

4.2 Rationale for Choice of Feature Channels. SEEMORE's feature channels were designed following two main guidelines: as abstractions of known feature types seen in the ventral stream of the primate visual system, and to represent information in images that is known to be important for object perception. The level of abstraction was roughly as follows: color-sensitive cells, texture-sensitive cells, cells responding to contour conjunctions, cells distinguishing curved from straight, and so forth. Within these general design guidelines, the specific selectivities and spatial invariances that defined SEEMORE's feature channels were largely shaped by two additional forces: (1) the contents of the object database, which implicitly rendered certain image cues more salient than others for purposes of classification, and (2) the relatively severe, monolithic nature of the recognition task, which entailed nearly complete viewpoint uncertainty from trial to trial. Both of these influences make it likely that the details of SEEMORE's feature channels differ from those of any real animal, whose image statistics inevitably differ from those contained in SEEMORE's database, and for whom very different viewpoint assumptions, requiring different degrees of invariance, hold in different behavioral contexts. As such, SEEMORE's "neurophysiology" (feature channel responses) and "psychophysics" (recognition performance and patterns of generalization) do not yet provide a detailed or comprehensive model of the neurophysiological or psychophysical record for any particular animal species. Indeed, the fitting of existing empirical data along these lines has not been a goal of this work so far. On the other hand, SEEMORE's representations fall easily within the family of receptive-field-based representations seen in biological visual systems and, under many of the same challenges faced by "real" visual systems, have exhibited levels of recognition performance and patterns of generalization that seem surprising given the simplicity of the underlying computations. Given that the scale of visual hardware in the ventral stream of the primate visual system devoted to object recognition likely outstrips SEEMORE's 102 feature channels by multiple orders of magnitude, it seems a reasonable possibility that the gap between the performance of a visual "simpleton" like SEEMORE and his much more gifted cousins in the animal kingdom is largely a matter of scale.
In contrast to the engineering-driven approach adopted here, features could in principle have been learned. Unsupervised learning approaches have been used to discover statistical structure in natural images that is strikingly close to the structure of simple-cell receptive fields in the mammalian visual system (Olshausen & Field, 1996). In the context of recognition, however, the problem of feature learning is more difficult, in that statistical structure present in an ensemble of images may or may not be useful. The utility of image features depends on the specific classification problem that the animal faces, which can change from moment to moment; consider the processing of faces for identity versus age versus gender versus emotional state, and so forth. The discovery of useful structure is thus a problem requiring task-dependent supervision and potentially large quantities of labeled data. If the features needed for recognition are highly descriptive, as needed in difficult recognition domains, then the search spaces for either unsupervised or supervised approaches to feature learning are huge. Strong representational biases could perhaps make such approaches more tractable.
4.3 Limitations and Extensions. The presegmentation of objects in the test sets used here is a simplifying assumption that is clearly invalid in the real world. The advantage of the assumption from a methodological perspective is that the object similarity structure induced by the feature dimensions can be studied independent of the problem of segmenting or indexing objects embedded in complex scenes. Thus, a representation that does not permit good discrimination among a large number of objects over a range of 3D viewpoints and internal configurations, or does not impose appropriate category structure among objects, need not be considered further as a component of a system that deals with objects embedded within scenes.

The cluttered test condition (see Figure 3C) was designed to assay explicitly the sensitivity of SEEMORE's visual representations to additive pattern noise. While the figures in Table 1 show a seven-fold increase in error rate for cluttered test images relative to the intact condition (21 percent versus 3 percent), these figures must be interpreted with care. First, as for the case of intact images, approximately half (44 percent) of the "errors" in the cluttered case were second-best matches from a database of 100 objects and so do not represent total recognition failure. Moreover, the act of cluttering (or occluding or color shifting) a test image can introduce legitimate accidental similarities between the modified test view and a view of another (i.e., "incorrect") object. Since SEEMORE is entirely unsuspecting of these manipulations and is asked simply to compare incoming views to familiar views at face value, some of the committed errors may be excusable, to the
extent that the degraded object view does look most similar to a view of an incorrect object to the eye of a human observer. When pattern clutter is made more severe than that illustrated in Figure 3C, however, recognition performance suffers dramatically. One of the main reasons for this is that SEEMORE's feature channels are currently far from sparse in their activity profiles across images; most feature channels are activated to some degree by most object views. Under these circumstances, the representations of multiple objects in the visual field additively collide within the same feature channels, making separate recognition difficult or impossible.

Two kinds of remedies for this problem have been suggested by biology and prior work. The first is crude image segmentation based on a movable fovea or other focal attention system (Koch & Ullman, 1985; Niebur & Koch, 1994). This type of strategy typically minimizes, but cannot eliminate, the additive visual noise induced by extraneous objects outside the focus of attention. The second, more potent remedy is the leap to a sparse, very-high-dimensional space, whose mathematical advantages for vision and recognition have been discussed at length elsewhere (Kanerva, 1988; Califano & Mohan, 1994). For example, in a domain of 2D line drawings, Califano and Mohan (1994) have demonstrated position-, orientation-, and scale-invariant recognition in multiple-object scenes with no prior segmentation, using on the order of 10^6 highly descriptive contour-based features (based on conjunctions of triplets of contour segments). Intuitively, an object is recognized when a sufficient number of sufficiently distinctive features "vote" for the presence of a particular object, regardless of whether other objects also get votes. The leap into high-dimensional feature space is easily arranged combinatorially, by conjoining existing (or other) low-order features into compound features of higher order.

The issue of feature locality remains crucial, however. If the desired combinatorial explosion is achieved by conjoining elemental shape tokens over relatively long distances in the image, the resulting shape-based similarity metric can permit extremely fine shape distinctions but may lead to poor generalization under nonrigid shape deformation and partial occlusion. This approach is most relevant in feature-poor images, in which the variety of local token configurations is limited. In feature-dense images, such as natural images, the combinatorial explosion can be achieved while maintaining locality within elemental feature configurations, thus preserving the constellation of desirable representational properties that locality confers. One means of effectively increasing the richness of localized feature configurations is to include elemental filters simultaneously sensitive to multiple visual cues, such as contour, color, shading, and texture cues. Interestingly, the preferred responses of pattern-selective cells in the primate IT cortex frequently involve explicit binding of contour shapes with particular colors and/or textures (Kobatake & Tanaka, 1994). In this context, we may interpret these data as evidence of an implicit neural strategy whose goal is to achieve sparse high dimensionality in order to simplify object indexing in cluttered scenes, while retaining locality, in order to facilitate viewpoint-, configuration-, and occlusion-insensitive shape-based generalization. In continuing work, we are pursuing this course.
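The following sketch illustrates the voting scheme described above in its simplest form. The conjunction order, the vote threshold, and the representation of tokens and the model index are illustrative assumptions, not details of Califano and Mohan's system.

```python
from collections import Counter
from itertools import combinations

def conjunctive_features(tokens, k=3):
    """Form high-order features by conjoining k elemental tokens.  Each
    token is any hashable local descriptor (e.g., a quantized contour
    fragment); the combinatorial explosion into a sparse, very-high-
    dimensional space is what makes individual conjunctions distinctive.
    Note that conjoining over the whole token set ignores the locality
    constraint discussed in the text."""
    return {frozenset(c) for c in combinations(tokens, k)}

def recognize(scene_tokens, model_index, min_votes=10):
    """Each distinctive conjunction votes for the models that contain it;
    objects are reported independently, so multiple objects in an
    unsegmented scene can win simultaneously."""
    votes = Counter()
    for f in conjunctive_features(scene_tokens):
        for obj in model_index.get(f, ()):
            votes[obj] += 1
    return [obj for obj, v in votes.items() if v >= min_votes]
```

A model index for this sketch would be built by applying conjunctive_features to the tokens of each training view and recording, for every resulting feature, the object it came from.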
Acknowledgments

Thanks to József Fiser for useful discussions and for development of the Gabor-based channel set, to Dan Lipofsky and Scott Dewinter for help in constructing the image database, and to Christof Koch for providing support at Caltech, where this work was initiated. This work was funded by the Office of Naval Research and the McDonnell-Pew Foundation.

References

Amit, Y., Geman, D., & Wilder, K. (1995). Recognizing shapes from simple queries about geometry (Tech. Rep.). Chicago: University of Chicago, Department of Statistics.

Biederman, I., & Ju, G. (1988). Surface vs. edge-based determinants of visual recognition. Cognitive Psychology, 20, 38–64.

Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.

Desimone, R., Albright, T. D., Gross, C., & Bruce, C. (1984). Stimulus-selective properties of inferior temporal neurons in the macaque. J. Neurosci., 4, 2051–2062.

DiGirolamo, G., & Kanwisher, N. (1995). Accessing stored representations begins within 155 ms in object recognition. Paper presented at the 36th Annual Meeting of the Psychonomics Society, Los Angeles.

Duda, R. O., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.

Edelman, S., & Bülthoff, H. (1992). Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Res., 32, 2385–2400.

Friedman, J. H. (1994). Flexible metric nearest neighbor classification (Tech. Rep.). Stanford University, Department of Statistics and Stanford Linear Accelerator Center.

Fukunaga, K. (1990). Introduction to statistical pattern recognition. New York: Academic Press.

Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man and Cybernetics, SMC-13, 826–834.

Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G., & Gross, C. (1994). Neural ensemble coding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.

Gross, C., Rocha-Miranda, C., & Bender, D. (1972). Visual properties of neurons in inferotemporal cortex of the macaque. J. Neurophysiol., 35, 96–111.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. J. Neurophysiol., 73, 218–226.

Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.

Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71, 856–867.

Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiol., 4, 219–227.

Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.

Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Society Press.

Logothetis, N., Pauls, J., Bülthoff, H., & Poggio, T. (1994). View-dependent object recognition by monkeys. Current Biology, 4, 401–414.

Mel, B. (1996). SEEMORE: A view-based approach to 3-D object recognition using multiple visual cues. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, pp. 865–871). Cambridge, MA: MIT Press.

Mishkin, M. (1982). A memory system in monkey. Philos. Trans. R. Soc. Lond. B., 298, 85–95.

Murase, H., & Nayar, S. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14, 5–24.

Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.

Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84.

Perrett, D., Rolls, E., & Caan, W. (1982). Visual neurons responsive to faces in the monkey temporal cortex. Exp. Brain Res., 47, 329–342.

Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147.

Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522.

Rao, R., & Ballard, D. (1995). An active vision architecture based on iconic representations. Artificial Intelligence, 78, 461–505.

Rosch, E., Mervis, C., Gray, W., Johnson, D., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.

Rumelhart, D., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2, pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press.

Schwartz, E., Desimone, R., Albright, T., & Gross, C. (1983). Shape recognition and inferior temporal neurons. Proc. Natl. Acad. Sci. USA, 80, 5776–5778.

Subramaniam, S., Biederman, I., Kalocsai, P., & Madigan, S. R. (1995). Accurate identification, but chance forced-choice recognition for RSVP pictures. Poster presented at the Annual Meeting of the Association for Research in Vision and Ophthalmology, Ft. Lauderdale, FL.

Swain, M. J., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.

Viola, P. (1993). Feature-based recognition of objects. In Proc. of the AAAI Fall Symposium on Learning and Computer Vision (pp. 60–65).

Wickelgren, W. A. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15.
Received April 29, 1996; accepted September 19, 1996.
Communicated by Peter König
Image Segmentation Based on Oscillatory Correlation

DeLiang Wang
Department of Computer and Information Science and Center for Cognitive Science, The Ohio State University, Columbus, Ohio 43210, USA
David Terman
Department of Mathematics, The Ohio State University, Columbus, Ohio 43210, USA
We study image segmentation on the basis of locally excitatory, globally inhibitory oscillator networks (LEGION), whereby the phases of oscillators encode the binding of pixels. We introduce a lateral potential for each oscillator so that only oscillators with strong connections from their neighborhood can develop high potentials. Based on the concept of the lateral potential, a solution for removing noisy regions in an image is proposed for LEGION, so that the network suppresses the oscillators corresponding to noisy regions without affecting those corresponding to major regions. We show that the resulting oscillator network separates an image into several major regions, plus a background consisting of all the noisy regions, and we illustrate the network's properties by computer simulation. The network exhibits a natural capacity for segmenting images. The oscillatory dynamics leads to a computer algorithm, which is applied successfully to segmenting real gray-level images. A number of issues regarding biological plausibility and perceptual organization are discussed. We argue that LEGION provides a novel and effective framework for image segmentation and figure-ground segregation.

1 Introduction

The segmentation of a visual scene (image) into a set of coherent patterns (objects) is a fundamental aspect of perception that underlies a variety of tasks, such as image processing, figure-ground segregation, and automatic target recognition. Scene segmentation plays a critical role in the understanding of natural scenes. Although humans perform it with apparent ease, the general problem of image segmentation remains unsolved in sensory information processing. As the technology of single-object recognition has become more and more advanced, the demand for a solution to image segmentation has increased, since both natural scenes and manufacturing applications of computer vision are rarely composed of a single object.

Objects appear in a natural scene as the grouping of similar sensory features and the segregation of dissimilar ones. Sensory features are generally
taken to be local, and in the simplest case they may correspond to single pixels. To approach the problem of scene segmentation, three basic issues must be addressed: What are the cues that determine grouping and segregation? What is the proper representation for the result of segmentation? How are the cues used to give rise to segmentation?

Much is known about the sensory cues that are important for segmentation. In particular, Gestalt psychology has uncovered a set of principles guiding the grouping process in the visual domain (Wertheimer, 1923; Koffka, 1935; Rock & Palmer, 1990). These principles work together to produce segmentation. We briefly summarize some of the most important principles (see also Rock and Palmer, 1990):

• Proximity. The closer features lie to each other, the more easily they are grouped into the same segment.

• Similarity. Features that have similar attributes, such as grayness, color, depth, or texture, tend to group together.

• Common fate. Features that have similar temporal behavior tend to group together. For instance, a group of features that move coherently (common motion) would form a single object. Notice that common fate may be regarded as one aspect of similarity. We list it separately to emphasize the importance of time as a separate dimension.

• Connectedness. A uniform, connected region, such as a spot, line, or more extended area, tends to form a single segment.

• Good continuation. A set of features that form a smooth and continuous curve tend to group together.

• Prior knowledge. If a set of features belong to the same familiar pattern, they tend to group together.

In computer vision algorithms for image segmentation, the result of segmentation can be represented in many ways. However, it is not a trivial task to represent the outcome of segmentation in a neural network. One proposal is naturally derived from the so-called neuron doctrine (Barlow, 1972), where neurons at higher brain areas are assumed to become more selective, and eventually a single neuron represents each single object (the grandmother-cell representation). Multiple objects in a visual scene would be represented by the coactivation of multiple units at some level of the nervous system. This representation faces major theoretical and neurobiological problems (von der Malsburg, 1981; Abeles, 1991; Singer, 1993). Another proposal relies on temporal correlation to encode the binding (Milner, 1974; von der Malsburg, 1981; Abeles, 1982). In particular, the correlation theory of von der Malsburg (1981) asserts that an object is represented by the temporal correlation of the firing activities of the scattered cells that encode different features of the object. Multiple objects are represented by different correlated firing patterns that alternate in time, each corresponding to a single object.
Temporal correlation provides an elegant way to represent the result of segmentation. A special form of temporal correlation is oscillatory correlation, where the basic unit is a neural oscillator (see Terman & Wang, 1995; Wang & Terman, 1995a). However, this representation does not by itself reveal how segmentation is achieved using Gestalt grouping principles. Despite an extensive body of literature dealing with segmentation using temporal correlation (starting perhaps from von der Malsburg & Schneider, 1986), little progress has been made in building successful neural systems for image segmentation.

There are two major challenges facing the oscillatory correlation theory. The first challenge is how to achieve fast synchronization within a population of locally coupled oscillators. Most of the models proposed for achieving phase synchrony rely on all-to-all connections (see section 2 for more details). However, as Sporns, Tononi, and Edelman (1991) and Wang (1993a) pointed out, a network with full connections indiscriminately connects all the oscillators activated simultaneously by different objects, because the network is dimensionless and loses critical information about geometry. The second challenge is how to achieve fast desynchronization among different groups of oscillators representing distinct objects. This is necessary in order to segment multiple objects that are simultaneously presented.

We have previously proposed a neural network framework to deal with the problem of image segmentation, called locally excitatory, globally inhibitory oscillator networks (LEGION) (Wang & Terman, 1995a; Terman & Wang, 1995). Each oscillator is modeled as a standard relaxation oscillator. Local excitation is implemented by positive coupling between neighboring oscillators, and global inhibition is realized by a global inhibitor. LEGION exhibits the mechanism of selective gating, whereby oscillators stimulated by the same pattern tend to synchronize due to local excitation, and oscillator groups stimulated by different patterns tend to desynchronize due to global inhibition (Wang & Terman, 1995a; Terman & Wang, 1995). We have proven that, with the selective gating mechanism, LEGION rapidly achieves both synchronization within groups of oscillators that are stimulated by connected regions and desynchronization between different groups. In sum, LEGION provides an elegant solution to both challenges outlined above.

In this article, we study LEGION for segmenting real images. Before we demonstrate image segmentation, the original version of LEGION needs to be extended to handle images with many tiny (noisy) regions. One such example is shown in Figure 1, where three objects and a noisy background form a visual image. Without extension, LEGION would treat each region, no matter how small, as a separate segment, and would thus produce many tiny fragments. We call this problem fragmentation. A more serious problem is that it is difficult to choose parameters so that LEGION is able to achieve more than several (5 to 10) segments (Terman & Wang, 1995). Noisy fragments may therefore compete with major image regions for becoming segments, so that it may not be possible to extract the major segments from an image.
Figure 1: A caricature of an image with three objects that appear on a noisy background. The noiseless caricature is adapted from Terman and Wang (1995).
The problem of fragmentation is solved by introducing the concept of a lateral potential for each oscillator. The extended dynamics is fully analyzed (Wang & Terman, 1996), and the resulting LEGION network is applied to gray-level images and yields successful segmentation. A preliminary version of this work was presented in Wang and Terman (1995b).

In the next section, we review prior work relevant to image segmentation and neural networks. In section 3, our model is described in detail. In section 4, computer simulations of the extended LEGION network are presented. Section 5 presents the segmentation results on real images. Further discussion concerning our approach is given in section 6.

2 Related Work

2.1 Image Segmentation Algorithms. Due to its critical importance for computer vision, image segmentation has been studied extensively. Many techniques have been invented (for reviews of the subject, see Zucker, 1976;
Haralick, 1979; Haralick & Shapiro, 1985; Sarkar & Boyer, 1993b). There are three broad categories of algorithms: pixel classification, edge-based organization, and region-based segmentation. A simple classification technique is thresholding: a pixel is assigned a specific label if some measure of the pixel passes a certain threshold. This idea can be extended to more complex forms, including multiple thresholds, which are determined by pixel histograms (Kohler, 1981). Edge-based techniques generally start with an edge-detection algorithm, which is followed by grouping edge elements into rectilinear or curvilinear lines. These lines are then grouped into boundaries that can be used to segment images into various regions (see, for example, Geman, Geman, Graffigne, & Dong, 1990; Sarkar & Boyer, 1993a; Foresti, Murino, Regazzoni, & Vernazza, 1994). Finally, region-based techniques operate directly on regions. A classical method is region growing and splitting (or split-and-merge; see Horowitz & Pavlidis, 1976; Zucker, 1976; Adams & Bischof, 1994), where iterative steps are taken to grow (or split) pixels into a connected region if all the pixels in the region satisfy some conditions. One of the apparent deficits of these algorithms is their iterative (serial) nature (Liou, Chiu, & Jain, 1991). There are some recent algorithms that are partially parallel (Liou, Chiu, & Jain, 1991; Mohan & Nevatia, 1992; Manjunath & Chellappa, 1993). Most of these techniques rely on domain-specific heuristics to perform segmentation, and no unified computational framework exists to explain the general phenomenon of scene segmentation (Haralick & Shapiro, 1985). The problem of scene segmentation is computationally hard (Gurari & Wechsler, 1982) and largely regarded as unsolved.

2.2 Neural Network Efforts. Neural networks have proven to be a successful approach to pattern recognition (Schalkoff, 1992; Wang, 1993b). Unfortunately, little work has been devoted to scene segmentation, which is generally regarded as part of preprocessing (often meaning manual segmentation). Scene segmentation is a particularly challenging task for neural networks, partly because traditional neural networks lack the representational power for encoding multiple objects simultaneously. Interesting schemes of segmentation based on learning have been proposed (Sejnowski & Hinton, 1987; Mozer, Zemel, Behrmann, & Williams, 1992). Grossberg and Wyse (1991) proposed a model for segmentation based on the contour-detection model of Grossberg and Mingolla (1985; see Gove, Grossberg, & Mingolla, in press, for a computer simulation). However, all of these methods were tested only on small synthetic images, and it is not clear how they can be extended to handle real images. Also, Kohonen's self-organizing maps have been used for segmentation based on pixel classification (Kohonen, 1995; Koh, Suk, & Bhandarkar, 1995). A primary drawback of these methods is that the number of segments (objects) is assumed to be known a priori.

Because temporal (oscillatory) correlation offers an elegant way of representing multiple objects in neural networks (von der Malsburg & Schneider, 1986),
most of the neural network efforts on image segmentation have centered around this theme. In particular, the discovery of synchronous oscillations in the visual cortex has triggered much interest in exploring oscillatory correlation to solve the problems of segmentation and figure-ground segregation. One type of model uses all-to-all connections to reach synchronization (Wang, Buhmann, & von der Malsburg, 1990; Sompolinsky, Golomb, & Kleinfeld, 1991; von der Malsburg & Buhmann, 1992). As explained in section 1, these models cannot extend very far in solving the segmentation problem, because fundamental information concerning the geometry among sensory features is lost. Another type of model uses lateral connections to reach synchrony (Sporns et al., 1991; Murata & Shimizu, 1993; Schillen & König, 1994). Unfortunately, it is unclear to what extent these oscillator networks can synchronize on the basis of local connectivity, since no analysis is given and only simulation results on small networks are provided. Moreover, recent insights into the contrasting behavior of sinusoidal and relaxation oscillators make it clear that sinusoid-type oscillators, which encompass most of the oscillator models used, have severe limitations in supporting fast synchronization (Wang, 1995; Terman & Wang, 1995; Somers & Kopell, 1995). In fact, in none of the above models has anything close to a real image ever been used for testing.

3 Model Description

The building block of LEGION, a single oscillator i, is defined as a feedback loop between an excitatory unit x_i and an inhibitory unit y_i, whose time derivatives are defined as

x′_i = 3x_i − x_i³ + 2 − y_i + I_i H(p_i + exp(−αt) − θ) + S_i + ρ   (3.1a)

y′_i = ε(γ(1 + tanh(x_i/β)) − y_i).   (3.1b)
Here H stands for the Heaviside step function, which is defined as H(ν) = 1 if ν ≥ 0 and H(ν) = 0 if ν < 0. I_i represents external stimulation, which is assumed to be applied from time 0 on, and S_i denotes the coupling from other oscillators in the network. ρ denotes the amplitude of gaussian noise, the mean of which is set to −ρ. The negative mean is used to reduce the chance of self-generating oscillations, which will become clear in the next paragraph. The noise term is introduced for two purposes. The first one is obvious: to test the robustness of the system. The second one, perhaps more important, is to play an active role in separating different input patterns (for more discussion, see Terman & Wang, 1995). The parameter ε is a small, positive number. Hence equation 3.1, without any coupling or noise and with constant stimulation, corresponds to a standard relaxation oscillator.

The x-nullcline of equation 3.1 is a cubic curve, while the y-nullcline is a sigmoid function. If I > 0 and H = 1, these curves
intersect along the middle branch of the cubic when β is small. In this case, we call the oscillator enabled (see Figure 2A). It produces a stable periodic orbit, which alternates between silent and active phases of near-steady-state behavior. As shown in Figure 2A, the silent and the active phases correspond to the left L and the right R branches of the cubic, respectively. The transitions between the two phases occur rapidly (and are thus referred to as jumping). Notice that the trajectory of the oscillator in phase space jumps between the two branches and then follows the branches closely, because the small ε induces very different time scales for the x- and y-dynamics. The parameter γ is used to control the ratio of the times that the solution spends in these two phases. For a larger value of γ, the solution spends a shorter time in the active phase.

If I ≤ 0 and H = 1, the nullclines of equation 3.1 intersect at a stable fixed point along the left branch of the cubic (see Figure 2B). In this case, equation 3.1 produces no periodic orbit, and the oscillator is referred to as excitable, indicating that the oscillator has not yet been, but can be, excited by stimulation. An excitable oscillator may become oscillatory if it receives, through the term S, large enough coupling from other oscillators. Because of this dependency on external stimulation, the oscillations are stimulus dependent. We say that the oscillator is stimulated if I > 0, and unstimulated if I ≤ 0. The parameter β specifies the steepness of the sigmoid function and is chosen to be small. The oscillator model (see equation 3.1) may be interpreted as a model of the spiking behavior of a single neuron, as the envelope of a bursting neuron, or as a mean-field approximation to a network of excitatory and inhibitory binary neurons.

The primary difference between equation 3.1 and the model in Terman and Wang (1995) is the introduction of the Heaviside function, in which α > 0 and 0 < θ < 1. The parameter α is chosen to be on the same order of magnitude as ε, so that the exponential function decays on a slow time scale. It is the Heaviside term that allows the network to distinguish between major blocks and noisy fragments. The basic idea is that a major block must contain at least one oscillator, denoted as a leader, which lies in the center of a large, homogeneous region. This oscillator will be able to receive large lateral excitation from its neighborhood. A noisy fragment does not contain such an oscillator. The variable p_i in equation 3.1a determines whether an oscillator is a leader. It is referred to as the lateral potential of the oscillator i, and it satisfies the differential equation

p′_i = λ(1 − p_i) H(Σ_{k∈N(i)} T_ik H(x_k − θ_x) − θ_p) − µ p_i.   (3.2)
Here λ > 0, T_ik is the permanent connection weight (explained later) from oscillator k to i, and N(i) is called the neighborhood of i. If the weighted sum that oscillator i receives from N(i) exceeds the threshold θ_p, p_i approaches 1. If this weighted sum is below θ_p, p_i relaxes to 0 on a time scale determined by µ, which is chosen to be on the same order as ε, resulting in a slow time scale.
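For concreteness, the following sketch integrates a single uncoupled oscillator (equation 3.1 with S_i = ρ = 0 and the Heaviside term fixed at 1) by forward Euler, using the parameter values reported in section 4. The time step, duration, and initial condition are assumptions of this sketch.

```python
import numpy as np

def simulate_oscillator(I=0.2, eps=0.02, gamma=6.5, beta=0.1,
                        dt=0.005, steps=20000):
    """Forward-Euler integration of one uncoupled LEGION oscillator
    (equation 3.1 with S_i = rho = 0 and H = 1).  With I > 0 the
    trajectory settles onto the relaxation limit cycle, alternating
    between silent (left-branch) and active (right-branch) phases."""
    x, y = -2.0, 0.0                      # arbitrary initial condition
    xs = np.empty(steps)
    for t in range(steps):
        dx = 3 * x - x ** 3 + 2 - y + I   # fast excitatory dynamics
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)  # slow inhibition
        x += dt * dx
        y += dt * dy
        xs[t] = x
    return xs   # plotting xs shows the square-wave-like x activity
```

Setting I = 0 in this sketch leaves the trajectory at the stable fixed point on the left branch, the excitable regime described above.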
Figure 2: Nullclines and orbits of a single oscillator. (A) If I > 0 and H = 1, the oscillator is enabled. The periodic orbit is shown with a bold curve, and its direction of motion is indicated by the arrowheads. The left and the right branches of the x-nullcline are labeled L and R, respectively. LK and RK indicate the left and the right knees of the cubic, respectively. (B) If I ≤ 0 and H = 1, the oscillator is excitable. The fixed point PI on the left branch of the cubic is asymptotically stable.
Figure 3: Architecture of a two-dimensional LEGION network with four nearest-neighbor coupling. An oscillator is indicated by an open circle, and the global inhibitor is indicated by the filled circle.
It follows that p_i can exceed the threshold θ in equation 3.1a only if i is able to receive large enough lateral excitation from its neighborhood. In order to develop a high potential, it is not sufficient that a large number of the neighbors of i are oscillatory. They must also have a certain degree of synchrony in their oscillations. In particular, they all must exceed the threshold θ_x at the same time in their oscillations. The purpose of introducing the lateral potential is that an oscillator with a high potential can lead the activation of an oscillator block corresponding to an object. Although a high-potential oscillator need not be stimulated, it must be stimulated in order to play the role of leading an oscillator block; otherwise, the oscillator will not oscillate at all. Thus, we require that a leader always be stimulated. More formally, an oscillator i is defined as a leader if p_i ≥ θ and i is stimulated. The lateral potential of every oscillator is initialized to zero.

The network we study for image segmentation is two-dimensional. Figure 3 shows the simplest case of permanent connectivity, where an oscillator is connected only with its four immediate neighbors, except on the boundaries, where no wraparound is used. Such connectivity forms a two-dimensional grid. In general, however, N(i) should be larger, and the permanent connection weights should fall off with distance according to a gaussian distribution. The coupling term S_i in equation 3.1 is given by

S_i = Σ_{k∈N(i)} W_ik H(x_k − θ_x) − W_z H(z − θ_xz),   (3.3)
where W_ik is the dynamic connection weight from k to i. The neighborhood of the above summation is chosen to be the same as in equation 3.2. In some situations, however, the two neighborhoods should be chosen differently to achieve good results, and an alternative definition with two different neighborhoods is given elsewhere (Wang & Terman, 1996).

Now let us explain permanent and dynamic connection weights. To facilitate synchrony and desynchrony, we assume that there are two kinds of synaptic weights (links) between two oscillators, following von der Malsburg, who argued for their neurobiological plausibility (von der Malsburg, 1981; von der Malsburg & Schneider, 1986; see also Crick, 1984). The permanent weight, T_ik, embodies the hardwired structure of a network. The dynamic weight, W_ik, on the other hand, changes rapidly. W_ik is formed on the basis of T_ik according to the mechanism of dynamic normalization (Wang, 1995). Dynamic normalization was previously defined as a two-step procedure: first update the dynamic links and then normalize (Wang, 1995; Terman & Wang, 1995). There are different ways to realize such normalization. In the following, we give one way to implement dynamic normalization in differential equations:

u′_i = η(1 − u_i) I_i − ν u_i   (3.4a)

W′_ik = W_T T_ik u_i u_k − W_ik Σ_{j∈N(i)} T_ij u_i u_j.   (3.4b)
The function u_i measures whether oscillator i is stimulated, and it is initialized to 0. The parameter η determines the rate of updating u_i. When I_i > 0, u_i → 1 quickly because we choose η ≫ ν (see below); otherwise, when I_i = 0, u_i = 0. For this equation we assume I_i = 0 if oscillator i is unstimulated (otherwise it is easy to enforce this by applying a step function on I_i). The parameter ν is chosen to be on the same order as ε, so that u_i slowly relaxes back to 0 after the external stimulus is withdrawn. We assume that W_ik are initialized to 0 for all i and k. It is easy to see that if oscillator i is unstimulated, W_ik remains 0 for all k, and if oscillator k is unstimulated, W_ik = 0 for all i. Otherwise, if u_i = 1 and u_k = 1 for at least one k ∈ N(i), then at equilibrium,

W_ik = W_T T_ik u_i u_k / Σ_{j∈N(i)} T_ij u_i u_j   and   Σ_{k∈N(i)} W_ik = W_T.

Thus the total dynamic weight converging on a single oscillator equals W_T, which gives the desired normalization. Notice that dynamic weights, not permanent weights, participate in determining S_i (see equation 3.3). Moreover, W_ik can be properly set up in one step at the beginning, based on external stimulation, which should be useful for engineering applications. Weight normalization is not a necessary condition for the selective gating mechanism to work. This conclusion has been established previously (Terman & Wang, 1995).
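The one-step setup mentioned above amounts to evaluating the equilibrium expression directly; a minimal sketch (the function and argument names are ours, and W_total = 8.0 follows the value of W_T used in section 4):

```python
import numpy as np

def normalize_weights(T, stimulated, W_total=8.0):
    """One-step setup of the dynamic weights at the equilibrium of
    equation 3.4: W[i, k] = W_total * T[i, k] u_i u_k / sum_j T[i, j] u_i u_j,
    so the dynamic weights converging on each stimulated oscillator sum
    to W_total.  T is the (n, n) permanent weight matrix; `stimulated`
    is a boolean vector playing the role of u at its fixed point."""
    u = stimulated.astype(float)
    num = T * np.outer(u, u)                 # T_ik u_i u_k
    den = num.sum(axis=1, keepdims=True)     # sum_j T_ij u_i u_j
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, W_total * num / den, 0.0)
```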
With normalized weights, however, the quality of synchronization within each oscillator block is better (Terman & Wang, 1995).

In equation 3.3, W_z is the weight of inhibition from the global inhibitor z, whose activity, also denoted by z, is defined as

z′ = φ(σ_∞ − z),   (3.5)
where σ_∞ = 1 if x_i ≥ θ_zx for at least one oscillator i, and σ_∞ = 0 otherwise. Hence θ_zx represents another threshold, and it is chosen so that only an oscillator jumping to the active phase can trigger the global inhibitor. If σ_∞ equals 1, z → 1. The parameter φ represents the rate at which the inhibitor reacts to stimulation from the oscillator network.

The introduction of a lateral potential provides a solution to the problem of fragmentation. There is an initial period when the term exp(−αt) exceeds the threshold θ. During this period, every stimulated oscillator is enabled. This allows the leaders to receive sufficient lateral excitation so that they can achieve a high potential. After this initial period, the only oscillators that can jump up without stimulation from other oscillators are the leaders. When a leader jumps up, it spreads its activity to other oscillators within its own block, so they also can jump up. These oscillators are referred to as followers. Oscillators not in this block are prevented from jumping up because of the global inhibitor. The oscillators that belong to the noisy fragments will not be able to jump up beyond the initial period, because these oscillators will not be able to develop a sufficiently high potential by themselves and they cannot be recruited by leaders. These oscillators are referred to as loners. In order to be oscillatory beyond the initial time period, an oscillator must be either a leader or a follower. This indicates that the oscillator is not part of a noisy fragment, because noisy fragments in an image tend to be small and isolated (see Figure 1). The collection of all noisy regions whose corresponding oscillators are loners is called the background, which is not a uniform region and is generally discontiguous.

We have proven a number of rigorous results concerning the system of equations 3.1 to 3.5. Our main result implies that the loners will no longer be able to oscillate after an initial time period. Moreover, the asymptotic behavior of a leader or a follower is precisely the same as in the network obtained by simply removing all the loners. Together with the results in Terman and Wang (1995), this implies that after a number of oscillation cycles, a block of oscillators corresponding to a single major image region will oscillate in synchrony, while any two oscillator blocks corresponding to two major regions will desynchronize from each other. Also, the number of cycles required for full segmentation is no greater than the number of major regions plus one. The details of the analysis are given in Wang and Terman (1996).
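To make the extended model concrete, the following is a minimal forward-Euler sketch of equations 3.1 through 3.5 on a small grid, using the parameter values reported in section 4. It is an illustration of the dynamics, not the solver used in this article (a fourth-order Runge-Kutta method and LSODE; see section 4); the time step, duration, recording interval, and random seed are assumptions.

```python
import numpy as np

def legion(stim, steps=40000, dt=0.002):
    """Forward-Euler sketch of the extended LEGION equations 3.1-3.5 on
    an h x w grid with 4-nearest-neighbor permanent connections (no
    wraparound).  `stim` is a boolean (h, w) array of stimulated
    oscillators.  Parameter values follow section 4."""
    eps, alpha, beta, gamma = 0.02, 0.005, 0.1, 6.5
    theta, lam, mu = 0.9, 0.1, 0.01
    theta_x, theta_p = -0.5, 7.0
    theta_zx = 0.1                      # theta_zx = theta_xz = 0.1
    Wz, WT, phi, rho, I_on = 1.5, 8.0, 3.0, 0.02, 0.2

    h, w = stim.shape
    n = h * w
    u = stim.ravel().astype(float)      # u at its fixed point
    T = np.zeros((n, n))                # permanent weights: 2.0 to 4-neighbors
    for r in range(h):
        for c in range(w):
            for rr, cc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                if 0 <= rr < h and 0 <= cc < w:
                    T[r * w + c, rr * w + cc] = 2.0
    num = T * np.outer(u, u)            # one-step normalization (eq. 3.4)
    den = num.sum(axis=1, keepdims=True)
    W = np.divide(WT * num, den, out=np.zeros_like(num), where=den > 0)

    I = I_on * u
    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, n)           # random initial condition
    y = np.zeros(n)
    p = np.zeros(n)
    z = 0.0
    rec = []
    for t in range(steps):
        active = (x >= theta_x).astype(float)          # H(x_k - theta_x)
        S = W @ active - Wz * (z > theta_zx)           # equation 3.3
        enable = p + np.exp(-alpha * t * dt) - theta >= 0
        noise = rho * (rng.standard_normal(n) - 1.0)   # amplitude rho, mean -rho
        dx = 3 * x - x ** 3 + 2 - y + I * enable + S + noise      # eq. 3.1a
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)          # eq. 3.1b
        dp = lam * (1 - p) * ((T @ active - theta_p) >= 0) - mu * p  # eq. 3.2
        dz = phi * (float((x >= theta_zx).any()) - z)             # eq. 3.5
        x += dt * dx
        y += dt * dy
        p += dt * dp
        z += dt * dz
        if t % 50 == 0:
            rec.append(x.copy())
    return np.array(rec)                # subsampled snapshots of x
```

For a small stimulus array containing a few connected blocks plus scattered noisy squares, plotting the returned traces should qualitatively reproduce the behavior of Figures 4 and 5: synchrony within each block, desynchrony between blocks, and loners that fall silent after the initial enabling period.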
The analysis in Wang and Terman (1996) is constructive in the sense that it leads to precise estimates that the parameters must satisfy. It shows that the results hold for a robust range of parameter values. Moreover, the analysis does not depend on the precise form of the nonlinear functions in equation 3.1. The specific cubic and sigmoid functions (see Figure 2) are used because of their simplicity. In addition to the parameter description given earlier, we require that 0 < θ < 1, and that α be chosen so that all the stimulated oscillators remain enabled for the first cycle, but only leaders remain enabled during the second cycle. In equation 3.2, we simply require that λ be on the same order of magnitude as 1, and 0 < θ_p < 1. The parameter η in equation 3.4 simply needs to be on the same order as 1. There are alternative ways of defining the model without affecting its essential dynamics (Wang & Terman, 1996). In particular, we have given a definition where dynamic normalization of connection strengths in equation 3.4 is not needed, but the quality of synchrony within each block and the flexibility in choosing parameters seem somewhat lessened.

4 Computer Simulation

To illustrate how the LEGION network is used for image segmentation while eliminating fragmentation, we have simulated a 50 × 50 grid of oscillators with a global inhibitor, as defined by equations 3.1 through 3.5. We map the three objects (designated as the sun, a tree, and a mountain) in Figure 1, and then add 20% noise, so that each uncovered square has a 20% chance of being covered (stimulated). The resulting image is shown in Figure 4A. In the simulation, N(i) is the four nearest neighbors, without boundary wraparound. For all the stimulated oscillators I = 0.2, while for the others I = 0. Notice that if oscillator i is unstimulated, W_ik = W_ki = 0 for all k, and I_i does not need to be negative to prevent i from oscillating. The amplitude ρ of the gaussian noise was set to 0.02. This represents a 10 percent noise level compared to the external stimulation. We observed during the simulations that noise facilitated the process of desynchronization. The differential equations 3.1 through 3.5 were solved using both a fourth-order Runge-Kutta method and the adaptive-grid ordinary differential equation solver LSODE.

Permanent connections between any two neighboring oscillators were set to 2.0, and for the total dynamic connections (see equation 3.4b), W_T = 8.0. Dynamic weights W_ik were set up at the beginning according to equation 3.4. The following values for the other parameters in equations 3.1 through 3.5 were used: ε = 0.02, α = 0.005, β = 0.1, γ = 6.5, θ = 0.9, λ = 0.1, θ_x = −0.5, θ_p = 7.0, W_z = 1.5, η = 1.0, µ = ν = 0.01, φ = 3.0, and θ_zx = θ_xz = 0.1. The value of θ_p was chosen so that in order to achieve a high potential, an oscillator must have all four neighbors active. The simulation results were robust to considerable changes in the parameter values.

Figures 4B–E show the instantaneous activity (snapshots) of the network at various stages of dynamic evolution.
Figure 4: (A) An image composed of three patterns on a noisy background. The image is mapped to a 50 × 50 LEGION network. Each square corresponds to an oscillator. If a square is entirely covered, the corresponding oscillator receives external input; otherwise, the oscillator receives no external input. In the figure, B–E correspond to the case with the inclusion of the lateral potential, whereas F–J correspond to the case without the lateral potential. (B) A snapshot at the beginning of dynamic evolution. (C–E) Snapshots subsequently taken shortly after B. (F) A snapshot at the beginning of dynamic evolution for the case without the lateral potential. (G–J) Snapshots subsequently taken shortly after F.
The diameter of each black circle represents the x activity of the corresponding oscillator. Specifically, if the range of x values over all the oscillators is given by xmin and xmax, then the diameter of the black circle for an oscillator with activity x is proportional to (x − xmin)/(xmax − xmin). Figure 4B shows the snapshot at the beginning of the dynamic evolution, included to illustrate the random initial conditions. Figure 4C shows a snapshot taken shortly after Figure 4B. The effect of synchrony and desynchrony is clear: all the stimulated oscillators that belong to the sun or are its neighbors are entrained and have large activities (they are in the active phase). At the same time, the oscillators stimulated by the rest of the image have very small activities (they are in the silent phase). Thus the noisy sun is segmented from the rest of the image. A short time later, as shown in Figure 4D, the oscillators in the group representing the noisy tree reach their active phase and are separated from the rest of the image. Figure 4E shows another snapshot, in which the noisy mountain has its turn to be activated and separated from the rest of the input. This successive "pop-out" of the segments continues in a stable periodic fashion until the input image is withdrawn.

To illustrate the entire segmentation process, Figure 5 shows the temporal evolution of every stimulated oscillator. The activities of the oscillators stimulated by each noisy object are combined into one trace in the figure, as are those of the background, consisting of all the scattered dots. Since the oscillators receiving no external stimulation remain excitable but unable to oscillate throughout the simulation, they are excluded from the display. The three upper traces represent the activities of the three oscillator blocks, and the fourth represents the background. Because of their low potentials, these background oscillators quickly become excitable, even though they are enabled at the beginning. The bottom trace represents the activity of the global inhibitor. The synchrony within each block and the desynchrony between different blocks are clearly established after three cycles.

To illustrate the role of the lateral potential, the same network with the same input (see Figure 4A) and the same initial condition was simulated without the lateral potential. In this case, the Heaviside function in equation 3.1a is always 1 for every oscillator. With the random initial condition shown in Figure 4F, the network reaches a stable oscillatory behavior with four segments after fewer than three cycles. The four segments are shown in Figures 4G–J. Without the lateral potential, the network cannot distinguish the major image regions from the noisy fragments, nor can it separate the major regions from one another.

With a fixed set of parameters, the dynamical system of LEGION can segment only a limited number of patterns. This number depends to a large extent on the ratio of the times that a single oscillator spends in the silent and active phases. Let us refer to this limit as the segmentation capacity of LEGION. In the above simulation, the number of major blocks to be segmented is within the segmentation capacity. What happens if this number exceeds the segmentation capacity?
Figure 5: Temporal evolution of every stimulated oscillator. The upper three traces show the combined x activities of the three oscillator blocks representing the three corresponding patterns indicated by their respective labels. The fourth trace shows the temporal activities of the loners, and the bottom trace shows the activity of the global inhibitor. The ordinates indicate the normalized activity of an oscillator or the inhibitor. The simulation took 9000 integration steps.
From the analysis in Wang and Terman (1996), we know that the system will separate the entire image into as many segments as the capacity allows, where each segment may correspond to one major block (called a simple segment) or to a number of major blocks (called a congregate segment). To illustrate this point, we present to a 30 × 30 LEGION network an arbitrary image containing nine binary patterns, which together form the phrase OHIO STATE, as shown in Figure 6A. We then add 10 percent random noise to the input in the same way as in Figure 4, resulting in Figure 6B. We use the same parameter values as in the simulations of Figure 4, except that γ = 8.0. For this set of parameters, our earlier experiments showed that the system's segmentation capacity is less than nine. The simulation results are presented in Figures 6C–H. Shortly after the start of system evolution, the LEGION network segmented the input of Figure 6B into five segments, shown in Figures 6D–H, respectively. Among these five segments, three are simple segments (6D, 6E, and 6H) and two are congregate segments (6F and 6G). Besides the run shown in Figure 6, many other simulations have been performed on the input of Figure 6B with different random initial conditions, and the results are comparable with Figure 6. There are different ways, however, in which the system separates the nine noisy patterns into five segments.
Figure 6: (A) An image composed of nine patterns mapped to a 30 × 30 LEGION network. See the legend of Figure 4 for explanations. (B) The image in A is corrupted by 10 percent noise. (C) A snapshot of network activity at the beginning of dynamic evolution. (D–H) Snapshots subsequently taken shortly after C.
For this particular set of parameters, the segmentation capacity of the LEGION network is five; in fact, we have not seen a single simulation trial in which more than five segments are produced. This important property of the system, that it naturally exhibits a segmentation capacity, is in good accord with the well-known psychological principle that there are fundamental limits on the number of simultaneously perceived objects.
5 Real Images

LEGION can segment gray-level images in a way similar to segmenting binary images. For a given image, a LEGION network of the same size as the image, with a global inhibitor, is used to perform segmentation. Each pixel of the image corresponds to an oscillator of the network, and we assume that every oscillator is stimulated when the image is applied to the network. The main difference between gray-level and binary images lies in how the connections are set up. For gray-level images, the coupling strength between two neighboring oscillators is determined by the similarity of the two corresponding pixels. This simple way of setting up the coupling strength addresses only the grouping principles of proximity, connectedness, and similarity (cf. section 1).

5.1 Algorithm. Segmenting real images with large numbers of pixels involves integrating a large number of differential equations (equations 3.1–3.5). To reduce the numerical computation on a serial computer, an algorithm is extracted from these equations. The algorithm follows the major steps in the numerical simulation of the equations, and it exhibits the essential properties of relaxation oscillators, such as two time scales (fast and slow) and synchrony and desynchrony in a population of oscillators. Such extraction is quite straightforward because, in a relaxation oscillator network, much of the dynamics takes place when oscillators are jumping up or jumping down. In addition, the algorithm overcomes the segmentation capacity limit, which may be desirable in some applications. More specifically, the following approximations have been made:

1. When no oscillator is in the active phase (see Figure 2), the leader closest to the jumping point (the left knee) among all enabled oscillators is selected to jump up to the active phase.

2. An oscillator takes one time step to jump up to the active phase if the net input it receives from neighboring oscillators and the global inhibitor is positive.

3. The alternation between the active phase and the silent phase of a single oscillator takes only one time step.

4. All of the oscillators in the active phase jump down if no more oscillators can jump up. This situation occurs when the oscillators stimulated by the same pattern have all jumped up.

LEGION algorithm. Only the x value of oscillator i, xi, is used in the algorithm. N(i) is assumed to be the set of eight nearest neighbors of i, without wraparound. LKx, RKx, and LCx represent the x values of three of the four corner points of a typical limit cycle (see Figure 2A), where LC denotes the (upper) left corner of the limit cycle. By straightforward calculation, we obtain
LKx = −1, LCx = −2, RKx = 1. In the algorithm, Ii indicates the value of pixel i, and IM indicates the maximum possible pixel value.

1. Initialize.
   1.1 Set z(0) = 0.
   1.2 Form effective connections:

       Wik = IM/(1 + |Ii − Ik|),   k ∈ N(i).

   1.3 Find leaders:

       pi = H[ Σk∈N(i) Wik − θp ].

   1.4 Place all the oscillators randomly on the left branch; that is, xi(0) takes a random value between LCx and LKx.

2. Find one oscillator j such that (1) xj(t) ≥ xk(t), where k ranges over the leaders (pk = 1) currently on the left branch, and (2) pj = 1. Then set

       xj(t + 1) = RKx;  z(t + 1) = 1   {jump up}
       xk(t + 1) = xk(t) + (LKx − xj(t)),   for k ≠ j and pk = 1.

   In this step, the leader on the left branch closest to the left knee is selected. This leader jumps up to the right branch, and all the other leaders move toward LK.

3. Iterate until stop:

       if (xi(t) = RKx and z(t) > z(t − 1))   {stay on the right branch}
           xi(t + 1) = xi(t)
       else if (xi(t) = RKx and z(t) ≤ z(t − 1))   {jump down}
           xi(t + 1) = LCx;  z(t + 1) = z(t) − 1
           if (z(t + 1) = 0) go to step 2
       else
           Si(t + 1) = Σk∈N(i) Wik H(xk(t)) − Wz H(z(t) − 0.5)
           if (Si(t + 1) > 0)   {jump up}
               xi(t + 1) = RKx;  z(t + 1) = z(t) + 1
           else   {stay on the left branch}
               xi(t + 1) = xi(t)
In comparing this algorithm with the dynamical system of equations 3.1 through 3.5, one can identify the following simplifications:

1. The dynamic weight Wik is set directly to Wik = IM/(1 + |Ii − Ik|).
The intuitive reason for this choice of weights is that the more similar pixels i and k are to each other, the stronger the connection between the two corresponding oscillators. It is worth noting that the algorithm does not compute normalized weights. As mentioned in section 3, selective gating can still take place with this weight setting, even though the weights are not normalized.

2. The leaders are chosen during initialization. According to the dynamics described in section 3, lateral potentials, and thus leaders, are determined during a few initial cycles of oscillatory dynamics. Since every oscillator is stimulated and Wik is fixed at the outset, it can be predicted precisely which oscillators will become leaders. Thus, to save computation time, the leaders are determined in the initialization step. It should be clear that the number of leaders determined in this step does not correspond to the number of resulting segments; a major image region (segment) may generate many leaders.

There are two critical parameters in the algorithm: Wz and θp, where the former is the strength of global inhibition and the latter is the threshold for forming high potentials (leaders). For Wz, higher values make it more difficult for the algorithm to group pixels into regions; in order for a region to be grouped together, the algorithm demands a higher degree of homogeneity within the region. Generally, given a gray-level image, a higher Wz leads to more and smaller regions. For θp, higher values make it more difficult to develop leaders; fewer leaders will be developed, and fewer regions will result from the algorithm. On the other hand, regions produced with a higher θp tend to be more homogeneous. For image segmentation applications, it suffices to stop the algorithm when every leader has jumped up once. (See Wang & Terman, 1996, for further discussion of the algorithm.)

5.2 Segmenting Real Images.

5.2.1 Summation versus Maximization: An Aerial Image. The first image the algorithm is tested on is an aerial image, called Lake, shown in Figure 7A. Like the other images used below, this is a typical gray-level image: each pixel is an 8-bit number ranging from 0 to 255 (also called intensity), and pixels with higher values appear brighter. The image has 160 × 160 pixels and is presented to a LEGION network of 160 × 160 oscillators. For this simulation, Wz = 40 and θp = 1200. Quickly after the image is presented, the algorithm produces different segments at different time steps. Figures 7B–G display the first six segments produced, in sequence; a black pixel corresponds to an oscillator in the active phase, and a blank pixel corresponds to an oscillator in the silent phase. Each segment in the figure corresponds to a meaningful region in the original image: a segment is a lake, a field, or a parkway. The region in Figure 7B corresponds to a lake.
Figure 7: (A) A gray-level image consisting of 160 × 160 pixels (courtesy of K. Boyer). (B–G) Segments popped out subsequently from the network shortly after the LEGION algorithm is executed.
The region in Figure 7C corresponds to the main lake, except for the lower-left part and an area on the right side, where the segment extends into nonlake parts. The parkway segment in Figure 7G picks up part of the parkway network in the original image. The other segments match well with the fields of the image. The entire image is separated into 16 regions and a background. To simplify the display, we put all the segments and the background together into one figure, using gray levels to indicate the phases of the oscillator blocks. Such a display is called a gray map. The gray map of the results of this simulation is shown in Figure 8A, where the background is shown by the black scattered areas.
Figure 8: (A) A gray map showing the result of segmenting Figure 7A. The algorithm produces 16 segments plus a background. (B) The result of another segmentation using maximization to compute Si . The algorithm produces 17 segments plus a background. (C) The result of another segmentation similar to B but with a different value for θp . The system produces 23 segments plus a background. The algorithm was run for 1000 steps for every case.
Generally, the background corresponds to parts of the image that have high intensity variations. Owing to the mechanism of fragment removal, these noisy regions stay in the background, rather than yielding the many small segments that would result without fragment removal.

In the above algorithm, an oscillator summates all the input from its neighborhood, and if the overall input is greater than the global inhibition, the oscillator jumps to the active phase (see also equation 3.3). Another reasonable way of grouping is to replace summation by maximization when computing Si:
    Si = max_{k∈N(i)} {Wik H(xk − θx)} − Wz H(z − θxz).   (5.1)
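As an aside, the two coupling rules are easy to compare in code. The sketch below is an illustrative fragment, not the authors' implementation: W[i] is assumed to map each neighbor k ∈ N(i) to the effective weight Wik, active[k] stands for the Heaviside test on xk, and the inhibition thresholds (0.5 in the algorithm, θxz in equation 5.1) are folded into the boolean argument inhibited.

    def coupling_sum(i, W, active, inhibited, Wz):
        # Summation rule: S_i = sum_k W_ik H(x_k) - Wz H(z - 0.5).
        return sum(w for k, w in W[i].items() if active[k]) \
            - (Wz if inhibited else 0.0)

    def coupling_max(i, W, active, inhibited, Wz):
        # Maximization rule (equation 5.1): only the strongest active
        # neighbor of i contributes to the excitation.
        best = max((w for k, w in W[i].items() if active[k]), default=0.0)
        return best - (Wz if inhibited else 0.0)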
Intuitively, the maximum operation concentrates on the relation between i and the oscillator in N(i) that has the strongest coupling with i, but it ignores the relation between i and N(i) as a whole. Thus, grouping by maximization emphasizes pairwise pixel relations, whereas grouping by summation emphasizes pixel relations in a local field. Using equation 5.1 in the LEGION algorithm, the Lake image is segmented again. In this simulation, Wz = 20 and θp = 1200. Figure 8B shows the result of segmentation as a gray map. The entire image is segmented into 17 regions and a background, which is indicated by the black areas. Each segment corresponds well to a relatively homogeneous region in the image. Interestingly, except for the parkway region in the lower part of the image, every region in Figure 8B has a corresponding one in Figure 8A.

A comparison between the two figures reveals the difference between summation and maximization in segmentation. A closer comparison indicates that the maximum scheme yields slightly more faithful regions. On the other hand, regions in Figure 8A appear smoother and have fewer "black holes" (parts of the background). The smoothing effect of summation is generally positive, but it may lose important details. For example, the sizable hole inside the main lake region of Figure 8B corresponds to an island in the original image, which is missed in the main lake region of Figure 8A. Another distinction is that grouping in the maximum scheme is symmetrical, in the sense that if pixel a can recruit pixel b, then b can recruit a as well. This is because the effective weights are symmetrical, namely, Wik = Wki (see the algorithm). Because the maximum scheme appears to produce better results, it is used in all of the following simulations.

To show the effects of the parameters, we reduce the value of θp from 1200 in Figure 8B to 1000. As a result, more regions are segmented, as shown in Figure 8C, where 23 segments plus a background are produced by the algorithm. Compared with Figure 8B, the notable new segments include an open-theater-like region to the left of the main lake and a nearby field region. The Lake image has also been used by Sarkar and Boyer (1993a), whose approach is edge based; readers are encouraged to compare our results with theirs.

5.2.2 MRI Images. The next image used to test our algorithm is an MRI (magnetic resonance imaging) image of a human head (see Figure 9A). MRI images constitute a large class of medical images, and their automatic processing is of great practical value. This particular image, which we denote Brain-1, is a midsagittal section consisting of 257 × 257 pixels.
Salient regions of this picture are the cerebral cortex, the cerebellum, the brainstem, the corpus callosum, the fornix (the bright stripe below the corpus callosum), the septum pellucidum (the region surrounded by the corpus callosum and the fornix), the extracranial soft tissue (the bright stripe on top of the head), the bone marrow (scattered stripes under the extracranial tissue), and several other structures (for the nomenclature, see Kandel, Schwartz, & Jessell, 1991, p. 318).

For this image, a LEGION network of 257 × 257 oscillators is used, with Wz = 25 and θp = 800. Figure 9B shows the result of one simulation as a gray map. The Brain-1 image is segmented into 21 regions plus a background, indicated by the black areas. Of particular interest are two parts of the brain: (1) the upper part and (2) the brainstem together with part of the spinal cord ("brainstem" for short). Other segments include parts of the extracranial tissue and parts of the bone marrow, as well as the neck, the chin, the nose, and a vertebral segment.

Although it is useful to treat the brain as a whole in some circumstances, it may also be desirable to segment the brain into more detailed structures. To achieve this in LEGION, one can increase Wz; but in order not to produce too many regions, θp should usually be increased as Wz increases. To show the combined effects, Figure 9C displays the result of another run, with Wz = 40 and θp = 1000, in which Brain-1 is segmented into 25 regions plus a background. Now the upper part of the brain is further segmented into the cerebral cortex, the cerebellum, the callosum-fornix region, and the surrounding septum. Because of the higher Wz, regions in Figure 9C tend to contain more background (compare, for example, the two brainstem regions). However, in the cortex segment of Figure 9C, the noisy stripes actually have physical meaning; they tend to match various fissures on the cerebral cortex. With the higher θp, some segments in Figure 9B cannot generate any leaders and thus join the background.

The final segmentation uses another MRI image of a human head, shown in Figure 10A. This image, denoted Brain-2, consists of 257 × 257 gray-level pixels and is a sagittal section through one eye. Salient regions of this picture are the cortex, the cerebellum, the lateral ventricle (the black hole within the cortex), the eye, the sinus (the black hole below the eye), the extracranial soft tissue, and the bone marrow. A LEGION network with 257 × 257 oscillators is used for the segmentation task. In the first simulation, Wz = 20 and θp = 800; Figure 10B shows the result as a gray map. Brain-2 is segmented into 17 regions plus a background, indicated by the black scattered areas. One can see from the figure that the entire brain forms a single segment. Other significant segments are the eye, the sinus, parts of the bone marrow, and parts of the extracranial tissue. The lateral ventricle is put into the background.

In order to generate finer structures, Wz is raised to 35 in the second simulation; as in Figure 9, θp is increased to 1000. Figure 10C shows the result of this simulation, in which Brain-2 is segmented into 13 regions plus a background.
Figure 9: (A) A gray-level image consisting of 257 × 257 pixels (courtesy of N. Shareef). (B) A gray map showing the result of segmenting the image by a 257 × 257 LEGION network. The system produces 21 segments plus a background. (C) The result of another segmentation with different values of Wz and θp . The system produces 25 segments plus a background. The algorithm was run for 1200 steps in both B and C.
As expected, the segments in Figure 10B become further segmented or shrunken, and the background becomes more extensive. It is worth mentioning that the brain segment of Figure 10B is now divided into three segments: one corresponding to the cortex and the other two corresponding to the cerebellum. Because of the increase in θp, the segments corresponding to the extracranial tissue and the marrow in Figure 10B disappear in Figure 10C.
Figure 10: (A) A gray-level image consisting of 257 × 257 pixels (courtesy of N. Shareef). (B) A gray map showing the result of segmenting the image by a 257 × 257 LEGION network. The system produces 17 segments plus a background. (C) The result of another segmentation with different values of Wz and θp . The system produces 13 segments plus a background. The algorithm was run for 1200 steps in both B and C.
In the segmentation experiments of this section, our goal was to illustrate the LEGION mechanism and the potential effectiveness of derived algorithms. We did not attempt to produce the best possible results by fine-tuning the parameters; one can see this from the simple rule for setting Wik, the simple choice of N(i), and the values of Wz and θp used in the simulations.
Better results can therefore be expected from more sophisticated schemes for choosing these parameters. Indeed, a more elaborate version of the LEGION algorithm has been applied to segment three-dimensional MRI and CT (computerized tomography) images, and good segmentation results have been obtained (Shareef et al., in preparation).

6 Discussion

6.1 Further Remarks on LEGION Computation. In the simulations of section 5, N(i) is set to the eight nearest neighbors of i. A larger N(i) entails more computation for determining leaders, but it also offers more flexibility in specifying the conditions for creating leaders, which tends to produce better results. Also, the order in which different segments pop out is currently random. In some situations, however, it might be useful to influence the pop-out order by some criterion, such as the size of each region. LEGION may incorporate different criteria by ordering leaders accordingly, for example, by using the global inhibitor in different ways.

The main difference between our approach to image segmentation and the other segmentation algorithms reviewed in section 2 is that ours is neurocomputational, building on the strengths and the constraints of neural computation. Our approach relies on the emergent behavior of LEGION to embody the computation involved in image segmentation. Methodologically, the computation is performed by a population of active agents (oscillators, in this case) that are driven by pixels, whereas in a typical segmentation algorithm, pixels are data to be processed by a central agent: the algorithm or a neural network trained as a pixel classifier. That LEGION is a massively parallel network of dynamical systems with mainly local coupling makes it particularly suitable for analog very large scale integration (VLSI) implementation, the success of which would be a major step toward real-time scene segmentation.

A thorny issue with scene segmentation is that often no unique answer exists. A house, for example, may be grouped into a single segment if viewed from afar; the same house, viewed up close, may be broken into multiple segments, including a door, a roof, windows, and other elements. This situation demands a flexible treatment of scene segmentation: a system should be able to generate multiple segmentations easily, each of which is reasonable. In LEGION, this flexibility is reflected to a certain degree by the effects of the parameters Wz and θp, as discussed in section 5. As noted there, both Figure 9B and Figure 9C offer arguably reasonable results.

The LEGION network used so far has only one layer of oscillators. In section 4, we noted that the oscillatory dynamics of one-layer LEGION has a limited segmentation capacity (see Figure 6). It is interesting to note that the human perceptual system is also limited in the number of objects it can simultaneously attend to in a scene (Miller, 1956).
We expect that the ability of LEGION improves significantly when multiple layers are used in subsequent stages. Such multistage processing provides a natural way out of this fundamental limitation: each layer need not segregate more than several segments, and yet the system as a whole can segregate many more segments than the segmentation capacity, an idea reminiscent of the chunking proposed by Miller (1956). When the number of segments in an image exceeds the segmentation capacity, one-layer LEGION will produce a number of segments (simple or congregate) up to the segmentation capacity (see Figure 6). Congregate segments, however, can be further segmented by another layer of LEGION, whereas simple segments will not segment further. With multistage processing, the hierarchical system can provide results of both coarse- and fine-grain segmentation.

6.2 Biological Relevance. The relaxation-type oscillator used in LEGION is dynamically very similar to numerous other oscillators used in modeling neuronal behavior. Examples include the FitzHugh-Nagumo equations (FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962) and the Morris-Lecar model (Morris & Lecar, 1981), all of which can be viewed as simplifications of the Hodgkin-Huxley equations (Hodgkin & Huxley, 1952). In equation 3.1, the variable x corresponds to the membrane potential of the neuron, and y corresponds to the channel activation or inactivation state variable that evolves on the slowest time scale. The reduction from a full Hodgkin-Huxley model to the two-variable model is achieved by assuming that the other, faster, channel state variables are instantaneous. The dynamics of the lateral potential as given in equation 3.2 has properties similar to those of certain membrane channels and excitatory chemical synapses. The N-methyl-D-aspartate channel, for example, turns off on a slow time scale (Traub & Miles, 1991); moreover, with a sufficiently large input, a cell with these channels can be transformed from the excitable to the oscillatory mode. We note that the lateral potential does not act as a temporal integrator of all the input converging on its corresponding oscillator but instead relies on a sharp nonlinearity, embodied by the outer Heaviside function in equation 3.2.

The theory of oscillatory correlation is consistent with the growing body of evidence supporting the existence of neural oscillations in the visual cortex and other brain regions. In the visual system, synchronous oscillations have been observed in cell recordings of the cat visual cortex (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989). These neural oscillations are stimulus dependent and range from 30 to 70 Hz, often referred to as 40-Hz oscillations. Moreover, synchronous oscillations (locking with zero phase lag) occur across an extended brain region only if the stimulus constitutes a coherent object. These basic findings have been confirmed repeatedly in different brain regions and in different animal species (for reviews, see Buzsáki, Llinás, Singer, Berthoz, & Christen, 1994; and Singer & Gray, 1995).

The local excitatory connections assumed in LEGION conform with various lateral connections in the brain.
In relation to the visual cortex, these excitatory connections, which link the excitatory elements of oscillators, can be interpreted as the horizontal connections in the visual cortex (Gilbert & Wiesel, 1989; Gilbert, 1992). It is known that horizontal connections originate from pyramidal cells, which are excitatory, and that pyramidal cells are also the principal targets of the horizontal connections. Furthermore, at the functional level, physiological recordings from monkeys suggest that motion-based visual segmentation may be processed in the primary visual cortex (Stoner & Albright, 1992; Lamme, van Dijk, & Spekreijse, 1993).

The global inhibitor (see Figure 3) receives input from the entire oscillator network and feeds inhibition back onto the network. It serves to segment the multiple patterns simultaneously present in a visual scene, thus exerting global coordination. Crick (1984) has suggested that part of the thalamus, the thalamic reticular complex in particular, may be involved in the global control of selective attention. The thalamus is uniquely located in the brain: it receives input from, and sends projections to, almost the entire cortex. This suggestion, together with key anatomical and physiological properties of the thalamus, prompts us to speculate that the global inhibitor might correspond to a neuronal group in the thalamus, with the activity of the global inhibitor interpreted as the collective behavior of that neuronal group.

6.3 Figure-Ground Segregation. The dynamics proposed in this article separates a scene into a number of major segments and a background, which corresponds to the rest of the scene. The major segments combine to form the foreground, whose corresponding oscillators remain oscillatory until the input scene fades away. After a brief initial period, the oscillators corresponding to the background become excitable and stop oscillating. This dynamics effectively eliminates noisy fragments without either prior smoothing or postprocessing to remove small regions, the methods often used in segmentation algorithms. With this dynamics, typical figure-ground segregation can be characterized as a special case in which only one major segment is allowed to separate from the scene. In this sense, we claim that our dynamics also provides a potential solution to the problem of figure-ground segregation. We allow a foreground to include multiple segments so that both scene segmentation and figure-ground segregation are incorporated into a unified framework.

6.4 Future Topics. In this study, we have not addressed the role of prior knowledge in image segmentation. For example, when people segment the images of Figures 9A and 10A, they inevitably use their knowledge of human anatomy, which describes, among other things, the relative size and position of major brain regions. A more complete system of image segmentation must address this issue. Beyond prior knowledge, many of the grouping principles outlined in section 1 have not been incorporated into the system, and one of the main future topics is to incorporate more grouping cues. The global inhibitory mechanism will play a key role in overall system coordination: it makes various factors compete with each other, and a final segment forms because of strong binding within the segment.
Our study in this article focuses exclusively on visual segmentation. It should be noted that neural oscillations occur in other modalities as well, including audition (Galambos, Makeig, & Talmachoff, 1981; Ribary et al., 1991) and olfaction (Freeman, 1978). Strikingly, the oscillations in these different modalities show comparable frequencies. A recent study extended LEGION to deal with auditory scene segregation (Wang, 1996). With its computational properties and its biological relevance, the oscillatory correlation approach promises to provide a general neurocomputational theory for scene segmentation and perceptual organization.

Acknowledgments

We thank the two anonymous referees for their constructive comments. D.L.W. was supported in part by ONR grant N00014-93-1-0335, NSF grant IRI-9423312, and ONR YIP Award N00014-96-1-0676. D.T. was supported in part by NSF grant DMS-9423796. We thank S. Campbell for preparing Figure 2 and E. Cesmeli for his assistance in preparing Figure 7.

References

Abeles, M. (1982). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Adams, R., & Bischof, L. (1994). Seeded region growing. IEEE Trans. Pattern Anal. Machine Intell., 16, 641–647.
Barlow, H. B. (1972). Single units and cognition: A neurone doctrine for perceptual psychology. Percept., 1, 371–394.
Buzsáki, G., Llinás, R., Singer, W., Berthoz, A., & Christen, Y. (Eds.). (1994). Temporal coding in the brain. Berlin: Springer-Verlag.
Crick, F. (1984). Function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. USA, 81, 4586–4590.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex. Biol. Cybern., 60, 121–130.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Foresti, G., Murino, V., Regazzoni, C. S., & Vernazza, G. (1994). Grouping of rectilinear segments by the labeled Hough transform. CVGIP: Image Understanding, 58(3), 22–42.
Freeman, W. J. (1978). Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol., 44, 586–605.
Galambos, R., Makeig, S., & Talmachoff, P. J. (1981). A 40-Hz auditory potential recorded from the human scalp. Proc. Natl. Acad. Sci. USA, 78, 2643–2647.
Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990). Boundary detection by constrained optimization. IEEE Trans. Pattern Anal. Machine Intell., 12, 609–628.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9, 1–13.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442.
Gove, A., Grossberg, S., & Mingolla, E. (in press). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neurosci.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent computations. Percept. Psychophys., 38, 141–171.
Grossberg, S., & Wyse, L. (1991). A neural network architecture for figure-ground separation of connected scenic figures. Neural Net., 4, 723–742.
Gurari, E. M., & Wechsler, H. (1982). On the difficulties involved in the segmentation of pictures. IEEE Trans. Pattern Anal. Machine Intell., 4(3), 304–306.
Haralick, R. M. (1979). Statistical and structural approaches to texture. Proc. IEEE, 67, 786–804.
Haralick, R. M., & Shapiro, L. G. (1985). Image segmentation techniques. Comput. Graphics Image Process., 29, 100–132.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Horowitz, S. L., & Pavlidis, T. (1976). Picture segmentation by a tree traversal algorithm. J. ACM, 23, 368–388.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier.
Koffka, K. (1935). Principles of Gestalt psychology. New York: Harcourt.
Koh, J., Suk, M., & Bhandarkar, S. M. (1995). A multilayer self-organizing feature map for range image segmentation. Neural Net., 8, 67–86.
Kohler, R. (1981). A segmentation system based on thresholding. Comput. Graphics Image Process., 15, 319–338.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Lamme, V. A. F., van Dijk, B. W., & Spekreijse, H. (1993). Contour from motion processing occurs in primary visual cortex. Nature, 363, 541–543.
Liou, S. P., Chiu, A. H., & Jain, R. C. (1991). A parallel technique for signal-level perceptual organization. IEEE Trans. Pattern Anal. Machine Intell., 13, 317–325.
Manjunath, B. S., & Chellappa, R. (1993). A unified approach to boundary perception: Edges, textures, and illusory contours. IEEE Trans. Neural Net., 4, 96–108.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev., 63, 81–97.
Milner, P. M. (1974). A model for visual shape recognition. Psychol. Rev., 81(6), 521–535.
Mohan, R., & Nevatia, R. (1992). Perceptual organization for scene segmentation and description. IEEE Trans. Pattern Anal. Machine Intell., 14, 616–635.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Mozer, M. C., Zemel, R. S., Behrmann, M., & Williams, C. K. I. (1992). Learning to segment images using dynamic feature binding. Neural Comp., 4, 650–665.
Murata, T., & Shimizu, H. (1993). Oscillatory binocular system and temporal segmentation of stereoscopic depth surfaces. Biol. Cybern., 68, 381–390.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. IRE, 50, 2061–2070.
Ribary, U., Ioannides, A. A., Singh, K. D., Hasson, R., Bolton, J. P. R., Lado, F., Mogilner, A., & Llinás, R. (1991). Magnetic field tomography of coherent thalamocortical 40-Hz oscillations in humans. Proc. Natl. Acad. Sci. USA, 88, 11037–11041.
Rock, I., & Palmer, S. (1990). The legacy of Gestalt psychology. Sci. Am., 263, 84–90.
Sarkar, S., & Boyer, K. L. (1993a). Integration, inference, and management of spatial information using Bayesian networks: Perceptual organization. IEEE Trans. Pattern Anal. Machine Intell., 15, 256–274.
Sarkar, S., & Boyer, K. L. (1993b). Perceptual organization in computer vision: A review and a proposal for a classificatory structure. IEEE Trans. Syst. Man Cybern., 23, 382–399.
Schalkoff, R. (1992). Pattern recognition: Statistical, structural and neural approaches. New York: Wiley.
Schillen, T. B., & König, P. (1994). Binding by temporal structure in multiple feature domains of an oscillatory neuronal network. Biol. Cybern., 70, 397–405.
Sejnowski, T. J., & Hinton, G. E. (1987). Separating figure from ground with a Boltzmann machine. In M. A. Arbib & A. R. Hanson (Eds.), Vision, brain, and cooperative computation (pp. 703–724). Cambridge, MA: MIT Press.
Shareef, N., Wang, D. L., & Yagel, R. (in preparation). Segmentation of medical images using LEGION.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Ann. Rev. Physiol., 55, 349–374.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.
Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 89, 169–183.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Phys. Rev. A, 43, 6990–7011.
Sporns, O., Tononi, G., & Edelman, G. M. (1991). Modeling perceptual grouping and figure-ground segregation by means of active re-entrant connections. Proc. Natl. Acad. Sci. USA, 88, 129–133.
Stoner, G. R., & Albright, T. D. (1992). Neural correlates of perceptual motion coherence. Nature, 358, 412–414.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Traub, R. D., & Miles, R. (1991). Neuronal networks in the hippocampus. New York: Cambridge University Press.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Tübingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biol. Cybern., 67, 233–242.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Wang, D. L. (1993a). Modeling global synchrony in the visual cortex by locally coupled neural oscillators. In Proc. of 15th Ann. Conf. Cognit. Sci. Soc. (pp. 1058–1063). Boulder, CO.
Wang, D. L. (1993b, August). Pattern recognition: Neural networks in perspective. IEEE Expert, 8, 52–60.
Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Net., 6(4), 941–948.
Wang, D. L. (1996). Primitive auditory segregation based on oscillatory correlation. Cognit. Sci., 20, 409–456.
Wang, D. L., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Comp., 2, 95–107.
Wang, D. L., & Terman, D. (1995a). Locally excitatory globally inhibitory oscillator networks. IEEE Trans. Neural Net., 6(1), 283–286.
Wang, D. L., & Terman, D. (1995b). Image segmentation based on oscillatory correlation. In Proc. of World Congress on Neural Net. (pp. II.521–525).
Wang, D. L., & Terman, D. (1996). Image segmentation based on oscillatory correlation (Tech. Rep. No. 19). Columbus, OH: Ohio State University Center for Cognitive Science.
Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt, II. Psychol. Forsch., 4, 301–350.
Zucker, S. W. (1976). Region growing: Childhood and adolescence. Comput. Graphics Image Process., 5, 382–399.
Received April 1, 1996; accepted August 20, 1996.
Communicated by Steven Zucker
Stochastic Completion Fields: A Neural Model of Illusory Contour Shape and Salience

Lance R. Williams and David W. Jacobs
NEC Research Institute, Princeton, NJ 08540, USA
We describe an algorithm- and representation-level theory of illusory contour shape and salience. Unlike previous theories, our model is derived from a single assumption: that the prior probability distribution of boundary completion shape can be modeled by a random walk in a lattice whose points are positions and orientations in the image plane (i.e., the space that one can reasonably assume is represented by neurons of the mammalian visual cortex). Our model does not employ numerical relaxation or other explicit minimization, but instead relies on the fact that the probability that a particle following a random walk will pass through a given position and orientation on a path joining two boundary fragments can be computed directly as the product of two vector-field convolutions. We show that for the random walk we define, the maximum likelihood paths are curves of least energy, that is, on average, random walks follow paths commonly assumed to model the shape of illusory contours. A computer model is demonstrated on numerous illusory contour stimuli from the literature.

1 Introduction

The completion shape and salience problem is the problem of computing the shape and relative likelihood (as determined by the prior distribution) of the family of curves that potentially connect (or complete) a set of boundary fragments. This is a necessary intermediate step in the solution of the full figural completion problem, which has been previously characterized (Williams & Hanson, 1996) as the problem of building a Huffman-labeled figure representing the boundaries of hidden and visible surfaces (see Figure 1).

The fundamental assumption underlying our work is that the paths followed by a particle undergoing a random walk in a lattice of discrete positions and orientations can be used to model the prior probability distribution of the shape of boundary completions (see also Mumford, 1994). We observe that the activity of a population of neurons with receptive fields tuned to discrete positions and orientations (like early areas of primate visual cortex) can be interpreted as a probability distribution describing a particle's possible location (i.e., the current state of the Markov process defining the random walk).

Neural Computation 9, 837–858 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: Williams and Hanson (1996) have characterized figural completion as the problem of building a Huffman-labeled figure from a set of boundary fragments. A Huffman-labeled figure consists of a set of closed, oriented plane curves (representing surface boundary components) together with an explicit indication of relative depth at crossing points. The labeling scheme is based on Huffman’s concept of “X-ray pictures” of smooth objects (Huffman, 1971). (Left) Input stimulus. (Middle) Boundary fragments (thick) and potential boundary completions (thin) represented by cubic Bezier splines of least energy. The stochastic completion field (introduced in this article) is the corresponding parallel distributed representation. (Right) The human percept, represented as a Huffman-labeled figure. The problem of deriving the corresponding parallel distributed representation is a subject for future research.
Let there exist two subsets of states, P and Q, representing the beginning and ending points of a set of boundary fragments. We call P the set of sources and Q the set of sinks. Our goal is to compute the probability that a particle, initially in state (xp, yp, θp) for some p ∈ P, will in the course of a random walk (representing a prior on completion shape) pass through state (u, v, φ) on its way to state (xq, yq, θq) for some q ∈ Q, for all combinations of u, v, and φ. We call this parallel distributed representation of the family of curves that potentially connect pairs of boundary fragments a stochastic completion field.

To appreciate better the scope of this article, it is useful to identify the phenomena for which we believe the stochastic completion field model can and cannot account. Our primary claim is that the mode, magnitude, and variance of the stochastic completion field are related to the shape, salience, and sharpness of perceived illusory contours. However, the definition of the stochastic completion field says nothing about the specific forms of brightness stimuli that can elicit illusory contours. In particular, we do not describe a comprehensive theory of how raw image brightnesses are transformed into sources and sinks for our diffusion process. Furthermore, the stochastic completion field says nothing about apparent brightness, which is distinct from salience.
Finally, we note that whether a completion is actually perceived cannot be solely a function of local configurational factors; it also depends on the role the completion plays in the larger surface organization (Williams & Hanson, 1996). Specifically, it depends on (1) whether the completion can be incorporated in a consistent way into a Huffman-labeled figure and (2) whether it is nominally visible (modal) or occluded (amodal). Neither of these can be determined by a process that does not consider the topology of the scene. Consequently, the stochastic completion field can be regarded only as a precursor to a true surface representation.

2 Previous Work

The novelty of our method lies in the definition of the stochastic completion field, which explicitly embodies the assumption that the prior distribution of completion shape can be modeled as a random walk. However, other researchers have advanced algorithm- and representation-level theories of figural completion based on orientation fields of various kinds. At the algorithm and representation level, all of these models are outwardly similar, because all derive from similar views of the orientation preference structure of the visual cortex, common assumptions about neural computation, and basic considerations (whether explicit or implicit) such as translational and rotational invariance.

The earliest theory of illusory contour shape is due to Ullman (1976). Ullman hypothesized that the curve used by the human visual system to join two contour fragments is constructed from two circular arcs; each arc is tangent to its sponsoring contour at one end and to the other arc at their point of intersection. From the family of possible curves of this form, the pair that minimizes the total bending energy, E = ∫ κ(s)² ds, where κ(s) is curvature parameterized by arc length, models the shape of the illusory contour. Ullman suggested that illusory contours could be computed in parallel in a network by means of local operations, but he did not implement or test this idea.

Grossberg and Mingolla (1985) describe a model of illusory contour formation that involves repeated convolution with a large-kernel filter. In outward form, this kernel resembles ours, but it does not represent the Green's function of a stochastic process of the sort described by us or by Mumford (1994). The outputs of oppositely oriented filters are combined through a nonlinear operation consisting of thresholding followed by a logical AND function. Grossberg and Mingolla's network is complex, and its convergence properties are difficult to analyze. Part of this complexity clearly stems from their desire to model, in a comprehensive way, the many different forms of stimulus that can elicit illusory contours (e.g., contrast edges of mixed sign, line endings). No other model (including ours) attempts to be this comprehensive.
Parent and Zucker (1989) describe a network algorithm for computing a consistent discrete tangent and curvature field from noisy local tangent and curvature measurements. Their network solves a well-defined relaxation labeling problem in which the goal is to find the most probable assignment of labels subject to a compatibility relation. The labels are discrete tangent and curvature values at every image point, and the compatibility relation is based on co-circularity. Although they do not claim to model illusory contour shape, David and Zucker (1990) use active minimum-energy-seeking contours to locate the valleys (or, equivalently, the mode or ridge lines) of a potential function derived from the output of the Parent and Zucker network. The ridge lines of the potential function represent the integral curves of the tangent field.

Like us, Heitger and von der Heydt (1993) describe a theory of figural completion based on nonlinear combination of the convolutions of "keypoint" images with a fixed set of oriented grouping filters. Significantly, they demonstrate their method both on illusory contour figures like the Kanizsa triangle and on more "realistic" images (e.g., of plants and rocks), with impressive results. Unfortunately, neither the equations defining the filters nor the proposed method of combination is described as a means of computing an underlying function, which makes their method difficult to analyze. Moreover, because their grouping filters are scalar functions, they cannot (even implicitly) model a prior probability distribution of completion shapes in the manner we describe.

Guy and Medioni (1996) have described a method for computing a vector-field representing global image structure from local tangent measurements. As in the Hough transform, the key to their approach is the local summation of a set of global voting patterns. Unlike the Hough transform, however, the accumulator is spatially registered with the image, and the voting pattern is a vector-field, not a scalar-field. Because the voting pattern represents orientations that are co-circular to the tangent measurements, the vector-field is nonstochastic (i.e., deterministic) and cannot model the prior distribution of completion shapes.

Shashua and Ullman (1988) describe a network algorithm for computing "perceptual saliency," although they do not portray it as a faithful model of human visual processing. Computing salience (as they define it) requires finding the energy of the minimum-energy curve of length n beginning at every position and orientation. This energy function combines a term similar to squared curvature and a term measuring gap size. However, because the energy function also includes terms designed to guarantee convergence, the output of their system can be difficult to anticipate. Alter and Basri (1996) analyze the behavior of this network in detail and point out some of its counterintuitive properties.

3 Fundamental Assumption
Like many other problems in vision, figural completion is ill posed; the visual system cannot know the precise shape of an object's boundary where it is hidden from view by another object. The best it can do in such a situation is to make an educated guess. It is conventionally assumed that this guess takes the form of a maximum likelihood estimate; that is, given a prior probability distribution of completion shapes, it is assumed that the visual system chooses the member of the distribution that is most probable. In this article, we propose a somewhat different approach. Instead of computing (just) the maximum likelihood completion shape, we propose that the visual system computes local image plane statistics of the distribution of possible completion shapes. There is more information contained in a distribution than just the location of its mode, and we believe that in the case of the distribution of boundary completion shapes, these other properties are psychophysically meaningful. In particular, we believe that the variance of the distribution about the mode is related to illusory contour sharpness.

Like Mumford (1994), we characterize the distribution of completion shapes using the mathematical device of a particle undergoing a stochastic motion (a directional random walk). The random walk embodies the Gestalt principles of proximity and good continuation, which we believe originate in the statistics of the environment in which our visual systems evolved. However, apart from shortness and smoothness (which have their origins in the world), we make one additional assumption, the origin of which (we believe) is in our heads: that the distribution of shapes that can extend a contour at any point is totally determined by the position and orientation of the contour at that point, that is, that the distribution of shapes can be modeled as a Markov process. An immediate consequence of this assumption is that for the random walk we define, the maximum likelihood path followed by a particle joining two points at fixed orientations can be shown (see the appendix) to be a curve of least energy, where energy is a linear combination of the integral of squared curvature and length. This is the curve that, others have theorized (Horn, 1981), models the shape of illusory contours joining boundary fragments whose orientation difference is significantly less than π/2. The more important consequence, however, is computational: the Markov assumption allows us to factor the stochastic completion field into source and sink fields, each of which can be computed by a linear transformation.

To summarize, it is certainly possible to define prior distributions for boundary completion shape using devices other than a random walk (e.g., co-circularity). Unfortunately, the resulting priors will not be Markov. Consequently, the maximum likelihood completion shapes will not be curves of least energy, and it will not be possible to decompose the completion field into a product of more elementary fields.
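For concreteness, the discrete counterpart of this least-energy functional is easy to write down. The sketch below is an illustrative approximation of ours (turning-angle curvature estimates; the weights alpha and beta are free parameters), not code from the article.

    import numpy as np

    def curve_energy(points, alpha=1.0, beta=1.0):
        # Discrete E = alpha * integral(kappa^2 ds) + beta * length,
        # for an (n, 2) array of samples along the curve (n >= 3).
        pts = np.asarray(points, dtype=float)
        seg = np.diff(pts, axis=0)                  # chords between samples
        ds = np.linalg.norm(seg, axis=1)            # chord lengths
        theta = np.arctan2(seg[:, 1], seg[:, 0])    # tangent directions
        dtheta = np.diff(theta)
        dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
        mean_ds = 0.5 * (ds[:-1] + ds[1:])
        kappa = dtheta / mean_ds                    # turning angle per length
        return alpha * np.sum(kappa**2 * mean_ds) + beta * np.sum(ds)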
the change in a particle's position and orientation as a function of time; the decay constant, τ, specifies a particle's average lifetime.

As part of a study of the reduction of search made possible by prior grouping in visual object recognition, Jacobs (1989) computed the probability density of the size of boundary gaps due to occlusion in random juxtapositions of a set of flat polygonal surfaces. Jacobs concluded that small gaps predominate and that incident frequency rapidly drops off with increasing size. By including a decay mechanism in the stochastic process, we are able to model the component of the completion shape probability distribution dependent on length. Because a certain fraction of particles, $1 - e^{-1/\tau}$, decays per unit time, longer paths are exponentially less likely.

A particle's position and orientation are updated according to the following stochastic nonlinear differential equation:

$$\dot{x} = \cos\theta, \qquad \dot{y} = \sin\theta, \qquad \dot{\theta} = N(0, \sigma^2),$$

where $\dot{x}$ and $\dot{y}$ specify change in position, $\dot{\theta}$ is change in orientation (curvature), and N(0, σ²) is a normally distributed random variable with zero mean and variance equal to σ². There is a strong similarity between the equations of motion defined here and the Kalman filter equations Cox, Rehg, and Hingorani (1993) employed in their work on edge tracking; a particle moves with constant speed in a direction that is continually changing by some random amount. The effect is that particles tend to travel in straight lines, but over time, they drift to the left or right by an amount proportional to σ² (for example random walks, see Figure 2). When σ² = 0, the motion is completely deterministic, and particles never deviate from straight paths. When σ² = ∞, the motion is completely random, and the stochastic process becomes a two-dimensional isotropic random walk. For this reason, we will sometimes refer to the stochastic process we define as a diffusion process and the parameter, σ², as the diffusivity.
Figure 2: (Left) An example of a random walk. (Right) 1000 random walks.
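These equations of motion are straightforward to simulate directly. The following sketch (a minimal Python/NumPy illustration, not the authors' implementation; the step budget and parameter values are arbitrary choices) draws sample trajectories of the kind shown in Figure 2.

```python
import numpy as np

def random_walk(max_steps=200, sigma2=0.005, tau=100.0, rng=None):
    """One directional random walk: unit-speed motion whose orientation
    is perturbed at each step by N(0, sigma2); the particle's lifetime
    is exponentially distributed with mean tau (the decay constant)."""
    rng = rng or np.random.default_rng()
    x = y = theta = 0.0
    path = [(x, y)]
    lifetime = int(rng.exponential(tau))
    for _ in range(min(max_steps, lifetime)):
        theta += rng.normal(0.0, np.sqrt(sigma2))  # d(theta) ~ N(0, sigma^2)
        x += np.cos(theta)                         # dx/dt = cos(theta)
        y += np.sin(theta)                         # dy/dt = sin(theta)
        path.append((x, y))
    return np.array(path)

walks = [random_walk() for _ in range(1000)]       # cf. Figure 2 (right)
```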
4 Problem Formulation

The receptive fields of neurons in area V1 of visual cortex are retinotopically organized and narrowly tuned to stimuli of specific position and orientation (Hubel and Wiesel, 1962). Indeed, it has been suggested (see Blasdel & Obermayer, 1994; or Grossberg & Olson, 1994) that the orientation preference structure that characterizes V1 is a consequence of mapping the space of positions and orientations in the plane (R² × S¹) onto the two-dimensional surface of the cortex (R²) while maximizing locality. It follows that the activity of a population of neurons in V1 can be represented by a probability density function defined over R² × S¹ and that the network of interconnections can be represented by an order six tensor (R² × S¹) × (R² × S¹). Conservatively speaking, the function the network computes can be viewed as a tensor product, mapping an input probability density function to an output probability density function through a linear transformation.

Let the order six tensor G represent the transition probabilities of a Markov process defined on the three-dimensional state space, R² × S¹, and satisfying the equations of motion defined above (G is the Green's function).¹ It follows that G(u, v, φ, x, y, θ; t) is the probability that a random walk of length t will end in state (u, v, φ) given that it began in state (x, y, θ) (see Figure 3). If p(x, y, θ; 0) is the probability density function describing the position and orientation of a particle at the beginning of its random walk, then the probability that the particle will be at some position and orientation at time t is

$$p(u, v, \phi; t) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G(u, v, \phi, x, y, \theta; t)\, p(x, y, \theta; 0) \cdot e^{-t/\tau}.$$
Because the probability of a random walk of a given shape is independent of its initial position and orientation in the plane, the order six tensor, G, consists entirely of translated and rotated copies of an order three tensor
1. Carman (1990) has emphasized the usefulness of Green's functions as descriptions of visual computations, and Carman and Welch (1992, 1993) have specifically proposed that they play a role in the computation of illusory surfaces.
Figure 3: (Upper left) P.d.f. representing source distribution, p(x, y, θ; 0), consists of an impulse located at (0, 0, 0). Although sometimes truncated for clarity, in general, the length of the arrows is proportional to the logarithm of the probability that a particle is located at that position and orientation. (Upper right) Snapshot of diffusion process at t = 15, that is, p(x, y, θ; 15). This is the result of convolving G(u′, v′, φ′; 15) with the p.d.f. representing t = 0. (Lower left and right) Diffusion process at t = 31 and t = 47, respectively.
representing the transition probabilities for random walks beginning at (0, 0, 0):

$$G(u, v, \phi, x, y, \theta; t) = G(u', v', \phi', 0, 0, 0; t),$$

where (u′, v′, φ′) is (u, v, φ) in the coordinate system defined by (x, y, θ), so that u′ = (u − x) cos θ + (v − y) sin θ, v′ = −(u − x) sin θ + (v − y) cos θ,
and φ′ = φ − θ. Henceforward, we will write G(u′, v′, φ′; t) instead of G(u′, v′, φ′, 0, 0, 0; t). The upshot is that by taking advantage of the translational and rotational symmetries of G, the tensor product can be computed by an operation similar to convolution:

$$p(u, v, \phi; t) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G(u', v', \phi'; t)\, p(x, y, \theta; 0) \cdot e^{-t/\tau}.$$
We will show that the stochastic completion field can be expressed as the product of a stochastic source field and a stochastic sink field. We define the stochastic source field, p′(u, v, φ), to be the fraction of paths that begin in a source state and pass through (u, v, φ) before they decay. If we assume that paths do not self-intersect before they decay, then the fraction of paths that pass through a given position and orientation equals the integral of the position and orientation probability density function over time:

$$p'(u, v, \phi) = \int_{0}^{\infty} dt\, p(u, v, \phi; t).$$
By changing the order of integration and defining a new Green's function, G′,

$$G'(u, v, \phi) = \int_{0}^{\infty} dt\, G(u, v, \phi; t) \cdot e^{-t/\tau},$$
the explicit time variable, t, can be suppressed. The result is that the stochastic source field, p′(u, v, φ), can be computed by convolving the probability density function representing the source distribution with the new Green's function (see Figure 4):

$$p'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(u', v', \phi')\, p(x, y, \theta; 0).$$
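To make the structure of this "convolution" concrete, here is a schematic Python rendering of the source-field computation for a sparse set of impulse sources. It assumes (our choice, not the article's implementation) that G′ has been precomputed as an array G_prime[iu, iv, k] sampled on the same grid, centered at index (size//2, size//2), and that nearest-neighbor binning of (u′, v′, φ′) is acceptable.

```python
import numpy as np

def source_field(G_prime, sources, size, m):
    """p'(u, v, phi): superposition of translated and rotated copies of
    G' placed at each impulse source (x, y, theta, weight)."""
    p = np.zeros((size, size, m))
    c = size // 2
    dphi = 2.0 * np.pi / m
    us, vs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    for (x, y, theta, w) in sources:
        # (u', v') = (u - x, v - y) rotated by -theta into the source frame.
        up = (us - x) * np.cos(theta) + (vs - y) * np.sin(theta)
        vp = -(us - x) * np.sin(theta) + (vs - y) * np.cos(theta)
        iu = np.round(up).astype(int) + c
        iv = np.round(vp).astype(int) + c
        ok = (iu >= 0) & (iu < size) & (iv >= 0) & (iv < size)
        dk = int(round(theta / dphi))        # orientation shift: phi' = phi - theta
        for k in range(m):
            p[:, :, k][ok] += w * G_prime[iu[ok], iv[ok], (k - dk) % m]
    return p
```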
Observe that the probability² that a particle will pass through state (u, v, φ) on its way from a source state, p ∈ P, to a sink state, q ∈ Q, before it decays, is proportional to the product of (1) the probability that a particle beginning in a source will reach (u, v, φ) before it decays (the source field) and

2. Strictly speaking, what we are able to compute is a relative likelihood, not a probability. This is because we do not compute the absolute probability of a particle diffusing from a source to a sink by any path at all, which would represent the normalizing constant. So although it is possible to convert these relative likelihoods to probabilities, in practice this is not necessary, because relative likelihoods suffice for the purpose of comparing competing paths.
Figure 4: Stochastic source fields, p′(u, v, φ), representing the probability that a particle leaving (0, 0, 0) will reach (u, v, φ) before it decays. Basically, these are plots of the Green's function G′ (the time integral of G), for a range of diffusivities and decay constants. (Upper left) σ² = 0.05, τ = 100. (Upper right) σ² = 0.05, τ = 20. (Lower left) σ² = 0.01, τ = 100. (Lower right) σ² = 0.01, τ = 20.
(2) the probability that a particle beginning at (u, v, φ) will reach a sink before it decays (the sink field). This is an immediate consequence of the Markov property of the stochastic process. We have already described how the source field is computed. We now show that the sink field can be computed in an analogous manner. The magnitude of the sink field at (u, v, φ) is the product of the probability that a particle beginning at (u, v, φ) will reach (x, y, θ) before it decays (i.e., G′(x, y, θ, u, v, φ)) and the probability that a sink exists at (x, y, θ) (i.e.,
q(x, y, θ; 0)), integrated over all (x, y, θ):

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x, y, \theta, u, v, \phi)\, q(x, y, \theta; 0).$$
Like G, the order six tensor G′ consists entirely of translated and rotated copies of an order three tensor representing the transition probabilities for random walks beginning at (0, 0, 0):

$$G'(x, y, \theta, u, v, \phi) = G'(x', y', \theta', 0, 0, 0),$$

where (x′, y′, θ′) is (x, y, θ) in the coordinate system defined by (u, v, φ), so that x′ = (x − u) cos φ + (y − v) sin φ, y′ = −(x − u) sin φ + (y − v) cos φ, and θ′ = θ − φ. Consequently (like the source field), the sink field can be computed by a kind of convolution:

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x', y', \theta')\, q(x, y, \theta; 0).$$
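A matching sketch for the sink field, under the same assumptions as the source-field sketch above; the completion field is then the pointwise product of the two fields, as defined next.

```python
import numpy as np

def sink_field(G_prime, sinks, size, m):
    """q'(u, v, phi): probability of reaching some sink (x, y, theta, weight)
    from each cell; (x', y', theta') is (x, y, theta) expressed in the
    frame defined by (u, v, phi)."""
    q = np.zeros((size, size, m))
    c = size // 2
    dphi = 2.0 * np.pi / m
    us, vs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    for (x, y, theta, w) in sinks:
        for k in range(m):
            phi = k * dphi
            xp = (x - us) * np.cos(phi) + (y - vs) * np.sin(phi)
            yp = -(x - us) * np.sin(phi) + (y - vs) * np.cos(phi)
            iu = np.round(xp).astype(int) + c
            iv = np.round(yp).astype(int) + c
            ok = (iu >= 0) & (iu < size) & (iv >= 0) & (iv < size)
            kp = (int(round(theta / dphi)) - k) % m   # theta' = theta - phi
            q[:, :, k][ok] += w * G_prime[iu[ok], iv[ok], kp]
    return q

# Completion field: pointwise product of source and sink fields.
# C = source_field(Gp, sources, size, m) * sink_field(Gp, sinks, size, m)
```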
Finally, the stochastic completion field, C(u, v, φ), which represents the relative likelihood that a particle leaving a source state will pass through (u, v, φ) and enter a sink state before it decays, equals the product of the source and sink fields (see Figure 5):

$$C(u, v, \phi) = p'(u, v, \phi) \cdot q'(u, v, \phi).$$

At this point, it is natural to ask how the computation we have just outlined might be implemented. In particular, we wish to identify the neural loci of the representations that form the basis of our model (the source, sink, and completion fields). To begin, we note that it would be somewhat premature for us to identify the source and sink fields with a specific population in V1 because many neural populations in V1 satisfy the basic requirements of having retinotopically organized receptive fields tuned to a specific position and orientation. However, with regard to the completion field, we can be somewhat more definite and identify its locus as the neurons in V2 described by von der Heydt, Peterhans, and Baumgartner (1984). von der Heydt et al. report that the firing rate of these neurons increases when their "receptive fields" are crossed by illusory contours (of specific orientations) induced by pairs of flanking elements. Significantly, these neurons do not respond to these same elements when presented in isolation; they respond only to pairs. It was this discovery that inspired the simple feedforward neural network model proposed by Peterhans, von der Heydt, and Baumgartner (1986). At the implementation level, our model is very similar. However, while Peterhans et al. do not describe the pattern of interconnectivity between the first and second layers of their network in any detail, this pattern is defined
Figure 5: (Top left) Source and sink distribution p(x, y, θ; 0) = q(x, y, θ; 0). The p.d.f. consists of four impulses equally spaced around the circumference of a circle. (Top right) The stochastic source field, p′(u, v, φ), represents the probability that a particle leaving a source will reach (u, v, φ) before it decays. (Bottom left) The stochastic sink field, q′(u, v, φ), represents the probability that a particle will leave state (u, v, φ) and enter a sink state before it decays. (Bottom right) The stochastic completion field, C(u, v, φ), is the product of source and sink fields.
in our model by the Green's function, G′ (see Figure 6). Consequently, our neural network computes a stochastic completion field. Although it is possible to solve the stochastic nonlinear differential equation directly and, in doing so, find an analytic equation for the Green's function (see Thornber & Williams, 1996), we initially solved for G′ by a Monte Carlo method. All of the experimental results in this article are based
Figure 6: The stochastic completion field can be computed by a two-layer feedforward neural network. The source field is computed by the left half of the network and the sink field by the right half. The Green's function, G′, defines the network of interconnections between the first and second layers. The third layer computes the product of the source and sink fields and can be tentatively identified with the neurons in V2 described by von der Heydt, Peterhans, and Baumgartner (1984).
on an approximation of G′ computed by direct simulation of the random walk for 1.0 × 10⁶ trials on a 256 × 256 grid with 36 fixed orientations.³ The particle trajectories are modeled as chains of line segments with real-valued end points and with interior angles drawn from a continuous gaussian distribution. The probability that a particle beginning at (0, 0, 0) will reach (u, v, φ) before it decays (i.e., G′(u, v, φ)) is approximated by the fraction of simulated trajectories beginning at (0, 0, 0) that intersect the region (u ± 1.0, v ± 1.0, φ ± π/72).

3. In a computer vision application, this would be a compile-time cost, not a run-time cost.
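A minimal sketch of such a Monte Carlo estimate follows (illustrative trial count and parameters; the binning is simplified to one grid cell per visit, whereas the article bins into (u ± 1.0, v ± 1.0, φ ± π/72) and counts trajectories rather than visits).

```python
import numpy as np

def estimate_G_prime(trials=10**5, size=256, m=36, sigma2=0.005, tau=20.0,
                     seed=0):
    """Approximate G'(u, v, phi) by direct simulation: the fraction of
    walks from (0, 0, 0) observed in each (u, v, phi) cell before decay."""
    rng = np.random.default_rng(seed)
    G = np.zeros((size, size, m))
    c = size // 2
    for _ in range(trials):
        x = y = theta = 0.0
        steps = int(rng.exponential(tau))      # lifetime before decay
        for _ in range(steps):
            theta += rng.normal(0.0, np.sqrt(sigma2))
            x += np.cos(theta)
            y += np.sin(theta)
            iu, iv = int(round(x)) + c, int(round(y)) + c
            if not (0 <= iu < size and 0 <= iv < size):
                break
            k = int(round(theta / (2 * np.pi / m))) % m
            G[iu, iv, k] += 1.0                # counts visits (simplification)
    return G / trials
```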
Figure 7: Example completion fields. (Left) p1 = (−40, 0, 30°) and q1 = (40, 0, −30°). (Right) p2 = (−40, 0, 30°) and q2 = (40, 0, 30°). The left source and sink pair is scaled by 1.0 × 10⁷ and the right by 1.0 × 10⁸.
Because the end points of these segments are not confined to the integer lattice, the stochastic completion fields we compute are discrete subsamplings of continuous fields, not discrete fields computed by an approximation to the continuous process.

5 Experimental Results

As a first demonstration, let us consider two source-sink pairs. The source and sink distributions were represented by arrays of size 128 × 128 × 36 and consisted of a single oriented impulse (a single nonzero value). In the first case, source and sink are positioned on a horizontal axis and are oriented symmetrically about this axis; in the second, they possess the same orientation, so that the curves of least energy joining them will contain an inflection point. The diffusivity, σ², equaled 0.005, and the decay constant, τ, equaled 20. Figure 7 depicts the stochastic completion field for these source and sink pairs as brightness images where brightness encodes the sum over all 36 orientations. Because the displayed brightnesses for each pair are scaled to take maximum advantage of the limited number of gray levels (i.e., 255), it is not obvious that there is an order of magnitude difference in saliency between the first and second pair. The first source and sink pair is scaled by 1.0 × 10⁷, the second by 1.0 × 10⁸.

It is sometimes assumed that illusory contours cannot contain inflection points. However, we do not believe that there is an all-or-nothing rule that prohibits inflection points (Kellman & Shipley, 1991). Rather, we believe that saliency is a continuum, and illusory contours containing inflection points might simply be an order of magnitude weaker on average.

In a second demonstration, we consider a source distribution
Figure 8: (Left) Ehrenstein figure. (Right) Stochastic completion field for Ehrenstein figure.
consisting of four oriented impulses equally spaced around the circumference of a circle (see Figure 5, top left). This distribution is meant to represent an Ehrenstein figure (see Figure 8, left). The four impulses are located at the end points of the four line segments comprising the figure and possess orientation normal to the segments. Figure 8 (right) shows the stochastic completion field, where brightness encodes the sum over all orientations. The majority of the brightness is confined to a narrow region surrounding an approximately circular ridge line. This is consistent with the shape most often reported by human observers. Interestingly, some observers report seeing a diamond shape, not a circle. We note that a diamond-shaped completion field would be produced if the diffusivity (σ²) were very large and the particle half-life (τ) were very small.

We have also demonstrated our implementation on several well-known illusory contour figures from the visual psychology literature. Before doing this, we needed to find a principled way of translating an image into a set of sources and sinks for the diffusion process. In general, this requires a sophisticated analysis of the local image brightness structure to identify L-junctions, T-junctions, Y-junctions, and X-junctions formed by both contrast and outline edges. Classifying and measuring the multiple orientations at so-called key points (Heitger & von der Heydt, 1993) is a difficult research problem in its own right and the subject of much current research (Simoncelli & Farid, 1996; Freeman & Adelson, 1991; Michaelis & Sommer, 1994; Perona, 1992). For the moment, we use a steerable one-sided filter scheme similar to that of Simoncelli and Farid (1996) to identify orientation discontinuities (i.e., corners) formed by contrast edges and to measure the two
Figure 9: (Top left) A corner of a “pacman.” (Top right) Sources (thin) and sinks (thick) computed by steerable one-sided filter (multiple orientations are due to anti-aliasing). (Bottom left) Kanizsa triangle. (Bottom right) Stochastic completion field summed over all orientations and superimposed on brightness gradient magnitude image. Both the illusory triangle and the three discs are completed.
orientations with precision sufficient for our purposes. The maximum and minimum of the continuous response of the steerable one-sided filter are first measured to approximately 3 degrees of accuracy. As an anti-aliasing measure, a unit mass is distributed proportionally among the two discrete orientations (10 degrees apart) straddling the nominal maximum and minimum. The maximum is interpreted as a source and the minimum as a sink (see Figure 9, top left and right). Generalizing this scheme to classify and measure the wide range of events corresponding to likely points of boundary
occlusion is the only obstacle to demonstrating our work on more realistic images.

Figure 9 (bottom left) shows the Kanizsa triangle stimulus. Key points are located at a positive maximum of curvature. Figure 9 (bottom right) shows the stochastic completion field summed over all orientations and superimposed on the brightness gradient magnitude image. Both the illusory triangle and the three discs are completed. In contrast with the results of Heitger and von der Heydt (1993), no nonmaximum suppression needs to be employed.

Figure 10 (top) shows the Kanizsa "paisley" stimulus. Key points are located at negative minima of curvature. Figure 10 (bottom) shows the stochastic completion field summed over all orientations and superimposed on the brightness gradient magnitude image.

Figure 11 (top) shows a complex illusory contour figure designed by Kanizsa (1979). Key points are located at positive maxima of curvature. It is useful to compare the stochastic completion field computed for this display to the middle panel of Figure 1. Because our diffusion process has no knowledge of the topology of surfaces, the stochastic completion field, shown in Figure 11 (bottom left), contains potential completions that are not perceived by human subjects. Potential completions required to complete the four rectangles are among the most salient, however. Figure 11 (bottom right) shows the logarithm of the stochastic completion field summed over all orientations. In the logarithm image, many additional completions of significantly lower average likelihood become visible. Included among these are those required to complete the four black discs and eight black squares perceived by human subjects.
6 Conclusion

It is widely acknowledged that perceptual organization (segmentation, grouping) is among the most difficult problems facing researchers in computer vision today. Current structure from motion algorithms produce estimates of depth only for isolated points, yet true surface representations are required to compute a stable grasp or to plan an unobstructed path through free space. Current object recognition systems do not work on large model bases, nor do they function robustly in the presence of occlusion. Our work advances the state of the art in perceptual organization by providing a plausible model of illusory contour formation based on a diffusion process.

We have shown how curves of least energy interpolating a set of edge fragments can be computed using biologically plausible algorithms and representations. Significantly, this is accomplished without using numerical relaxation or other explicit minimization but instead relies on the fact that the probability that a particle following a random walk will pass through a particular position and orientation on a path joining two image
Figure 10: (Top) Kanizsa’s “paisleys.” (Bottom) Stochastic completion field integrated over all orientations and superimposed on brightness gradient magnitude image. It is interesting to note that the average likelihoods of the shorter completions (perceived modally) are several orders of magnitude greater than the average likelihoods of the longer completions (perceived amodally).
Figure 11: (Top) A complex figure (Kanizsa, 1979). (Bottom left) Stochastic completion field integrated over all orientations and superimposed on brightness gradient magnitude image. (Bottom right) Logarithm of stochastic completion field integrated over all orientations.
measurements can be computed directly as the product of two vector-field convolutions.

The most closely related work is by Heitger and von der Heydt (1993), Guy and Medioni (1996), and Shashua and Ullman (1988). Although we owe an intellectual debt to these three articles, none proceeds cleanly from a definition of what they want to compute (a function) to a method of computing it (an algorithm). Our main contribution is to show how outwardly similar algorithms and representations follow immediately from a clearly stated assumption: that the prior probability distribution of boundary completion shapes can be modeled by a random walk.
Appendix

In this appendix, we demonstrate the relationship between the diffusion process we have described and so-called curves of least energy. To accomplish this, we employ a discrete-time approximation to the continuous equations of motion. For the discrete-time random walk, we show that the most likely path between a source and a sink is the curve of least energy connecting them. This tells us that, in the limit, as the time-step size becomes small, the most likely path between a source and a sink converges to the continuous curve of least energy. From this, we conclude that our model of diffusion concisely encodes the prior assumption that the most likely shape of an occluded section of an object's boundary is the curve of least energy.

Suppose that a particle begins its random walk at a source p and follows a trajectory Γ, which consists of n unit length steps, along with n changes in angle denoted by κ₁, . . . , κₙ, ending at sink q. That is, its trajectory is an n-sided polygonal arc, comprising unit length segments and with exterior angles denoted by κᵢ. From the definition of the random walk, we know that the density function on the set of paths that such a particle may take is given by

$$g(\Gamma_{pq}) = \frac{1}{\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p} \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-1/\tau}\, e^{-\kappa_i^2 / 2\sigma^2},$$

where $\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$ indicates integration of the density function on the set of paths beginning in source p (i.e., f(Γₚ)) over all paths ending in sink q (see Ash, 1972, for a discussion of conditional probabilities on continuous functions). Taking the logarithm of both sides (and because $\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$ is constant for given p and q),

$$\log(g(\Gamma_{pq})) + C = n\left(-\frac{1}{\tau} - \log(\sigma\sqrt{2\pi})\right) - \sum_{i=1}^{n} \frac{\kappa_i^2}{2\sigma^2}. \tag{A.1}$$
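Spelling out the intermediate step (here C absorbs the constant path-integral normalization, $C = \log \int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$):

$$\log g(\Gamma_{pq})
  = -\log\!\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p
    + \sum_{i=1}^{n}\left(-\frac{1}{\tau}
      - \log(\sigma\sqrt{2\pi})
      - \frac{\kappa_i^2}{2\sigma^2}\right)
  = -C + n\left(-\frac{1}{\tau} - \log(\sigma\sqrt{2\pi})\right)
    - \sum_{i=1}^{n}\frac{\kappa_i^2}{2\sigma^2}.$$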
We now relate this expression to the energy of a continuous curve. This energy is defined as

$$\alpha \int_{\Gamma} \kappa(t)^2\, dt + \beta \int_{\Gamma} dt,$$

where κ(t) is the curvature at Γ(t) and α and β are constants that weight the cost for the length relative to the cost for the total squared curvature. Although the energy of any curve with a curvature discontinuity is infinite, it is natural to consider an analogous quantity for polygonal arcs based on the number of segments (n) and the sum of the squared exterior angles $\left(\sum_{i=1}^{n} \kappa_i^2\right)$. This is precisely the negative of the right side of equation (A.1), with α = 1/2σ² and
β = 1/τ + log(σ√2π). This shows that the log likelihood that a random walk will follow any given polygonal path is linearly related to the energy of that path for a natural discrete formulation of energy. Therefore, the maximum likelihood polygonal path between two points is the discrete curve of least energy. Because the discrete-time random walk converges to a continuous-time random walk as the time-step size, velocity, and diffusivity approach zero, it follows that the continuous equations of motion concisely encode our prior assumption concerning the probability distribution of completion shape.

Acknowledgments

We gratefully acknowledge many helpful conversations with Karvel Thornber. Thanks also to Ingemar Cox and Bill Bialek.

References

Alter, T., & Basri, R. (1996). Extracting salient contours from images: An analysis of the saliency network. Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, pp. 13–20.
Ash, R. (1972). Real analysis and probability. San Diego: Academic Press.
Blasdel, G., & Obermayer, K. (1994). Putative strategies of scene segmentation in monkey visual cortex. Neural Networks, 7(6–7), 865–881.
Carman, G. J. (1990). Mappings of the cerebral cortex. Unpublished doctoral dissertation, California Institute of Technology, Pasadena, CA.
Carman, G. J., & Welch, L. (1992). Three-dimensional illusory contours and surfaces. Nature, 360, 585–587.
Carman, G. J., & Welch, L. (1993). Position, orientation and shape of real and illusory three-dimensional surfaces. Investigative Ophthalmology and Visual Science (ARVO), 34(4), 1131.
Cox, I. J., Rehg, J. M., & Hingorani, S. (1993). A Bayesian multiple hypothesis approach to edge grouping and contour segmentation. International Journal of Computer Vision, 11, 5–24.
David, C., & Zucker, S. W. (1990). Potentials, valleys and dynamic global coverings. International Journal of Computer Vision, 5(3), 219–238.
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters for image analysis, enhancement and wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 891–906.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173–211.
Grossberg, S., & Olson, S. J. (1994). Rules for the cortical map of ocular dominance and orientation columns. Neural Networks, 7(6–7), 883–894.
Guy, G., & Medioni, G. (1996). Inferring global perceptual contours from local features. International Journal of Computer Vision, 20, 113–133.
Heitger, F., & von der Heydt, R. (1993). A computational model of neural contour processing: Figure-ground segregation and illusory contours. Proc. of 4th Intl. Conf. on Computer Vision, Berlin, Germany.
Horn, B. K. P. (1981). The curve of least energy (MIT AI Lab Memo No. 612). Cambridge, MA: MIT.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Huffman, D. A. (1971). Impossible objects as nonsense sentences. Machine Intelligence, 6, 295–323.
Jacobs, D. (1989). Grouping for recognition (MIT AI Lab Memo No. 1177). Cambridge, MA: MIT.
Kanizsa, G. (1979). Organization in vision. New York: Praeger.
Kellman, P. J., & Shipley, T. F. (1991). A theory of visual interpolation in object perception. Cognitive Psychology, 23, 141–221.
Michaelis, M., & Sommer, G. (1994). Junction classification by multiple orientation detection. Proc. of European Conf. on Computer Vision, pp. 101–108.
Mumford, D. (1994). Elastica and computer vision. In C. Bajaj (Ed.), Algebraic geometry and its applications. New York: Springer-Verlag.
Parent, P., & Zucker, S. W. (1989). Trace inference, curvature consistency and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 823–839.
Perona, P. (1992). Steerable, scalable kernels for edge detection and junction analysis. Proc. of European Conf. on Computer Vision, pp. 3–18.
Peterhans, E., von der Heydt, R., & Baumgartner, G. (1986). Neuronal responses to illusory contour stimuli reveal stages of visual cortical processing. In J. Pettigrew, K. Sanderson, & W. Levick (Eds.), Visual neuroscience. Cambridge: Cambridge University Press.
Shashua, A., & Ullman, S. (1988). Structural saliency: The detection of globally salient structures using a locally connected network. 2nd Intl. Conf. on Computer Vision, Clearwater, FL, pp. 321–327.
Simoncelli, E. P., & Farid, H. (1996). Steerable wedge filters for local orientation analysis. IEEE Transactions on Image Processing, 5(9), 1377–1382.
Thornber, K. K., & Williams, L. R. (1996). Analytic solution of stochastic completion fields. Biological Cybernetics, 75, 141–151.
Ullman, S. (1976). Filling in the gaps: The shape of subjective contours and a model for their generation. Biological Cybernetics, 21, 1–6.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262.
Williams, L. R., & Hanson, A. R. (1996). Perceptual completion of occluded surfaces. Computer Vision and Image Understanding, 64(1), 1–20.
Received May 5, 1995; accepted July 12, 1996.
Communicated by Steven Zucker
Local Parallel Computation of Stochastic Completion Fields

Lance R. Williams
David W. Jacobs
NEC Research Institute, Princeton, NJ 08540, USA
We describe a local parallel method for computing the stochastic completion field introduced in the previous article (Williams and Jacobs, 1997). The stochastic completion field represents the likelihood that a completion joining two contour fragments passes through any given position and orientation in the image plane. It is based on the assumption that the prior probability distribution of completion shape can be modeled as a random walk in a lattice of discrete positions and orientations. The local parallel method can be interpreted as a stable finite difference scheme for solving the underlying Fokker-Planck equation identified by Mumford (1994). The resulting algorithm is significantly faster than the previously employed method, which relied on convolution with large-kernel filters computed by Monte Carlo simulation. The complexity of the new method is O(n³m), while that of the previous algorithm was O(n⁴m²) (for an n × n image with m discrete orientations). Perhaps most significant, the use of a local method allows us to model the probability distribution of completion shape using stochastic processes that are neither homogeneous nor isotropic. For example, it is possible to modulate particle decay rate by a directional function of local image brightnesses (i.e., anisotropic decay). The effect is that illusory contours can be made to respect the local image brightness structure. Finally, we note that the new method is more plausible as a neural model since (1) unlike the previous method, it can be computed in a sparse, locally connected network, and (2) the network dynamics are consistent with psychophysical measurements of the time course of illusory contour formation.

1 Introduction

In recent years, several different researchers (Grossberg & Mingolla, 1985; Guy & Medioni, 1996; Heitger & von der Heydt, 1993; Williams & Jacobs, 1997) have formulated computational models of perceptual completion and illusory contours based on large-kernel convolution.¹ Although the function computed in each case is different, each method requires computing

1. See "Stochastic Completion Fields" elsewhere in this journal for a detailed literature review.
Neural Computation 9, 859–881 (1997)
© 1997 Massachusetts Institute of Technology
one or more convolutions with filters represented by kernels of size equal to the largest gaps to be bridged (usually taken to be the size of the image). From the standpoint of computer vision, it should be clear that computing large numbers of convolutions with kernels of size equal to the image size is prohibitively expensive. These methods also have deficiencies as models of human visual processing, since implementation in vivo would require pairs of neurons separated by large amounts of visual arc to be densely interconnected. Although this possibility cannot be ruled out,² a model based on local connections clearly makes weaker demands on neuroanatomy (see Figure 1).

Data from human psychophysics are also inconsistent with methods requiring large-kernel convolution. First, experiments by Rock and Anson (1979) (and by Ramachandran, Ruskin, Cobb, Rogers-Ramachandran, & Tyler, 1994) demonstrate (see Figure 2) that illusory contours can be suppressed by the presence of texture or other figural elements along the path the completion would follow if these elements were absent. That is, the shape, salience, and sharpness of the completion are not solely a function of characteristics of the pair of elements it bridges but are also a function of the intervening pattern of image brightnesses. Second, recent studies of the time course of illusory contour formation (Rubin, Shapley, & Nakayama, 1995) show that the time required for an illusory contour to form is at least partly a function of the size of the gap to be bridged.³ Again, this is consistent with a local-iterative process, since a global process would require an amount of time independent of the size of the gap to be bridged.

In the previous article (Williams and Jacobs, 1997), we presented a theory of the shape, salience, and sharpness of illusory contours and other perceptual completions. This theory was based on the assumption that the visual system computes the probability that an object boundary (possibly occluded) exists at each position and orientation in the visual field. This representation was termed a stochastic completion field. The inputs to this computation are measurements derived from the image about locations where completions might begin and end and a model for the probability distribution of completion shapes. Like Mumford (1994), we proposed that the prior probability distribution of completion shapes can be modeled by
2. Absence of evidence is not evidence of absence.
3. This result is due to Rubin, Shapley, and Nakayama (1995), who used a forced-choice discrimination task together with a masking stimulus to estimate the time required for illusory contour formation. They first showed that the time required for an illusory contour to form increases in direct proportion to distance when the "pacmen" in a Kanizsa display are "pulled apart." Somewhat surprisingly, they then showed that the time required for an illusory contour to form remains approximately constant when the entire figure is uniformly scaled. This somewhat paradoxical result is consistent with a stochastic completion field "pyramid" computed by repeated small-kernel convolution.
Figure 1: Two possible neural networks for computing stochastic completion fields. (a) Convolution with large-kernel filters requires that neurons separated by large amounts of visual arc be densely interconnected. (b) Repeated smallkernel convolution makes weaker demands on neuroanatomy since it can be implemented in a network with only local connections.
Figure 2: Rock and Anson (1979) show that illusory contours in the Kanizsa triangle can be suppressed by the presence of a background texture. (a) Kanizsa triangle. (b) Kanizsa triangle with background texture.
a particle undergoing a stochastic motion (i.e., a directional random walk).⁴ Unlike the familiar two-dimensional isotropic random walk, where a particle's state is simply its position in the plane, the particles of the directional random walk possess both position and orientation. The particle moves in the plane with constant speed in the direction specified by its orientation. Change in orientation ($\dot{\theta}$) is a normally distributed random variable with zero mean and variance equal to σ², so that the particle's orientation (θ) is a Brownian motion. The result is that particles tend to travel in straight lines but deviate from straightness by an amount that depends on σ². This reflects a prior expectation of smoothness. In addition, a constant fraction of particles, $1 - e^{-1/\tau}$, decays per unit time, where τ is the half-life of the particle, and this reflects a prior expectation of shortness.

If there exist two subsets of measurements, P and Q (representing the beginning and ending points of a set of boundary fragments), then the stochastic completion field, C(u, v, φ), is defined as the probability that a particle, initially at (xₚ, yₚ, θₚ), for some p ∈ P, will, in the course of a
4. The important thing is not the particles themselves but the distribution of occluded shapes that the device of stochastic particle motion is intended to characterize. A particle follows a trajectory that represents one of the many possible shapes that an occluded object's boundary might have in the region where it is not directly visible. The probability that a particle follows a path of a particular shape is presumed to equal the frequency with which that shape occurs in the world.
directional random walk, pass through (u, v, φ), on its way to (x_q, y_q, θ_q), for some q ∈ Q. This definition emphasizes the input and output of the computation that the visual system is performing (the function). It describes what information from the image and generic knowledge of the world should be combined to estimate the shape of occluded object boundaries. Using Marr's terminology (Marr, 1982), this is a computational theory-level model.

We feel that our work demonstrates the value of Marr's approach for understanding human vision. This approach distinguishes between theories at three levels: (1) computational theories, (2) algorithm- and representation-level theories, and (3) implementation-level theories. Identical functions might be computable by different algorithms, and identical algorithms can be implemented (in the brain) in different ways. Many previous theories of figural completion have been specified only at the level of algorithm and representation. These models use an abstraction of biological neural networks as a kind of programming language. Unfortunately, it is difficult to understand exactly what problem these programs solve and what assumptions they make. Consequently, the activity of units in these models and the computations they perform are often impossible to interpret. In contrast, our computational theory has allowed us to formulate algorithms based on operations that are meaningful for distributions of contours.⁵

In this article, we show how the stochastic completion field can be computed efficiently in a local parallel network. The network computation is based on a finite difference scheme for solving the Fokker-Planck equation identified by Mumford (1994). This algorithm avoids computationally prohibitive large-kernel convolutions and is more consistent with known psychophysics and neuroanatomy.

2 Review of Previous Method

Because the random walk is defined by a Markov process, C(u, v, φ) is proportional to the product of the probability that a particle beginning in a source will reach (u, v, φ) before it decays (the source field) and the probability that a particle beginning at (u, v, φ) will reach a sink before it decays (the sink field). Furthermore, because the Markov process is translation and rotation invariant, the probability that a particle will reach (u, v, φ) can be computed by "convolving" the source distribution, p(x, y, θ; 0), with the Green's function representing the probability that a particle at (0, 0, 0) at time zero will reach (u′, v′, φ′):

$$p'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(u', v', \phi')\, p(x, y, \theta; 0),$$
5. It has also made obvious the relationship between the output of our algorithms and curves of least energy (see Mumford, 1994; "Stochastic Completion Fields," elsewhere in this journal). This relationship is totally opaque in previous models.
where (u′, v′, φ′) is (u, v, φ) rotated by −θ about (x, y), so that u′ = (u − x) cos θ + (v − y) sin θ, v′ = −(u − x) sin θ + (v − y) cos θ, and φ′ = φ − θ. The Green's function, G′, was originally computed by a Monte Carlo method that involved simulating the random walks of 1.0 × 10⁶ particles on a 256 × 256 grid with 36 fixed orientations. The probability that a particle beginning at (0, 0, 0) reaches (x, y, θ) before it decays was approximated by the fraction of simulated trajectories beginning at (0, 0, 0) that intersect the region (x ± 1.0, y ± 1.0, θ ± π/72). The sink field, q′(u, v, φ), which represents the probability that a particle leaving (u, v, φ) will reach a sink state before it decays, can be computed in a similar fashion:

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x', y', \theta')\, q(x, y, \theta; 0),$$
where (x′, y′, θ′) is (x, y, θ) rotated by −φ about (u, v), so that x′ = (x − u) cos φ + (y − v) sin φ, y′ = −(x − u) sin φ + (y − v) cos φ, and θ′ = θ − φ. Finally, the stochastic completion field, C(u, v, φ), which represents the relative likelihood that a particle leaving a source state will pass through (u, v, φ) and enter a sink state before it decays, equals the product of the source and sink fields:

$$C(u, v, \phi) = p'(u, v, \phi) \cdot q'(u, v, \phi).$$

Clearly, the run-time complexity (neglecting the time required to compute G) of the above process is dominated by the convolution with the Green's function G′. The overall run-time complexity is therefore O(n⁴m²) for an n × n image with m discrete orientations.

3 Local Parallel Computation

In this section we show how to compute the completion field in a local parallel network. Fundamentally, the computation is the same as before:

1. Compute the source field.
2. Compute the sink field.
3. Compute their product.

The difference is in the way the source and sink fields are computed. In the new algorithm, the source field is computed by integrating the Fokker-Planck equation for the stochastic process identified by Mumford (1994). The probability density function representing a particle's position at time t′ is given by

$$p(x, y, \theta; t') = p(x, y, \theta; 0) + \int_{0}^{t'} dt\, \frac{\partial p(x, y, \theta; t)}{\partial t}$$
and

$$\frac{\partial P}{\partial t} = -\cos\theta\, \frac{\partial P}{\partial x} - \sin\theta\, \frac{\partial P}{\partial y} + \frac{\sigma^2}{2}\, \frac{\partial^2 P}{\partial \theta^2} - \frac{1}{\tau}\, P,$$

where P is p(x, y, θ; t). The Fokker-Planck equation for the stochastic process is best understood by considering each of the four terms separately. The first two terms taken together are the so-called advection equation:

$$\frac{\partial P}{\partial t} = -\cos\theta\, \frac{\partial P}{\partial x} - \sin\theta\, \frac{\partial P}{\partial y}.$$

Solutions of this equation have the form p(x, y, θ; t) = p(x + cos θ Δt, y + sin θ Δt, θ; t + Δt) for small Δt. That is, the probability density function for constant θ at times t and t + Δt is related through a translation by the vector [cos θ Δt, sin θ Δt]. Intuitively, for constant θ, probability density is being transported in the θ direction with unit speed. The third term is the classical diffusion equation:

$$\frac{\partial P}{\partial t} = \frac{\sigma^2}{2}\, \frac{\partial^2 P}{\partial \theta^2}.$$

Solutions of the diffusion equation are given by gaussian convolutions. Although, strictly speaking, the gaussian is not defined for angular quantities, for small Δt and variance σ² the following will approximately hold:

$$p(x, y, \theta; t + \Delta t) = \int_{-\pi}^{\pi} d\phi\, p(x, y, \theta - \phi; t)\, \frac{1}{\sigma\sqrt{2\pi\Delta t}}\, \exp(-\phi^2 / 2\sigma^2 \Delta t).$$
The last term implements the exponential decay:

$$\frac{\partial P}{\partial t} = -\frac{1}{\tau}\, P.$$

It is possible to interpret the Fokker-Planck equation in this case as describing a set of independent advection equations (one for each possible value of θ) coupled in the θ dimension by the diffusion equation. Taken together, the terms comprising the Fokker-Planck equation faithfully model the evolution in time of the probability density function representing a particle's position and orientation.

The value of the source field will also depend on the initial conditions (the probability density function [p.d.f.] describing the particle's position at time
zero). In our implementation, the p.d.f. defining the initial conditions consists of oriented impulses located at corners detected in the intensity image. These impulses represent the starting (or ending) points of possible completions. Strictly speaking, the Fokker-Planck equation can model our diffusion process only if we assume smooth initial conditions. Unfortunately, because the p.d.f.s defining the source and sink distributions are discontinuous, their partial derivatives are not defined. Moreover, the discrete method that we describe for solving the Fokker-Planck equation will be accurate only when a first-order approximation holds well over the length of scales that we discretely represent. For now, we will assume that the initial conditions are smooth and are well approximated by first-order expressions. We then show how to solve this partial differential equation iteratively. In the next section, we consider the accuracy of this method and describe a stochastic interpretation of the local method that does not require continuous initial conditions.

A standard technique for solving partial differential equations on a grid is to use the Taylor series expansion to write the partial derivatives at grid points as functions of the values of nearby grid points and of higher-order terms. Then by neglecting higher-order terms, we obtain an equation that expresses the value of a grid point at time t + 1 as a function of nearby grid points at time t, and which is accurate to first order (Ames, 1992; Ghez, 1988). This technique leads to the following iterative method for solving the Fokker-Planck equation:

$$\text{Step 1:}\quad p^{t+1/4}_{x,y,\theta} = p^{t}_{x,y,\theta} - \cos\theta \cdot \begin{cases} p^{t}_{x,y,\theta} - p^{t}_{x-1,y,\theta} & \text{if } \cos\theta > 0 \\ p^{t}_{x+1,y,\theta} - p^{t}_{x,y,\theta} & \text{if } \cos\theta < 0 \end{cases}$$

$$\text{Step 2:}\quad p^{t+1/2}_{x,y,\theta} = p^{t+1/4}_{x,y,\theta} - \sin\theta \cdot \begin{cases} p^{t+1/4}_{x,y,\theta} - p^{t+1/4}_{x,y-1,\theta} & \text{if } \sin\theta > 0 \\ p^{t+1/4}_{x,y+1,\theta} - p^{t+1/4}_{x,y,\theta} & \text{if } \sin\theta < 0 \end{cases}$$

$$\text{Step 3:}\quad p^{t+3/4}_{x,y,\theta} = \lambda\, p^{t+1/2}_{x,y,\theta-\Delta\theta} + (1 - 2\lambda)\, p^{t+1/2}_{x,y,\theta} + \lambda\, p^{t+1/2}_{x,y,\theta+\Delta\theta}$$

$$\text{Step 4:}\quad p^{t+1}_{x,y,\theta} = e^{-1/\tau} \cdot p^{t+3/4}_{x,y,\theta},$$
where λ = σ²/2(Δθ)². These equations require a few comments. First, we have split the computation into four separate steps. This standard fractional method allows us to compute p(x, y, θ; t + 1) with a series of separate convolutions in each of the three spatial directions. Second, notice that the first two steps of the computation depend on the sign of cos θ and sin θ. To understand why, consider a particle traveling in the positive x direction. In this case, we wish to ensure that our approximation of ∂p(x, y, θ; t)/∂x depends on p(x − 1, y, θ; t) but not on p(x + 1, y, θ; t) because p(x, y, θ; t + 1) depends on the first value but not the second. This method of ensuring that the
local difference used to approximate derivatives depends on the direction relevant to the underlying physical process is called upwind differencing and is necessary for stability. Finally, we can see that the recurrence relationship given by these equations can be computed through small-kernel convolution. For example, we can compute the first recurrence using a 1 × 3 kernel centered on the pixel being convolved. This kernel is [cos θ, 1 − cos θ, 0] for cos θ > 0 and [0, 1 + cos θ, −cos θ] for cos θ < 0.

The source field, p′(x, y, θ), which represents the probability that a particle beginning at a source will reach (x, y, θ) before it decays, is computed by integrating the probability density function representing the particle's position over time:

$$p'(x, y, \theta) = \int_{0}^{\infty} dt\, p(x, y, \theta; t).$$
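The four fractional steps translate directly into array operations. The following sketch (Python/NumPy; a schematic rendering under the assumptions Δx = Δy = Δt = 1 and, for brevity, periodic spatial boundaries, so it is not the authors' implementation) advances p by one time step and accumulates the source field by the recurrence given below.

```python
import numpy as np

def fokker_planck_step(p, sigma2, tau, m):
    """One iteration of Steps 1-4 on p[x, y, k], where orientation bin k
    represents theta_k = 2*pi*k/m. np.roll gives periodic boundaries."""
    dtheta = 2.0 * np.pi / m
    lam = sigma2 / (2.0 * dtheta ** 2)       # lambda = sigma^2 / 2(dtheta)^2
    out = np.empty_like(p)
    for k in range(m):
        th = k * dtheta
        c, s = np.cos(th), np.sin(th)
        q = p[:, :, k]
        # Step 1: upwind advection in x (difference taken from upwind side).
        q = q - c * ((q - np.roll(q, 1, axis=0)) if c > 0
                     else (np.roll(q, -1, axis=0) - q))
        # Step 2: upwind advection in y.
        q = q - s * ((q - np.roll(q, 1, axis=1)) if s > 0
                     else (np.roll(q, -1, axis=1) - q))
        out[:, :, k] = q
    # Step 3: diffusion in theta (wraps around, since theta is circular).
    out = (lam * np.roll(out, 1, axis=2) + (1.0 - 2.0 * lam) * out
           + lam * np.roll(out, -1, axis=2))
    # Step 4: exponential decay.
    return np.exp(-1.0 / tau) * out

def accumulate_source_field(p0, sigma2, tau, m, n_iters):
    """p'(x, y, theta): running sum of p over time, using the recurrence
    p'(t + 1) = p'(t) + p(t + 1)."""
    p, acc = p0.copy(), p0.copy()
    for _ in range(n_iters):
        p = fokker_planck_step(p, sigma2, tau, m)
        acc += p
    return acc
```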
We can approximate the integral of p(x, y, θ; t) over all times less than some fixed time t′ as follows:

$$p'(x, y, \theta; t') \approx \sum_{t=0}^{t'} p(x, y, \theta; t),$$
which can be computed by the following recurrence:

$$p'(x, y, \theta; t + 1) = p'(x, y, \theta; t) + p(x, y, \theta; t + 1).$$

Since O(n) iterations are required to bridge the largest gaps that might be found in an image of size n × n and each iteration requires O(n²m) time, the run-time complexity of the small-kernel method is O(n³m). In our experiments, the image size is typically 256 × 256 with 36 discrete orientations. This represents an effective speedup on the order of one thousand times when compared against the previous algorithm. We note that since the new algorithm is local and parallel, the communication overhead on a SIMD computer would be negligible, so that the parallel time complexity would be O(n) on a machine with O(n²m) processors. This is consistent with the estimated time required for illusory contours to form in human vision (Rubin et al., 1995).

4 Stability and Accuracy Considerations

We have presented a method of computing the stochastic completion field using repeated small-kernel convolutions. We now consider how faithfully this new method models the underlying stochastic process. First, we identify the conditions under which our finite-difference scheme converges to the underlying partial differential equation that it models. We then present a stochastic interpretation of the small-kernel method. This interpretation embodies our original (i.e., physical) intuition and applies even in the case
of discontinuous initial conditions. Finally, we compare the accuracy of our new method to our previous approach, which used a single large convolution mask.

We can use standard techniques to determine that our method will be stable, provided that the following three conditions are met (a quick numeric check appears after the list):

1. λ = (σ²Δt)/2(Δθ)² ≤ 1/2.
2. cos θ · Δt/Δx ≤ 1.
3. sin θ · Δt/Δy ≤ 1.
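A hypothetical helper for checking these conditions numerically (note that with Δt = 1, condition 1 reduces to σ ≤ Δθ):

```python
import numpy as np

def is_stable(sigma2, dtheta, dt=1.0, dx=1.0, dy=1.0):
    """Check the three stability conditions of the finite-difference scheme."""
    lam = sigma2 * dt / (2.0 * dtheta ** 2)   # condition 1: lambda <= 1/2
    # |cos(theta)| and |sin(theta)| are at most 1, so conditions 2 and 3
    # hold for every theta exactly when dt <= dx and dt <= dy.
    return (lam <= 0.5) and (dt <= dx) and (dt <= dy)

# Values from the text: sigma^2 = 0.005 and dtheta = pi/36 -> stable.
print(is_stable(0.005, np.pi / 36))
```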
Informally, stability is necessary to ensure that the solution of the partial differential equation at each time step incorporates all the possibly relevant information from the initial conditions and to ensure convergence to the correct solution as space and time are discretized more finely. In the experiments that we report in this article, we assume that Δx = Δy = Δt = 1. Therefore, the last two conditions will always be met. However, should we choose to discretize space more finely (to achieve greater accuracy), then we must also discretize time proportionately. Regarding the first condition, in all of our experiments we set Δθ = π/36. With this level of discretization, our method will be stable only when σ ≤ π/36. All of our experiments use values of σ much less than this limit.

We now present a stochastic interpretation of repeated small-kernel convolution. This discussion serves two purposes. First, the stochastic interpretation more directly embodies our original (physical) intuition that the prior probability distribution of completion shape can be modeled as a random walk in a lattice of discrete positions and orientations. Second, our stochastic interpretation applies even in the case of discontinuous initial conditions, where interpreting the small-kernel method as a first-order approximation to a partial differential equation breaks down.

The small-kernel method can be viewed as computing the probability distribution of the position of a particle undergoing a random walk in a lattice in R² × S¹. Suppose there is a particle in this lattice, with position (x, y, θ), which at the first time step:

• Moves to the neighboring site in the x direction with probability max(0, cos θ).
• Moves to the neighboring site in the −x direction with probability max(0, −cos θ).
• Remains at the current site with probability 1 − |cos θ|.

See Figure 3a. Clearly the convolution defined by the Step 1 equation updates the probability distribution on the lattice consistent with this motion. In a similar fashion, the change in the probability distribution of the particle's y-position is updated in the second time step. This update is described
Figure 3: The small-kernel method can be interpreted as a random walk in a lattice of discrete positions and orientations. (a) Step 1: Advection in x direction (for cos θ ≥ 0). (b) Step 2: Advection in y direction (for sin θ ≥ 0). (c) Step 3: Diffusion in θ . (d) Step 4: Exponential decay.
by the Step 2 equation (see Figure 3b). In the third time step, the particle:

• Moves to the neighboring site in the θ direction with probability λ.
• Moves to the neighboring site in the −θ direction with probability λ.
• Remains at the current site with probability 1 − 2λ.

See Figure 3c. The Step 3 equation updates the probability distribution on the lattice consistent with this motion. Finally, the Step 4 equation implements the exponential decay of the particle (see Figure 3d). After t iterations of these four steps, we have computed the distribution of the position of a set of particles at time t, given initial conditions that specify a distribution of their positions at time zero. This stochastic process is an exact model of
what our small-kernel method computes and is valid even when the initial conditions are discontinuous.

We now estimate the difference in magnitude between the outputs of the small- and large-kernel methods. Previously we constructed a large-kernel filter by simulating the paths of a large number of particles. Each particle starts its stochastic motion at the origin, with a direction of θ = 0. Time was divided into discrete units, and at each time step, Δθ was drawn from a continuous, gaussian distribution. The x and y positions of the particle at each time step were represented as real numbers. The magnitude of the large-kernel filter at any given point equaled the fraction of trajectories intersecting a small volume of x−y−θ space. It is straightforward to show that the distribution of a particle's x position after t time steps is $\sum_{i=1}^{t} c_i$ (where $c_i = \cos\theta_i$ and $s_i = \sin\theta_i$).

We now contrast this with the small-kernel method. If we let $x_i$ be a random variable that is 1 with probability $c_i$ and 0 with probability $1 - c_i$, then repeated small-kernel convolution produces values in the x direction that are a probability distribution for the random variable $\sum_{i=1}^{t} x_i$. Since the expected value of $x_i$ is $c_i$, it follows that the distribution of $\sum_{i=1}^{t} x_i$ has mean $\sum_{i=1}^{t} c_i$ and variance $\sum_{i=1}^{t} (x_i - c_i)^2 = \sum_{i=1}^{t} (c_i - c_i^2)$. This variance can range from 0 (when all $c_i = 0$ or all $c_i = 1$) to as much as t/4 when all $c_i$ equal 1/2 (when θ = π/3). Moreover, it follows from Lindeberg's theorem that if this distribution is normalized to be of mean 0 and unit variance, it will converge to a normal distribution as t → ∞. This depends on the variables satisfying the Lindeberg condition, which states roughly that each term in the sum becomes, in the limit, a vanishingly small component of the overall sum (see Feller, 1971). Similar reasoning tells us that the distribution of a particle's position in the y direction will have mean $\sum_{i=1}^{t} s_i$ and variance $\sum_{i=1}^{t} (y_i - s_i)^2 = \sum_{i=1}^{t} (s_i - s_i^2)$, which can also range from 0 to t/4. So, for example, with t = 100 the standard deviation of this "spread" in the position of a particle is bounded by five pixels.

This effect is not isotropic. The above expressions show that particles traveling in directions parallel to the x and y axes will have zero variance, while those traveling in the π/4 direction will have maximum variance.⁶ Also, the source field can be computed with greater accuracy (if needed) at the cost of additional computation. Suppose that we divide each unit in space, angle, and time into d subunits and perform the same computations on these subunits. Similar calculations show that the variance in each direction will be bounded by t/4d pixels (i.e., the standard deviation drops with

6. The lack of any detectable global phase coherence in the orientation preference structure of visual cortex (Niebur & Wörgötter, 1994) suggests that (unlike the obvious computer implementation on a rectangular grid) systematic long-range errors due to local anisotropy will be minimal in vivo.
We must, however, perform td convolutions on a field of size nd × nd × m, rather than t convolutions on a field of size n × n × m, increasing the run time by a factor of d³.

5 Anisotropic Decay

The fact that the previous computation took the form of a convolution with a large kernel was a simple consequence of the translational and rotational invariance of the random process. That is, the likelihood of a particle's taking a random walk of a given shape is assumed to be independent of the random walk's starting point and orientation. Because the new method is based on repeated small-kernel convolution, we are free to consider stochastic processes that are not the same everywhere, that is, processes that are neither homogeneous nor isotropic. There is some precedent for this in computer vision, where different methods lumped together under the term “anisotropic diffusion” are becoming increasingly popular for image enhancement (Perona & Malik, 1990; Nitzberg & Shiota, 1992; Weickert, 1995). The analogy breaks down in the sense that the quantity being “diffused” is in the one case image brightness, and in our case the probability density of a particle's position and orientation. Conceivably each of the two parameters defining the stochastic process (σ² and τ) could be modulated by local directional functions of the image brightness. Interpreted within a Bayesian framework, this could be viewed as augmenting the prior model with the additional information provided by the local image brightnesses to estimate better the parameters of the stochastic process modeling completion shape within the local neighborhood. In this article, we take only a first step in this direction by modulating the half-life of the particle by a directional function of the local image brightness. Using this anisotropic decay mechanism, we can ensure that illusory contours do not cross large brightness discontinuities.

The first requirement is that in regions where brightness is changing slowly, a particle's half-life should remain constant. This will allow illusory contours to form in these regions. The second requirement is that along edges, the half-life should be very short in directions leading away from the edge but very long in directions leading toward the edge. This will have the effect of channeling particles toward the edges. Finally, we require that the half-life should be very long in the direction tangent to the edge. This will encourage particles to continue traveling along the edge. All of these requirements can be satisfied by a function based on the directional second derivative,

    τ(θ) = τ · exp(D²_θ I)     if D²_{θ+π/2} I ≠ 0 and D¹_θ I > 0;
           τ · exp(−D²_θ I)    if D²_{θ+π/2} I ≠ 0 and D¹_θ I < 0;
           τ + |∇I|            otherwise,

where D¹_θ I and D²_θ I are the first and second directional derivatives of image brightness in the direction of the particle's motion and D²_{θ+π/2} I is the second directional derivative in the direction normal to the particle's motion. The condition D²_{θ+π/2} I ≠ 0 is simply used to decide whether the particle is traveling along an edge (a zero crossing of the second directional derivative in the direction normal to the particle's motion). If so, the half-life is taken to be τ + |∇I|, which becomes τ in homogeneous areas as required, irrespective of whether there are zero crossings. If the particle is not traveling in the direction of an edge, then the half-life is τ · exp(D²_θ I) or τ · exp(−D²_θ I), depending on the sign of D¹_θ I. This expression is large or small depending on whether the particle is moving toward or away from an edge, and it evaluates to τ in homogeneous areas.
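A minimal sketch of this decay function follows, assuming unit-step finite differences and nearest-pixel sampling; the helper names, the step sizes, and the zero-crossing tolerance eps are our assumptions, not the authors' implementation.

```python
import numpy as np

def half_life(I, x, y, theta, tau, eps=1e-6):
    """Sketch of the anisotropic decay function tau(theta) defined above.

    D1 and D2 estimate the first and second directional derivatives of the
    brightness image I along theta; D2n is the second directional
    derivative normal to theta.
    """
    def sample(dx, dy):
        return float(I[int(round(y + dy)), int(round(x + dx))])
    cx, sy = np.cos(theta), np.sin(theta)
    D1 = (sample(cx, sy) - sample(-cx, -sy)) / 2.0
    D2 = sample(cx, sy) - 2.0 * sample(0, 0) + sample(-cx, -sy)
    D2n = sample(-sy, cx) - 2.0 * sample(0, 0) + sample(sy, -cx)
    if abs(D2n) > eps and D1 > 0:
        return tau * np.exp(D2)
    if abs(D2n) > eps and D1 < 0:
        return tau * np.exp(-D2)
    # Zero crossing of D2n (traveling along an edge), or homogeneous region:
    grad_x = (sample(1, 0) - sample(-1, 0)) / 2.0
    grad_y = (sample(0, 1) - sample(0, -1)) / 2.0
    return tau + np.hypot(grad_x, grad_y)     # tau + |grad I|
```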
6 Experimental Results

Through a sequence of pictures shown in Figure 4, we first demonstrate the propagation of two probability density “waves” emitted from a pair of sources located on a line parallel to the x-axis and with orientations equal to π/6 and 5π/6. The pictures show the probability density p(x, y, θ; t) summed over all orientations (the marginal distribution in θ). The time interval between successive pictures in the sequence, Δt, equals 5, the variance, σ², equals 0.005, and the half-life, τ, equals 100. Error due to the inherent anisotropy of the finite-difference scheme is especially pronounced in the first five frames (frames 1–5), where it manifests itself as a flattening of the face of the wave along the vertical direction. This effect diminishes as the size of the wave increases, the result of greater effective resolution. In frames 11–20 the waves from the different sources can be seen to pass through each other. Because they are traveling in different “orientation planes,” there is no “collision.” Finally, we observe that there is a visible decrease in brightness between the first and last frames of the sequence. Although some of this decrease in brightness can be attributed to diffusion in space, the larger part can be attributed to the exponential decay of the particles.

A second sequence of images, shown in Figure 5, depicts the formation of the completion field as a function of time. To simplify this demonstration, p.d.f.s representing the source and sink distributions at time zero were assumed to be equal. Frames 1–10 are totally dark, since insufficient time has elapsed for particles traveling from the two sources to reach each other. The completion field first appears in frame 11, at the point midway between both sources. In frames 12–20 the completion field rapidly spreads outward in both directions away from the midpoint and back toward the two sources.⁷

⁷ Assuming that the neural locus of the stochastic completion field is the population of neurons in V2 identified by von der Heydt, Peterhans, and Baumgartner (1984) (see “Stochastic Completion Fields” elsewhere in this journal), then this prediction could be verified by measuring the latency between stimulus presentation and the onset of the neuron's increased firing rate. Because the wave has to arrive from both flanking elements, the function describing this latency should have the general form max(d₁, d₂) (where d₁ and d₂ are distances from the center of the neuron's receptive field to each flanking element).
Figure 4: Sequence of images depicting probability density “waves” emitted by a pair of sources located on a line parallel to the x-axis and with orientations equal to π/6 and 5π/6. The pictures show the probability density p(x, y, θ; t) summed over all orientations (the marginal distribution in θ). The time interval between successive pictures in the sequence, Δt, equals 5; the variance, σ², equals 0.005; and the half-life, τ, equals 100.
Figure 5: Sequence of images depicting the formation of the completion field (C(u, v, φ ; t)) as a function of time. Frames 1–10 are totally dark, since insufficient time has elapsed for particles traveling from the two sources to reach each other. The completion field first appears in frame 11, at the point midway between both sources. In frames 12–20 the completion field rapidly spreads outward in both directions away from the midpoint and back toward the two sources.
Figure 6: (a) Kanizsa triangle. (b) Stochastic completion field computed by large-kernel convolution with filter generated by Monte Carlo simulation (see “Stochastic Completion Fields” elsewhere in this journal) (summed over all orientations and superimposed on the brightness gradient magnitude image). (c) Stochastic completion field computed by large-kernel convolution with filter generated by analytic expression (Thornber & Williams, 1996). (d) Stochastic completion field computed by repeated small-kernel convolution.
The third demonstration is the well-known Kanizsa triangle (Kanizsa, 1979) (see Figure 6a). The sources and sinks were identified using the method
Figure 7: Numerical artifacts caused by anisotropy inherent in finite difference solution of the advection equation. (a) Completion field in neighborhood of “pacman” computed by large-kernel convolution. (b) Completion field in same neighborhood computed by repeated small-kernel convolution.
described in Williams and Jacobs (1997). Figure 6b shows the stochastic completion field computed using large-kernel convolution with a filter generated by Monte Carlo simulation (see our other article in this issue). The completion field is summed over all orientations and superimposed on the brightness gradient magnitude image for illustrative purposes. Figure 6c shows the same, but this result was computed with a filter generated by numerical integration of the analytic expression for G derived by Thornber and Williams (1996). This is by far the most accurate method of computing the stochastic completion field. Figure 6d shows the stochastic completion field computed using repeated small-kernel convolution. There are small but noticeable numerical artifacts due to the inherent anisotropy of the finite difference scheme for solving the advection equation (see Figure 7). The similarity between Figures 6b–d demonstrates that our characterization of the stochastic completion field is distinct from the algorithm (and neural network) that computes it. Although space considerations preclude our showing them here, we have rerun all of the experiments from our other article in this issue using the small-kernel algorithm and achieved results of similar quality. Finally, we have tested the small-kernel method on a picture of a dinosaur. The stochastic completion field, summed over all orientations and superimposed on an edge image for illustration purposes, is shown in Figure 8. The body of the dinosaur has been completed where it is occluded by the legs. In Figure 9 we demonstrate the effectiveness of the anisotropic decay
Figure 8: (a) Dinosaur image (Apatosaurus). (b) The body of the dinosaur has been completed where it is occluded by the legs.
function described in the previous section. This figure also illustrates the strong connection between the stochastic completion field and the active energy-minimizing contours (“snakes”) of Kass, Witkin, and Terzopoulos (1987). The stochastic completion field can be thought of as a probability distribution representing a family of snakes with given end conditions. This connection is strengthened by the inclusion of the anisotropic decay mechanism, since this device plays a role similar to the “external” energy term of the snake formulation. Figure 9a shows two probability density “waves” emitted from a pair of sources located on the dinosaur's back. The source and sink fields were assumed equal, and the time t = 32. Figure 9b shows the stochastic completion field computed for this pair of sources. The mode of the stochastic completion field (the curve of least energy) lies significantly below the back of the dinosaur. Furthermore, the variance of the distribution is quite large. Figure 9c shows the probability density “waves” emitted from this same pair of sources but with particle half-life determined by the anisotropic decay function. The probability density is maximum where the wave intersects the back of the dinosaur. Figure 9d shows that the mode of the resulting completion field conforms closely to the back of the dinosaur and that the variance of the distribution is significantly reduced.

In our last experiment, we demonstrate the effect of the anisotropic decay function using three variations of the Kanizsa square figure (Ramachandran et al., 1994). Although we compare the results of our simulation against the human percepts in each case, we make no special claims for the specific anisotropic decay mechanism described in this article. Our intention was not to propose a faithful model of human visual processing but to illustrate
Figure 9: The stochastic completion field can be thought of as a probability distribution representing a family of snakes with given end conditions. (a) Probability density at t = 32 (p(x, y, θ ; 32)) for a pair of sources placed on the back of the dinosaur. This was computed by the small-kernel method without anisotropic decay. (b) The mode of the stochastic completion field (the curve of least energy) lies significantly below the back of the dinosaur. (c) Probability density at t = 32 for same sources computed with anisotropic decay. The probability density is maximum where the wave intersects the back of the dinosaur. (d) The mode of the resulting completion field conforms closely to the back of the dinosaur.
that the local-parallel algorithm, unlike the large-kernel algorithm, allows illusory contour formation to be modulated by local image brightnesses. The first variation of the Kanizsa square, shown in Figure 10a, consists of four “pacmen” on a uniform white background. This figure is perceived as an illusory square occluding four black discs. The second, shown in Figure
Figure 10: (a) Kanizsa square. (b) Stochastic completion field magnitude along the line y = 64. (c) Kanizsa square with out-of-phase checkered background (see Ramachandran et al., 1994). (d) The enhancement of the nearest contrast edge is not noticeable unless the magnitudes are multiplied by a very large factor (approximately 1.0 × 10⁶). (e) Kanizsa square with in-phase checkered background. (f) The peak completion field magnitude is increased almost threefold, and the distribution has been significantly sharpened.
10c, is the same except that a checkered background has been added. The phase of the checkered pattern is such that illusory contours joining the “pacmen” must traverse large brightness gradients. Ramachandran et al. report not only that the illusory square is suppressed by the addition of the out-of-phase checkered background but also that the edges of the checker pattern nearest the old illusory square are enhanced. The third variation, shown in Figure 10e, also places the “pacmen” on a checkered background, but this time the phase of the checkered pattern is such that the illusory contours encounter no large brightness gradients but instead coincide with the borders of the squares. Ramachandran et al. report that the illusory square is enhanced by the addition of the in-phase checkered background.

In our simulations, the images were of size 128 × 128 with 12 discrete orientations. The positions and orientations of the sources and sinks (the corners of the “pacmen”) were defined manually. The diffusivity was σ² = 0.01, and the particle half-life was τ = 100. The magnitude of the stochastic completion field along the line y = 64 is displayed to the right of each of the Kanizsa squares. Figures 10b and 10f are scaled by the same factor, so the results may be directly compared. We note that the peak completion field magnitude is increased almost threefold by the addition of the in-phase checkered background and that the distribution has been significantly sharpened. The effect of adding the out-of-phase checkered background is that the stochastic completion field from Figure 10b is almost totally suppressed. To a first approximation, this is consistent with Ramachandran et al.'s observation. However, the enhancement of the nearest contrast edge, also reported in Ramachandran et al. (1994), is not noticeable unless the completion field magnitudes are multiplied by a very large factor (approximately 1.0 × 10⁶); see Figure 10d. To summarize, the results of our simulations are largely consistent with the observations reported in Ramachandran et al. (1994), but the enhancement effect observed in the case of the out-of-phase checkered background is far smaller.
7 Conclusion

We have shown how to model illusory contour formation using repeated, small-kernel convolutions. This method directly improves on several published models because of its greater efficiency (by several orders of magnitude) and because it allows edges of objects to be connected in a way that takes account of all relevant image intensities, forming illusory contours only when they are consistent with these intensities. Our work therefore helps to bridge the gap between work on understanding human performance on illusory contours and work in computer vision (e.g., snakes) on finding contours that are smooth yet faithful to the image intensities.
References

Ames, W. (1992). Numerical methods for partial differential equations. San Diego: Academic Press.
Feller, W. (1971). An introduction to probability theory and its applications (Vol. 2). New York: Wiley.
Ghez, R. (1988). A primer of diffusion problems. New York: Wiley.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173–211.
Guy, G., & Medioni, G. (1996). Inferring global perceptual contours from local features. International Journal of Computer Vision, 20, 113–133.
Heitger, F., & von der Heydt, R. (1993). A computational model of neural contour processing, figure-ground and illusory contours. Proc. of 4th Intl. Conf. on Computer Vision, Berlin, Germany.
Kanizsa, G. (1979). Organization in vision. New York: Praeger.
Kass, M., Witkin, A., & Terzopoulos, D. (1987). Snakes: Active minimum energy seeking contours. Proc. of the First Intl. Conf. on Computer Vision, London, 259–268.
Marr, D. (1982). Vision. San Francisco: Freeman.
Mumford, D. (1994). Elastica and computer vision. In C. Bajaj (Ed.), Algebraic geometry and its applications. New York: Springer-Verlag.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Computation, 6, 602–614.
Nitzberg, M., & Shiota, T. (1992). Nonlinear image filtering with edge and corner enhancement. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14, 826–833.
Perona, P., & Malik, J. (1990). Scale space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12, 629–639.
Ramachandran, V. S., Ruskin, D., Cobb, S., Rogers-Ramachandran, D., & Tyler, C. W. (1994). On the perception of illusory contours. Vision Research, 34(23), 3145–3152.
Rock, I., & Anson, R. (1979). Illusory contours as the solution to a problem. Perception, 8, 665–681.
Rubin, N., Shapley, R., & Nakayama, K. (1995). Rapid propagation speed of signals triggering illusory contours. Investigative Ophthalmology and Visual Science (ARVO), 36(4).
Thornber, K. K., & Williams, L. R. (1996). Analytic solution of stochastic completion fields. Biological Cybernetics, 75, 141–151.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262.
Weickert, J. (1995). Multi-scale texture enhancement. Conf. on Computer Analysis of Image and Patterns, Prague, Czech Republic, 230–237.
Williams, L. R., & Jacobs, D. W. (1997). Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9, 837–858.

Received April 1, 1996; accepted September 25, 1996.
Communicated by Peter Dayan
Optimal, Unsupervised Learning in Invariant Object Recognition

Guy Wallis
Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany
Roland Baddeley Department of Experimental Psychology, University of Oxford, Oxford OX1 3UD, UK
A means for establishing transformation-invariant representations of objects is proposed and analyzed, in which different views are associated on the basis of the temporal order of the presentation of these views, as well as their spatial similarity. Assuming knowledge of the distribution of presentation times, an optimal linear learning rule is derived. Simulations of a competitive network trained on a character recognition task are then used to highlight the success of this learning rule in relation to simple Hebbian learning and to show that the theory can give accurate quantitative predictions for the optimal parameters for such networks.

1 Introduction

How might we learn to recognize an object irrespective of the precise relationship between viewer and object, characterized by viewing angle, distance, and translation? If we believe the view-centered approach to object recognition (Tarr & Pinker, 1989; Bülthoff & Edelman, 1992), then the problem becomes one of associating together a series of different views of the object that may share few, if any, of the features supporting the recognition. A broadly tuned feature-based system would be sufficient to perform recognition over small transformations (Poggio & Edelman, 1990), and the necessary receptive fields might be learned via a simple competitive network using Hebbian learning. However, large shape transformations would require either separate prenormalization for size and translation or separate feature detectors feeding into a final arbitration layer. This then raises the question of how such a final layer might be trained, dismissing the use of any form of supervised training signal.

The use of prenormalization and a final arbiter is in fact in contrast to the evidence we have from the responses of real neurons implicated in object recognition. Invariance seems to be established over a series of processing stages, starting from neurons with restricted receptive fields and culminating in the types of cell responses found in inferior temporal (IT) cortex (Perrett & Oram, 1993; Rolls, 1992). Cells in this region exhibit invariance

Neural Computation 9, 883–894 (1997)
© 1997 Massachusetts Institute of Technology
to combinations of the types of transformations discussed here (Rolls, 1992; Desimone, 1991; Tanaka, Saito, Fukada, & Moriya, 1991). Miyashita (1988) has illustrated that arbitrary stimuli can be associated together by a single IT neuron in primates. The key to the training he used was the association of these images in time, not in space.

The hypothesis expressed here is that temporal relations in the appearance of object views affect learning. In the real world, we see objects for protracted if variable periods while they undergo any manner of natural transformations, such as when we approach an object or rotate it in our hand. The consistent stream of images serves as a cue that all of these images belong to the same object and that it would be expedient to associate them together. This idea has been described elsewhere (Edelman & Weinshall, 1991; Földiák, 1991; Wallis & Rolls, 1997); in this article, we aim to build a theoretical framework for analyzing the response of a neuron over time. From this framework we then intend to derive the form of an optimal, local training rule given the general probability function governing the length of each exposure to images of the same object.

2 Mapping the Output of a Neuron to the Optimal Training Signal

The most obvious signal present in a neuron for performing learning is the activity of the neuron itself. Unfortunately, this Hebbian training signal will not typically be optimal for learning invariant object recognition, due to erroneous classifications made by the neuron to spatially similar images from different objects and spatially dissimilar images derived from the same object. A potentially useful piece of a priori knowledge to overcome this problem is that objects tend to be viewed over variable but extended periods of time. In this work the consequent temporal structure will be used to provide information for improving the use of local neural activity as a training signal. This is achieved here by using a weighted sum of previous neuronal activity rather than by attending to the current neuronal output alone.

Assuming the neuronal output to be a signal described by the function y(t), and the ideal training signal to be s(t), discrepancies between the two signals can be regarded as due to some noise signal n(t). Couched in these terms, the optimal recovery of s(t) from y(t) can be regarded as a classical filtering problem. The form of the optimal linear filter for retrieving s(t) can be derived by Wiener filtering (Wiener, 1949; Press, 1992).¹ The Wiener filter, φ(t), gives an optimal estimate, s̃(t) (in the least-squares sense), of the true signal, s(t), when applied to the noise-contaminated output, y(t).
¹ There are presumably innumerable more optimal nonlinear filters, which are beyond the scope of the Wiener filtering theory given here.
Adopting the convention that capital letters denote functions transformed into the frequency domain, the theorem states:

    s̃(f) → S̃(f) = Y(f) Φ(f)    (2.1)

    Φ(f) = |S(f)|² / (|S(f)|² + |N(f)|²).    (2.2)
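In a discrete setting, equations 2.1 and 2.2 can be applied directly with the FFT. The sketch below, in which the function name and the toy signal are our choices, builds Φ(f) from the signal and noise spectra and applies it to the noisy output:

```python
import numpy as np

def wiener_estimate(y, s, n):
    """Sketch of eqs. 2.1-2.2: optimal linear estimate of s(t) from y(t).

    y: noisy neural output; s: ideal training signal; n: noise samples
    used to estimate |N(f)|^2.
    """
    S2 = np.abs(np.fft.fft(s)) ** 2
    N2 = np.abs(np.fft.fft(n)) ** 2
    Phi = S2 / (S2 + N2 + 1e-12)          # guard against division by zero
    return np.real(np.fft.ifft(np.fft.fft(y) * Phi))

# Toy usage: a periodic box signal (object present 10 of every 100 steps)
# corrupted by white noise of amplitude rho = 0.45.
rng = np.random.default_rng(1)
t = np.arange(1000)
s = ((t % 100) < 10).astype(float)
n = 0.45 * rng.standard_normal(t.size)
s_tilde = wiener_estimate(s + n, s, n)
print(np.mean((s_tilde - s) ** 2), "<", np.mean(n ** 2))  # filtering reduces error
```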
Hence, the optimal filter φ(t) can be determined directly from the estimated system noise n(t) and the true signal s(t). The following two sections consider the predicted optimal filter for three different forms of s(t), each describing a different regime for stimulus presentation.

2.1 Fixed-Length Presentation Times. As a simple first case, a set of stimuli are assumed to appear in identical, fixed-length sequences. Possible forms of the ideal training signal and neural output associated with this training regime appear in Figure 1. The form of s(t) is easily characterized, but there remains the question of how to represent the noise signal n(t). In these calculations it is assumed to be white, that is, uncorrelated across time, making N(f) a constant for all f, denoted ρ. The validity of this assumption will be tested in the experimental section that follows.

The power spectra of s(t) and n(t) are shown in Figure 2. The large low-frequency bias of the training signal s(t) relative to the noise signal n(t) strengthens the hypothesis that a low-pass filter would be effective in removing noise in the training signal. Knowing the form of S(f) and N(f) leads to the following definition of the optimal filter Φ(f):
    Φ(f, τ, T) = τ² / (τ² + ρ² T²)                             :  f = 0;
                 4 sin²(πτf/T) / (4 sin²(πτf/T) + π² f² ρ²)    :  f ∈ Z, f > 0;
                 0                                             :  otherwise,    (2.3)
where T represents the period of an entire training epoch, in which all stimuli are seen once, τ corresponds to the period over which all versions of a single stimulus are presented, and Δ is the time for a single stimulus presentation. Since the training signal is effectively sampled every Δ time units, all signal frequencies above the Nyquist frequency f_N = 1/(2Δ) are subject to aliasing. Noise power is hence constrained to lie between zero and f_N. Signal amplitude above f_N is assumed negligible, since in general a_Δ ≫ a_{f_N}.²
² For example, if T = 100Δ and τ = 10Δ, a_Δ ≈ 50 a_{f_N}.
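The reconstructed equation 2.3 can be evaluated numerically. The sketch below (taking Δ = 1, so τ = 10 and T = 100, with ρ = 0.45) synthesizes the filter weights by the cosine sum that equation 2.8 makes explicit later; the function names, the truncation to 50 frequencies, and the 15-tap window are our choices:

```python
import numpy as np

def Phi(f, tau=10, T=100, rho=0.45):
    """Equation 2.3 as reconstructed above, with Delta = 1 time units."""
    if f == 0:
        return tau**2 / (tau**2 + rho**2 * T**2)
    s2 = 4.0 * np.sin(np.pi * tau * f / T) ** 2
    return s2 / (s2 + np.pi**2 * f**2 * rho**2)

def phi(t, n_freqs=50, T=100):
    """Weight on activity t steps in the past, by cosine synthesis."""
    return sum(Phi(f) * np.cos(2.0 * np.pi * f * t / T) for f in range(n_freqs + 1))

weights = np.array([phi(t) for t in range(15)])
print(np.round(weights / weights[0], 3))   # decaying weighting, cf. Figure 3
```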
Figure 1: (a) The ideal training signal s(t), which is one when the object is present and zero when absent. (b) Example of a possible neural output signal y(t). (c) Attempt to retrieve s(t) from y(t) by taking a weighted sum of the outputs y(t) at previous times.
The first diagram in Figure 3 depicts the optimal filters derived from the expression for Φ(f) over a range of noise values,³ for the example case τ = 10Δ and T = 100Δ. Weightings at each time step are proportional to the weighting of the cell's activation at this time in the past, denoted φ̄(t). The level of noise clearly affects the form of the Wiener filter. In practice, as the neuron learns, its associated error rate will change, and with it, the form of the optimal linear filter. The large negative trough apparent in all of these filter profiles at time T/τ results from fixing the ratio τ/T. In the next section, this trough is seen to disappear when using variable-length presentation times.

³ ρ = 0.45 is equivalent to an error rate of 10 percent; ρ = 0.14 to an error rate of 1 percent.
Figure 2: S(f) and N(f) for a system where presentation times are fixed. f_N = 1/(2Δ) is the Nyquist frequency.
Figure 3: Normalized Wiener filters φ(t) for the three presentation paradigms (panels: Fixed, Slow Decay, and Fast Decay, with curves for noise levels from ρ = 0.1 to ρ = 0.5). The vertical axis represents the weighting accorded to each sample of neural activity; the horizontal axis represents the sample number in multiples of Δ time steps previous to the current sample at time zero.
2.2 Variable-Length Presentation Times. In the real world one might expect to see objects over variable periods of time rather than for fixed periods. Since presentation time is a scale parameter, it is reasonable to propose that the presentation times are in fact Jeffreys distributed (P(τ) ∝ 1/τ) or, equivalently, that log presentation times are uniformly distributed (see Jaynes, 1983), which implies that objects tend to be seen for short periods but are occasionally seen for much longer periods. In this section, new optimal filters are derived from two different presentation probability distributions. In the first, the maximum presentation time for a stimulus, τ̂, is set to be 100Δ, and the minimum presentation length to be Δ. In the second, τ̂ is reduced to 10Δ. The period T is now distributed about a mean dependent on τ̂; hence the mean value of T is given the symbol T_τ̂. Presentation-length probabilities are defined by a simple Jeffreys distribution extending within the range Δ ≤ τ ≤ τ̂, such that P(τ)|_{τ=τ̂} = 0. If k is the number of stimulus classes, then we have:

    P(τ) = Z_τ̂⁻¹ (τ⁻¹ − τ̂⁻¹) :  Δ ≤ τ ≤ τ̂    (2.4)

    Z_τ̂ = Σ_{s=Δ}^{τ̂} (s⁻¹ − τ̂⁻¹) :  Z_{100Δ} = 4.18738 Δ⁻¹,  Z_{10Δ} = 1.92897 Δ⁻¹    (2.5)

    T_τ̂ = k Σ_{τ=Δ}^{τ̂} τ P(τ) :  T_{100Δ} = 11.8212 Δk,  T_{10Δ} = 2.33285 Δk.    (2.6)

The average optimal filter ⟨φ(t)⟩ is then:

    ⟨φ(t)⟩ = Σ_{τ=Δ}^{τ̂} P(τ) φ(t)    (2.7)

           = Σ_{τ=Δ}^{τ̂} Σ_{f=0}^{∞} P(τ) Φ(f, τ, T_τ̂) cos(2πft / T_τ̂).    (2.8)
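As a check on the reconstructed constants, the following sketch (with Δ = 1, so that τ̂ = 100 or 10) evaluates equations 2.4–2.6 directly; the function name is ours:

```python
import numpy as np

def presentation_prior(tau_hat):
    """Eqs. 2.4-2.5 with Delta = 1: P(tau) proportional to 1/tau - 1/tau_hat."""
    taus = np.arange(1, tau_hat + 1)
    Z = np.sum(1.0 / taus - 1.0 / tau_hat)          # eq. 2.5
    P = (1.0 / taus - 1.0 / tau_hat) / Z            # eq. 2.4
    return taus, P, Z

for tau_hat in (100, 10):
    taus, P, Z = presentation_prior(tau_hat)
    T_over_k = np.sum(taus * P)                     # eq. 2.6, divided by k
    print(f"tau_hat = {tau_hat:3d}:  Z = {Z:.5f},  T/k = {T_over_k:.5f}")
# Prints Z = 4.18738, T/k = 11.82121 for tau_hat = 100 and
# Z = 1.92897, T/k = 2.33285 for tau_hat = 10, matching eqs. 2.5-2.6.
```

The averaged filter ⟨φ(t)⟩ of equations 2.7–2.8 then follows by weighting the earlier Phi(f, τ, T) sketch by P(τ) and summing the cosine series.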
The second two graphs of Figure 3 show the filters for these two random presentation distributions. Switching to variable-length presentation sequences sees the disappearance of the trough present in the fixed-length presentation case. In addition, the shorter average presentation-time paradigm yields temporal filters with shorter time constants.

2.3 The Trace Rule. This section describes a locally implementable learning rule that can realize the general form of all of the filters described in the previous section. The learning rule produces a running average of neural activity based on a weighted sum of the neuron's previous activity. The simple recursive form of the learning rule may be important in that it lends itself to implementation locally within a cortical neuron (Wallis & Rolls, 1997).
Figure 4: Graphs showing the change in the predicted optimal value of η with changing error rate (black lines and circles) for the three paradigms (panels: Fixed, T = 100Δ; Slow Decay, τ̂ = 100Δ; Fast Decay, τ̂ = 10Δ). Also shown are the least-mean-squares fit errors at each noise level, shown above the vertical bars. The thick arrows indicate the predicted optimal value of η over the range of noise values shown.
The learning rule has a long history but was used most recently in invariant object recognition for orthogonal images (Földiák, 1991) and nonorthogonal images (Wallis, 1996b), and can be summarized as follows:

    Δw_ij(t) = α ȳ_i(t) x_j :  Σ_j w_ij² = 1 for each ith neuron    (2.9)

    ȳ_i(t) = (1 − η) y_i(t) + η ȳ_i(t−1),    (2.10)
where x_j is the jth input to the neuron, y_i is the output of the ith neuron, w_ij is the jth weight on the ith neuron, η governs the relative influence of the trace and the new input, and ȳ_i(t) represents the value of the ith cell's trace at time t.

The left-most diagram of Figure 4 represents the main result for this section. The optimal value of the trace rule parameter η is plotted as the dark line and varies as a function of the noise ρ. The vertical bars show the relative quality of fit of the filter implemented by the trace rule to the optimal filter, as a least-mean-squares fit for the first 20 steps in the filter, with actual fitting errors appearing on the top of each bar. The optimal value of η gradually drops as the noise level drops but remains around 0.8 over a very wide range of noise, suggesting that a constant trace rule could be used throughout training. The remaining two graphs of Figure 4 show the change in the best-fitting value of η as a function of ρ for the variable presentation times. The graphs are similar to the fixed presentation time case, except that the optimal value of η is a little higher for the slow-decaying case (between 0.8 and 0.9) and somewhat lower for the fast-decaying case (0.7). The average fitting error is also considerably smaller, indicating that the training rule fits the optimal filter more closely under the two probabilistic presentation regimes.
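A minimal sketch of one update under equations 2.9–2.10 follows; the learning rate α, the explicit normalization step, and the omission of the output nonlinearity and competition are our simplifications:

```python
import numpy as np

def trace_rule_step(w, x, y_trace_prev, eta=0.8, alpha=0.05):
    """One trace-rule update for a single output neuron (eqs. 2.9-2.10).

    w: weight vector; x: current input vector; y_trace_prev: the cell's
    trace from the previous time step.
    """
    y = float(w @ x)                                # instantaneous output y_i(t)
    y_trace = (1.0 - eta) * y + eta * y_trace_prev  # eq. 2.10: running trace
    w = w + alpha * y_trace * x                     # eq. 2.9: Hebbian on the trace
    return w / np.linalg.norm(w), y_trace           # enforce sum_j w_ij^2 = 1
```

Setting η = 0 recovers plain Hebbian learning on the instantaneous output, the baseline against which the trace rule is compared in the simulations of Section 3.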
Figure 5: Architecture of the network used in the simulations and the two sets of digits used during training and testing.
The accuracy of these and previous predicted optimal values of η is tested in the next section.

3 Simulating the Predictions of the Wiener Filtering

A series of predicted optimal linear filters, parameterized for stimulus presentation length and the classification error rate of the neurons ρ, have now been produced. To gauge the accuracy of the predictions, a series of simulations were run.

3.1 Methods. A two-layer network was constructed (see Figure 5). The first layer acts as a local feature extraction layer and consists of a 16 × 16 grid of neurons arranged in 4 × 4 pools. Each pool fully samples a corresponding 4 × 4 patch of a 16 × 16 input image. All learning in this layer is standard Hebbian. Above the input layer is a second layer, consisting of a single inhibitory pool of 10 neurons, which fully samples the first layer. Neurons in this layer are trained with the trace rule. Competition acts within each pool in both layers and is implemented using the softmax algorithm (Bridle, 1990), sketched below. Digits from a character set used by LeCun et al. (1989) were presented to the network. During learning, 100 digits—10 of each type—were presented in permuted random sequence according to the three stimulus presentation distributions described in the previous section.
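The within-pool competition might be sketched as follows; the softmax form follows Bridle (1990), but the temperature parameter and function name are our additions for illustration:

```python
import numpy as np

def soft_max_competition(activations, temperature=1.0):
    """Soft competition within a pool: members receive output proportional
    to exp(activation), competing for a fixed total of activity rather
    than winner-take-all."""
    a = np.asarray(activations, dtype=float) / temperature
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

print(np.round(soft_max_competition([1.0, 2.0, 3.0]), 2))   # [0.09 0.24 0.67]
```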
Learning in Invariant Object Recognition
Fast Decay, τ^ = 10 ∆
Slow Decay, τ^ = 100 ∆
Fixed, T=100 ∆
100
80 60 40 Training Set
20
80 60 40 Training Set
20
Cross Validation Set
0
0.2
0.4
η
0.6
0.8
Performance %
100
Performance %
100
Performance %
891
80 60 40 Training Set
20
Cross Validation Set
1
0
0.2
0.4
η
0.6
0.8
Cross Validation Set
1
0
0.2
0.4
η
0.6
0.8
1
Figure 6: Classification performance for the network at differing values of η for the three presentation-length paradigms (panels: Fixed, T = 100Δ; Slow Decay, τ̂ = 100Δ; Fast Decay, τ̂ = 10Δ), measured on both the training set and the cross-validation set. The thick arrows indicate the optimal values of η predicted by the earlier theory, which are seen to be in accord with the optimal values to come out of the simulations.
Such a stream of digits might be generated in the real world while observing a digit printed in a book, by altering the reading distance or tilting a page. Although we have chosen to concentrate on learning deformation invariance of digits here, this system has been shown to be able to cope with other transformations of more natural stimuli, such as faces, undergoing depth rotations, scale changes, and translations (Wallis & Rolls, 1997; Wallis, 1996a). Network performance was measured as recognition of both the 100 digits from the training set and a 100-digit cross-validation set from the same database. In addition to these two performance measures, the amount of noise present in a neuron's output signal when trained from start to steady response was also recorded.

3.2 Results. The results shown in Figure 6 reveal the effect on overall classification performance of changing the value of the trace parameter η over 10 runs of the network. Under each of the three presentation paradigms, the optimal value of η predicted in the previous section is seen to correspond very closely to the peak in the performance graph displayed alongside (see the dark arrows in each of the six graphs). Therefore, despite two possible theoretical problems (the assumptions of independent errors and of linearity), the predictions from the theory are very close to the results of the simulations. For η = 0 the results correspond to simple Hebbian learning. These results are clearly much worse than those achieved using the optimized trace rule. Further comparisons with other architectures and other learning rules are described elsewhere (Wallis, 1996b).

The foregoing calculations depend on the errors' being approximately random. Figure 7 shows the average power spectrum recorded for a neuron
Figure 7: Average log power spectrum for noise n(t) and ideal training signal s(t) in the response of neurons under each of the three presentation-length training paradigms.
Figure 8: Optimal filters calculated from the measured neural noise under each of the three presentation-length training paradigms. The noise signal used to calculate the filters was assumed to be either “white,” that is, uniform across all frequencies (dashed line), or the true low-frequency biased signal (solid line).
monitored from a random starting point to the point at which a steady-state response had been achieved. The spectrum is averaged across all 10 output cells and also across time, with the signals being binned into 100 consecutive stimulus presentations. All three graphs show a peak in the low-frequency end of the noise spectra, although the power of this signal is considerably less than the signal power, and the spectrum is essentially flat for all higher frequencies up to and beyond the point at which noise power exceeds signal power, in accordance with the form of the graph in Figure 2. Knowing the true noise signal permits recalculation of the optimal filters. Figure 8 shows the optimal filters obtained using the true low-frequency biased noise spectrum versus using a flat (white) spectrum, taken as the average of the true noise power across all frequencies. The optimal filter for the true noise case is very similar to the white noise case, although the decay through time is consistently slightly faster. The
fact that this difference is consistent may well be attributed to the fact that one effect of the peak at low frequencies in the true noise signal is to shift the frequency at which the noise and signal spectra cross, in comparison with the crossing point for the flat, white noise case. The white noise curve would cross the signal curve at the same frequency as in the true noise case if its power were reduced slightly, and so the change from white to true noise can be thought of as a reduction of the effective white noise amplitude. Figure 4 showed how lower noise values correspond to smaller time constants, consistent with the shift seen in Figure 8. Despite this shift, the stability of the optimal value of η with respect to changes in noise (see Figure 4) ensures that the predicted optimal value of η never differs by more than 0.05 between the white and true noise cases, supporting the use of a flat noise spectrum in the theory section of this article.

4 Conclusions

This article has proposed that objects encountered in the world are presented in sequences of views of the same object, not randomly. This temporal regularity can then be used to help form invariant representations by training on the basis of a signal generated from a weighted sum of previous neural activity, the form of which can be determined by optimal filtering theory. Further, if we make the reasonable assumption that the log of presentation times is uniformly distributed, then this weighting can be achieved by a simple local learning rule. The validity of this theory has been tested on the recognition of digits, demonstrating that optimal parameters were well predicted and that assumptions about the form of the noise and system linearity were justified.

In cases where we have explicit labels for inputs, supervised learning techniques may well be preferable, but in the natural world, the visual input to the brain often does not have such labels. In this case, any regularity in the input (spatial or temporal) should be exploited by an unsupervised system to learn useful representations. Much work has concentrated on exploiting spatial correlations. Here we show that temporal regularities can also be exploited near optimally using the simple and local trace rule.

Acknowledgments

We are grateful to two anonymous reviewers and this journal's editor in chief for helping to improve the clarity and depth of this article considerably.

References

Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters.
In D. S. Touretzky (Ed.), Neural information processing (Vol. 2, pp. 211–217). San Mateo, CA: Morgan Kaufmann.
Bülthoff, H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, USA, 89, 60–64.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Edelman, S., & Weinshall, D. (1991). A self-organising multiple-view representation of 3D objects. Biological Cybernetics, 64, 209–219.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Jaynes, E. T. (1983). Papers on probability, statistics and statistical physics. Dordrecht: Reidel.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Perrett, D., & Oram, M. W. (1993). Neurophysiology of shape processing. Image and Vision Computing, 11(6), 317–333.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263–266.
Press, W. H. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical areas. Philosophical Transactions of the Royal Society, London (B), 335, 11–21.
Tanaka, K., Saito, H., Fukada, Y., & Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66, 170–189.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 233–282.
Wallis, G. (1996a). Temporal order in object recognition learning. To appear in Journal of Biological Systems.
Wallis, G. (1996b). Using spatio-temporal correlations to learn invariant object recognition. Neural Networks, 9, 1513–1519.
Wallis, G., & Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications. New York: Wiley.
Received February 8, 1996; accepted September 4, 1996.
Communicated by Zhaoping Li
Activation Functions, Computational Goals, and Learning Rules for Local Processors with Contextual Guidance

Jim Kay
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, UK
W. A. Phillips Centre for Cognitive and Computational Neuroscience, University of Stirling, Stirling FK9 4LA, Scotland, UK
Information about context can enable local processors to discover latent variables that are relevant to the context within which they occur, and it can also guide short-term processing. For example, Becker and Hinton (1992) have shown how context can guide learning, and Hummel and Biederman (1992) have shown how it can guide processing in a large neural net for object recognition. This article studies the basic capabilities of a local processor with two distinct classes of inputs: receptive field inputs that provide the primary drive and contextual inputs that modulate their effects. The contextual predictions are used to guide processing without confusing them with receptive field inputs. The processor's transfer function must therefore distinguish these two roles. Given these two classes of input, the information in the output can be decomposed into four disjoint components to provide a space of possible goals in which the unsupervised learning of Linsker (1988) and the internally supervised learning of Becker and Hinton (1992) are special cases. Learning rules are derived from an information-theoretic objective function, and simulations show that a local processor trained with these rules and using an appropriate activation function has the elementary properties required.

1 Introduction

Many studies have shown that useful computational capabilities can arise from local processors whose outputs are a function of just the weighted sum of their inputs. It has also been shown that these capabilities can be enhanced by providing the local processors with specific contextual information that affects processing in a way quite different from that of the primary driving inputs. For example, Becker and Hinton (1992) have shown how specific information about context can be used to guide learning. They assumed multiple processing channels that operate on different input data sets across which there is statistical dependence. The aim of learning was to discover functions of the input data that make this statistical dependence

Neural Computation 9, 895–910 (1997)
© 1997 Massachusetts Institute of Technology
easier to compute. To do this, the mutual information in the outputs of the separate processors was maximized. The channels therefore communicated information to each other about their current outputs, but this information was used only for learning and had no effect on the short-term processing. However, contextual information can also be used to guide processing. For example, Hummel and Biederman (1992) have shown how it could be used to guide processing in a large neural net for object recognition. In their case, the contextual connections, called fast enabling links, were used to determine the phase with which an oscillatory output would be produced but without having any effect on the amplitude. This is important because it was the amplitude of the signal that conveyed information about the extent to which the receptive field inputs met the featural requirements of the local processor. Many detailed psychological findings support the approach of Hummel and Biederman (1992). Analogous proposals have emerged from the discovery of context-sensitive synchronization between the spike trains of cortical cells (Singer, 1990; Engel, König, Kreiter, Schillen, & Singer, 1992; Eckhorn, Reitboeck, Arndt, & Dicke, 1990) and of a basis for it in horizontal intrinsic connections (Löwel & Singer, 1992). Singer (1990) and Engel et al. (1992) call the contextual inputs coupling connections, and Eckhorn et al. (1990) call them linking connections. The aspect of these connections that is crucial here is that they help specify whether a signal is relevant in some immediate context, but they do not change what it means.

From a statistical point of view, the motivation for studying processors with contextual guidance is that they might implement some form of latent structure analysis that discovers the predictive relationships implicit in separate data sets. Some nets can be described as discovering the principal components of variation in the receptive field inputs. Becker and Hinton (1992) describe their algorithm as providing a nonlinear version of canonical correlation. Fisher (1936) developed a method for the extraction of canonical variates, using training data, that provided optimal linear directions of discrimination between underlying feature classes. If class labels are one of the data sets, then the method of canonical correlation (Hotelling, 1936) may be used to provide the Fisher discriminant functions. This is relevant to the case of supervised learning with an external teacher. However, as Kay (1992) showed, canonical correlation can be used in a neural network for the extraction of linear latent variables under contextual supervision. This suggests that discovery of predictive relationships between diverse data sets could be one goal of the cerebral cortex and that this goal can be formulated at the level of local circuits and can be usefully combined with the compressive recoding of the receptive field input data.

This article has four aims: (1) to show how local processors can use contextual input in a way that does not confound it with the information transmitted about the receptive field; (2) to show how finding predictive relations between data sets can be combined with the compressive recoding of the receptive field input data; (3) to apply gradient ascent to the formally specified
goals to derive learning rules for the receptive and contextual field weights; and (4) to describe simulations that test whether such a local processor has the basic properties required.

There is a strong tradition in studies of cortical computation that proposes the reduction of redundancy as a major organizing principle (Barlow, 1961; Atick & Redlich, 1993). This goal can be specified locally and can be formulated as that of maximizing the mutual information between the input and the output of local processors under certain constraints (e.g., Linsker, 1988). Because this goal is formulated for local processors without contextual inputs, however, it cannot take context into account. “In a complex network, or in an animal's brain, it is totally unclear how a component is to ‘decide’ what transformation its connections should perform. If a local optimization principle is to be used—one that does not take account of remote high level goals—then we do not know what information is going to be needed at high levels. Since we don't know what information we can afford to discard, it is reasonable to preserve as much information as possible within the imposed constraints” (Linsker, 1988, p. 116). Given this view, it is then necessary to assume that selection of the information that is relevant within specific contexts must occur at some later and quite distinct stage of processing. The possibility being studied here is that contextual inputs provide local processors with a way of distinguishing relevant from irrelevant information even at early stages of processing. This contextual information does not necessarily have to come from remote high-level goals, however, but could be a specific local context appropriate to the position of the local processor within the system as a whole.

Becker and Hinton (1992) have shown how maximizing the mutual information between the outputs of units that receive their inputs from diverse proximal data sets can provide internal supervision that enables local processors to discover distal variables that are only implicit in the input data. They formulate this goal in information-theoretic terms. Linsker (1988) also uses an information-theoretic approach, but his goal was to maximize the mutual information between the inputs and the outputs of each local processor, without taking any contextual information into account. Section 3 shows how these two goals may be seen as specific points within a large space of possible goals.

Throughout this article, it is assumed that the units of a given local processor are probabilistic, with their values being described by random variables. We denote the values of the m receptive field (RF) units and n contextual field (CF) units, respectively, by R1, R2, . . . , Rm and C1, C2, . . . , Cn, and we use the vector notation R and C. We make no assumptions regarding the probabilistic mechanism that produces the values of R and C; instead the empirical distributions of the input data are used, thus freeing us from rigid probabilistic modeling and allowing greater generality. It is assumed that the output of the local processor is represented by a binary random variable
X, with conditional output probability

    Pr(X = 1 | R = r, C = c) = 1 / (1 + exp(−A(s_r, s_c))),    (1.1)
where s_r = Σ_{i=1}^{m} w_i r_i − w_0 and s_c = Σ_{i=1}^{n} v_i c_i − v_0 denote, respectively, the integrated RF and CF input fields, the {w_i} and the {v_i} denote the weights on the connections between the output and the RF and CF inputs, respectively, w_0 and v_0 are the RF and CF biases, A denotes a function that specifies the internal activation, and we have taken the scale parameter of the logistic nonlinearity to be unity.

2 Activation Functions with Contextual Guidance

Becker and Hinton (1992) did not allow the contextual information to affect the short-term dynamics because they did not want the separate processors to maximize the mutual information in their outputs simply by driving each other. The use of context to affect learning without affecting ongoing processing is not only biologically implausible, however, but it also fails to use context to guide processing. We therefore require an activation function that possesses the following properties:

• If the integrated RF input is zero, then the activation should be zero.
• If the integrated CF input is zero, then the activation should be the integrated RF input.
• If the integrated RF and CF inputs agree, then the gain of the function relating activation to RF input should be increased.
• If the integrated RF and CF inputs disagree, then the gain of the function relating activation to RF input should be decreased.
• Only the integrated RF input should determine the sign of the activation, so that context cannot affect the direction of the output decision.

The following function possesses all of these properties,
1 sr (1 + exp(2sr sc )) 2
(2.1)
and was used in the experiments described below. It is one of a class of functions of the form sr (k1 + (1 − k1 ) exp(k2 sr sc )), where k1 and k2 are constants (0< k1 <1, k2 >0). This class of functions was motivated in part by the assumption that the effect of context should depend on the prevailing state of activation and was also derived mathematically from the above requirements (Kay, 1994). This class of functions does not uniquely encapsulate the above functional requirements but nevertheless is sufficient. The activation function is illustrated in Figure 1. The activation function (see equation 2.1) is not intended to translate directly into neurophysiology, and we do not know whether any translation is possible. We do know, however, that cortical pyramidal cells in general
Functions, Goals, and Rules for Local Processors
899
Output Probability 0 .2 .4 .6 .8 1
A
4 Int 2 eg ra 0 ted CF -2 Inp -4 ut
-4
2 t 0 Inpu -2 ed RF at r g e t
4
In
0.8 0.6 0.4
Output Probability
0.0
0.2
0.8 0.6 0.4 0.2 0.0
Output Probability
1.0
C
1.0
B
-2
-1
0 Integrated RF Input
1
2
-2
-1
0
1
2
Integrated CF Input
Figure 1: (A) The surface of output probability with respect to the integrated RF and CF inputs. It is apparent that when the RF and CF inputs agree, appropriate saturation takes place. The inbuilt asymmetry of the activation function is clear when one considers the quadrants where the RF and CF inputs disagree: when the RF value is positive and the CF value negative, the rate at which the probability reaches saturation as the RF value increases is depressed only slightly, whereas when the RF and CF values are reversed, the high positive CF value cannot drive the probability to saturation at 1. (B) The output probability plotted as a function of RF activity for five fixed values of the CF activity. These cross-sectional curves are constrained to pass through the point where the probability is 0.5 and the RF is zero, and they do not intersect. We see that the CF activity boosts the rate of saturation of the probability. (C) A cross-section through the probability surface for five fixed values of the RF activity. This panel is strikingly different from (B). It illustrates that the probability curves cannot pass through the 0.5 barrier and that appropriate saturation cannot be achieved by the CF activity alone.
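The required properties of the activation function can be checked directly. The following minimal Python sketch (our illustration; the function and variable names are ours, not from the article) evaluates equations 1.1 and 2.1 and verifies the five requirements listed above.

```python
import numpy as np

def activation(s_r, s_c):
    """Contextual activation of equation 2.1: A(s_r, s_c) = 0.5 * s_r * (1 + exp(2 s_r s_c))."""
    return 0.5 * s_r * (1.0 + np.exp(2.0 * s_r * s_c))

def output_probability(s_r, s_c):
    """Conditional output probability of equation 1.1 (logistic of the activation)."""
    return 1.0 / (1.0 + np.exp(-activation(s_r, s_c)))

# The five properties required of A in section 2:
assert activation(0.0, 3.0) == 0.0                    # zero RF input -> zero activation
assert activation(1.5, 0.0) == 1.5                    # zero CF input -> activation equals RF input
assert activation(1.0, 1.0) > activation(1.0, 0.0)    # agreement increases the gain
assert activation(1.0, -1.0) < activation(1.0, 0.0)   # disagreement decreases the gain
assert np.sign(activation(-1.0, 5.0)) == -1.0         # the RF input alone determines the sign
```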
We do know, however, that cortical pyramidal cells in general have voltage-dependent gain control channels. Furthermore, Hirsch and Gilbert (1991) have shown that the long-range horizontal collaterals in V1 have a voltage-dependent rather than a driving synaptic physiology and can learn (Hirsch & Gilbert, 1993). These collaterals join pyramidal cells with nonoverlapping RFs and are a primary example of what we call contextual inputs.

3 Information-Theoretic Objective Functions

To show how the use of contextual inputs extends the range of possible goals, we compare the computational goals proposed by Linsker (1988) and Becker and Hinton (1992). Linsker's Infomax goal is to maximize the mutual information I(X; R) between the RF inputs, R, and the output X. Becker and Hinton's explicitly stated goal was to maximize I(X; C), that is, the mutual information between the output of one channel and the contextual inputs from other channels. However, their implicit intention was to maximize the shared information between X, R, and C. To express this explicitly we introduce the concept of three-way mutual information, which is defined by

$$I(X; R; C) = I(X; R) - I(X; R \mid C) \tag{3.1}$$

and equivalent versions (see Figure 2). Note that in the general form of definition 3.1, X may also be a vector and that this definition can be extended to n-way mutual information (Kay, 1994). The various information components can be most easily seen with the aid of a diagram (see Figure 2). When contextual connections from neighboring channels exist, we may consider definition 3.1 in the form I(X; R) = I(X; R; C) + I(X; R|C), so that the information shared by the output and RF inputs may be decomposed into that shared also with the CF, that is, I(X; R; C), and that which is not shared with the CF, that is, I(X; R|C). Similarly, I(X; C|R) denotes the information that is transmitted about the contextual inputs that is not shared with the RF activity. The term H(X|R, C) denotes the conditional Shannon entropy in the output given RF and CF inputs and represents the output information shared with neither the RFs nor the CFs. It can be seen from Figure 2 that the information in the marginal output distribution may be decomposed as follows: H(X) = I(X; R; C) + I(X; R|C) + I(X; C|R) + H(X|R, C). We would normally wish to decrease I(X; C|R) and H(X|R, C), but we desire I(X; R; C) to grow and possibly also I(X; R|C). A class of objective functions that specifies these computational goals is given by the following:

$$F = I(X; R; C) + \phi_1 I(X; R \mid C) + \phi_2 I(X; C \mid R) + \phi_3 H(X \mid R, C). \tag{3.2}$$
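Since all quantities here are defined for discrete variables, the decomposition of equation 3.1 can be evaluated directly from an empirical joint distribution. The sketch below is ours, not the authors' code; the array layout and names are assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability array; zero entries are ignored."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table with axes (x, y)."""
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)

def three_way_info(p_xrc):
    """I(X;R;C) = I(X;R) - I(X;R|C) (equation 3.1), from a joint table with axes (x, r, c)."""
    p_c = p_xrc.sum(axis=(0, 1))
    i_xr_given_c = sum(p_c[c] * mutual_info(p_xrc[:, :, c] / p_c[c])
                       for c in range(p_xrc.shape[2]) if p_c[c] > 0)
    return mutual_info(p_xrc.sum(axis=2)) - i_xr_given_c

# Toy joint distribution over binary X, R, C (hypothetical numbers summing to 1):
p = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)
print(three_way_info(p))
```

Note that, unlike two-way mutual information, the three-way quantity can be negative.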
Figure 2: The decomposition of the information in the output into four disjoint components. Each circle represents the total information in each component of the local processor. The inset shows the flow of information through the processor.
The parameters φ1, φ2, and φ3 express the importance of their respective information components relative to the three-way mutual information I(X; R; C). We now focus on the subclass of objective functions where φ2 = φ3 = 0. Taking φ1 = 1, and so giving the terms I(X; R; C) and I(X; R|C) equal importance, we obtain from equation 3.1 that F = I(X; R). This is Linsker's Infomax objective function; but note that in order to obtain a formal equivalence to his computational goal, the contextual connections would have to be set to zero.
Becker and Hinton intended to maximize I(X; R; C), which is achieved by taking φ1 = 0. Hence, φ1 is a critical parameter; allowing it to increase from 0 to 1 makes it possible to specify the relative importance to be afforded to maximizing information transmission within channels as compared with maximizing predictability across channels. Schmidhuber and Prelinger (1993) describe an algorithm for discovering predictable classifications. Taking φ1 = 1 − ε, φ2 = ε, and φ3 = 0 in our system gives the objective function F = εI(X; C) + (1 − ε)I(X; R), which is an information-theoretic analog of their objective function. Thus, a general class of objective functions has been introduced that subsumes the computational goals of Linsker (1988) and Becker and Hinton (1992) as special cases and allows hybrid possibilities.

4 Learning Rules

Learning rules for the modification of the RF and CF weights were derived using gradient ascent. A summary is presented in the appendix. The rules are

$$\frac{\partial F}{\partial \mathbf{w}} = \left\langle (\psi_3 A - \bar{O})\, p(1 - p)\, \frac{\partial A}{\partial s_r}\, \mathbf{R} \right\rangle_{R,C} \tag{4.1}$$

$$\frac{\partial F}{\partial \mathbf{v}} = \left\langle (\psi_3 A - \bar{O})\, p(1 - p)\, \frac{\partial A}{\partial s_c}\, \mathbf{C} \right\rangle_{R,C} \tag{4.2}$$

where

$$\bar{O} = \log \frac{E}{1 - E} - \psi_1 \log \frac{E_R}{1 - E_R} - \psi_2 \log \frac{E_C}{1 - E_C}.$$
In practice, empirical averages are taken over input patterns, which means that no explicit probabilistic modeling of inputs is required; nor is it necessary to learn explicitly the joint probability distribution of R and C. All that is required is to learn the expectations E, ER, and EC. For example, EC is the average output probability taken over all RF input patterns for which the contextual inputs are equal to C. The weight changes Δw and Δv are taken to be proportional to the partial derivatives in equations 4.1 and 4.2, respectively. These equations are written for batch learning; online learning is performed by removing the averaging brackets. It is important to stress that the required weight changes and also the expectations may be calculated recursively and updated after the presentation of each pattern, and so online learning may be achieved without making a double pass through the data within each epoch (Kay, 1994). Online learning was used successfully in the experiments described in section 5.
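The following Python sketch shows one way this recursive bookkeeping and online update could look. It is our reading of the procedure, not the authors' code: exponential running averages with a fixed rate stand in for the exact recursive averages, and all names are illustrative. The derivatives of equation 2.1 used here are stated explicitly just below.

```python
import numpy as np

def logit(q):
    """Log-odds; assumes 0 < q < 1."""
    return np.log(q / (1.0 - q))

class RunningAverages:
    """Recursive estimates of E, E_R, and E_C, updated after each pattern so that
    no second pass over the data is needed within an epoch."""
    def __init__(self, rate=0.01):
        self.rate, self.E = rate, 0.5
        self.E_R, self.E_C = {}, {}      # one running average per distinct RF / CF pattern

    def update(self, r_key, c_key, p):
        self.E += self.rate * (p - self.E)
        for table, key in ((self.E_R, r_key), (self.E_C, c_key)):
            old = table.get(key, 0.5)
            table[key] = old + self.rate * (p - old)
        return self.E, self.E_R[r_key], self.E_C[c_key]

def online_step(w, v, r, c, avgs, psi=(1.0, 1.0, -1.0), lr=0.05):
    """One online update of RF weights w and CF weights v (equations 4.1 and 4.2
    with the averaging brackets removed). psi holds (psi1, psi2, psi3); the
    default corresponds to maximizing the three-way mutual information
    (phi1 = phi2 = phi3 = 0). Biases are assumed folded into w and v via extra
    inputs clamped at -1, as in the appendix."""
    psi1, psi2, psi3 = psi
    s_r, s_c = w @ r, v @ c                      # integrated RF and CF fields
    e2 = np.exp(2.0 * s_r * s_c)
    A = 0.5 * s_r * (1.0 + e2)                   # equation 2.1
    p = 1.0 / (1.0 + np.exp(-A))                 # equation 1.1
    E, E_R, E_C = avgs.update(tuple(r), tuple(c), p)
    o_bar = logit(E) - psi1 * logit(E_R) - psi2 * logit(E_C)
    common = lr * (psi3 * A - o_bar) * p * (1.0 - p)
    w += common * (0.5 + (0.5 + s_r * s_c) * e2) * r   # dA/ds_r times R
    v += common * (s_r ** 2 * e2) * c                  # dA/ds_c times C
    return p
```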
The term Ō represents a dynamic average that can be influenced by the current CF inputs as well as by the current RF inputs. The term p(1 − p) provides intrinsic weight stabilization. The partial derivatives of the activation function (see equation 2.1) are given by

$$\frac{\partial A}{\partial s_r} = \frac{1}{2} + \left(\frac{1}{2} + s_r s_c\right) \exp(2 s_r s_c)$$

$$\frac{\partial A}{\partial s_c} = s_r^2 \exp(2 s_r s_c).$$

The weight change specified by these rules is nonmonotonically related to postsynaptic activity in a similar way to that proposed by Bienenstock, Cooper, and Munro (1982). It is also similar to the simpler form of nonmonotonicity found by Artola, Brocher, and Singer (1990) in studies of plasticity in slices of adult rat neocortex, which has been shown to have useful computational properties by Hancock, Smith, and Phillips (1991).

5 Experiments and Discussion

We now provide two different illustrations of the role of contextual guidance in the processing of receptive field data. The following experiments were conducted using a net with two channels, each having its own receptive field inputs and a single output unit, with contextual connections between the outputs. Hence, the output unit within each of the channels is an example of a local processor as defined in this article. In all of the experiments online learning was used; the initial weights were generated randomly from the uniform distribution on the interval [−0.01, 0.01]; the learning rate per input pattern was taken to be 0.5 divided by the number of input patterns; the input patterns were presented randomly to the network; the network was set to learning mode for 1000 epochs, by which time the objective function had stabilized in all of the experimental runs; and the means of the random variables representing the outputs of the local processors were communicated to each other in order to provide mutual contextual guidance.

5.1 Discovery of Latent Structure Using Contextual Guidance. The purpose of the experiments described in this subsection is fourfold: (1) to show that when the three-way mutual information is used as the objective function for each local processor, the net can "discover" the variable that is predictably related across channels, but is not the most informative variable within channels, even when the cross-channel correlation is weak; (2) to show that this variable is discovered more quickly when it is also the most informative variable within channels; (3) to demonstrate that the use of Infomax on the combined receptive field data (treated as a single data set) fails to discover the relevant variable; and (4) to demonstrate that a number of other obvious activation functions do not produce the desired solution in experiment 1.
Four distinct binary input patterns were generated comprising positive and negative horizontal and vertical bar patterns of size 5 × 5, the sign of the pattern being determined by the sign of the central horizontal row or vertical column of the input patterns, with all other entries of the patterns having the opposite sign. In experiment 1, a batch of 100 input patterns was used, consisting of an equal number (14) of positive and negative horizontal bar patterns and an equal number (36) of positive and negative vertical bar patterns, and these 100 patterns were presented randomly within both channels. The patterns were presented so that the horizontal bar pattern was present in both channels on 28 percent of occasions and its sign was perfectly correlated across channels. In the other 72 percent of presentations, the vertical bar pattern was presented to both channels, but its sign was uncorrelated across channels. When the objective function at each output was the three-way mutual information (φ1 = 0), the RF weights in both channels converged to the pattern displayed in Figure 3A (or its negative) and both processors signaled the sign of the horizontal bar pattern. That is, the output probabilities obtained when the positive and negative horizontal bars were presented to the network were 1 and 0 (or vice versa), respectively, while the vertical bar patterns produced values close to 0.5. On the other hand, when the objective function was set to Infomax (φ1 = 1 and the cross-channel contextual connections fixed at 0), the result was different. The RF weights in both channels converged to the pattern shown in Figure 3C (or its negative), but this time the sign of the input pattern was signaled by the processors without any distinction as to whether the horizontal or vertical bar pattern was present. That is, the output probability obtained when the positive horizontal and vertical bar patterns were presented was 1, and it was 0 for the negative patterns (or vice versa). Thus, we conclude in the Infomax case that the four distinct input patterns are clustered into two groups defined solely by the sign of their bar pattern, but the presence of a horizontal or vertical bar pattern cannot be distinguished. On the other hand, when the three-way mutual information is the objective function, the sign of the horizontal bar is signaled by each processor; this is the variable correlated (weakly) across channels and indicates the relevance of the horizontal bar pattern, as opposed to the vertical one, with respect to the current context. This demonstration illustrates the critical role played by the φ1 parameter in the objective function and the role of contextual guidance in selecting and signaling the relevant variable in the receptive field data. The patterns shown in Figure 3 are very close to those obtained by performing a traditional principal component analysis on the data and are, approximately, the second (A) and first (C) principal components, respectively.
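For concreteness, the bar stimuli and the cross-channel presentation statistics described above can be generated as follows. This is a sketch under our reading of the description; the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def bar_pattern(orientation, sign, size=5):
    """A size x size bar pattern: the central row (horizontal) or central column
    (vertical) takes the value `sign`; all other entries take the opposite sign."""
    p = np.full((size, size), -float(sign))
    if orientation == "horizontal":
        p[size // 2, :] = sign
    else:
        p[:, size // 2] = sign
    return p.ravel()

def sample_channel_pair():
    """One presentation for experiment 1: a horizontal bar in both channels with
    perfectly correlated signs on 28 percent of occasions; otherwise vertical
    bars with signs chosen independently in each channel."""
    if rng.random() < 0.28:
        s = rng.choice([-1, 1])
        return bar_pattern("horizontal", s), bar_pattern("horizontal", s)
    return (bar_pattern("vertical", rng.choice([-1, 1])),
            bar_pattern("vertical", rng.choice([-1, 1])))
```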
Figure 3: (A) The converged RF weights when the sign of the horizontal bar pattern is correlated across channels but not the most informative variable within channels, and the computational goal was to maximize the three-way mutual information. (C) The converged RF weights when the computational goal was Infomax. (B, D) The signs of the weights in (A) and (C), respectively. The converged weights when the RF and CF inputs were concatenated and presented in a single channel, and the objective was Infomax, are displayed in (E) and their signs in (F).
In experiment 2, the relative frequencies of presentation of the horizontal and vertical patterns were reversed, with a horizontal and a vertical bar pattern present on 72 percent and 28 percent of occasions, respectively. In this case, the sign of the horizontal bar is the most informative variable within channels as well as being correlated across channels. In both cases, when the three-way mutual information and Infomax were employed as objective functions, the RF weights converged to a pattern of the form in Figure 3A; when the three-way mutual information was used, the speed of learning was greater than in the previous experiment, as is illustrated in Figure 4. We divided the input data into two distinct channels, which we termed the RF and CF inputs.
Figure 4: The three-way mutual information during the course of learning in the experiments involving the horizontal and vertical bar patterns. The upper two graphs show the objective function for both local processors in the case where the sign of the horizontal bar is correlated across channels and also the most informative variable within channels. The lower graphs were obtained in the case where the sign of the horizontal bar was correlated across channels but not the most informative variable within channels.
In experiment 3, we reconsider the first experiment but present the combined RF and CF input data within a single channel and attempt to discover the sign of the horizontal bar pattern using Infomax as the objective function. Figure 3E shows a typical example of the stabilized RF weights in this case. There are six input patterns. The output probability was 1 when the 5 × 10 pattern was a positive horizontal bar pattern, a concatenation of two positive vertical bar patterns, or a concatenation of a negative with a positive vertical bar pattern, and it was 0 for the other three inputs; thus, the stabilized net fails to signal unambiguously the sign of the horizontal bar. Of course, this is hardly surprising. After all, there is nothing in the Infomax goal to signify the additional information that the sign of the horizontal bar pattern is the relevant variable; that is where contextual guidance is required. This simple experiment demonstrates the value of separating the input data into two separate channels.
Finally, in experiment 4, experiment 1 was conducted using different separable activation functions: $s_r + s_c$, $s_r s_c$, $s_r + s_r s_c$, and $s_r \exp(s_c)$. With the exception of $s_r s_c$, with which no learning occurred, the use of these activation functions failed to produce the desired result: the signaling of the sign of the horizontal bar pattern. An Infomax-type solution was obtained, with the output probability being 1 for both positive patterns and 0 for the negative ones (or vice versa). This demonstrates that a nonseparable activation function is required and that the type defined in equation 2.1 is sufficient.

5.2 Contextual Guidance from a Stochastic Supervisor. The purpose of this experiment is to discover whether the provision of incomplete information about the "true" class of an input pattern, in the form of contextual guidance, improves on the performance of unsupervised classification. Real data were used from a study of a rare hypertensive syndrome (Conn's syndrome), which can be due to either a benign tumor in the adrenal cortex (type A) or bilateral hyperplasia of the adrenal glands (type B) (Brown, Fraser, Lever, & Robertson, 1971). Data are available on 31 patients comprising age, blood pressures, and five blood plasma measurements (Aitchison & Dunsmore, 1975), thus providing eight-dimensional input patterns; 20 of the patients are known, postsurgery, to have type A and the remaining 11 type B. The data for each variable were recoded as +1 for an observation above the mean value and −1 otherwise. Real examples raise an important computational issue for the approach defined in this article: it is required to store a conditional average ER for each distinct input pattern. Clearly, with large data sets, this could produce a serious computational overload. However, it transpires that this is not a problem at all; there is an exact simplification of the mathematics in this case. Each distinct RF input pattern corresponds to only one CF pattern, and hence the conditional average ER becomes the current output probability, and so this potential storage problem disappears (Kay, 1994). It is not even necessary to rewrite this case as a special form of the learning rules, as the recursive, online nature of the algorithm takes care of it automatically. In the experiment, the RF data were the 31 eight-dimensional binary patterns. The CF data were generated as follows. The true class types were represented as (0, 1) or (1, 0) patterns. These were linearly transformed to (−1, 1) and (1, −1), respectively. Then each of these patterns was multiplied by a random number generated from the uniform distribution on the interval [0.2, 0.8]. This corresponds to the assumption that the probability for the correct class varies randomly and uniformly between 0.6 and 0.9. These patterns provided incomplete information as to the true class of each RF input pattern, and these CF inputs provided stochastic supervision. When the computational goal was Coherent Infomax, the architecture employed consisted of eight RF inputs and two CF inputs in two different channels. Each channel had an output unit, and these provided each other with cross-channel contextual guidance.
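The stochastic supervision inputs can be sketched as follows. This is our illustration; reading the scaled value as 2p − 1 is our interpretation, chosen because it reproduces the stated range [0.6, 0.9].

```python
import numpy as np

def stochastic_supervisor(true_class, rng):
    """A CF pattern carrying incomplete class information, following the recipe
    above: the class is coded as (-1, 1) for type A and (1, -1) for type B, then
    scaled by u drawn uniformly from [0.2, 0.8]. Reading the scaled value as
    2p - 1, the implied probability of the correct class varies over [0.6, 0.9].
    """
    base = np.array([-1.0, 1.0]) if true_class == "A" else np.array([1.0, -1.0])
    return rng.uniform(0.2, 0.8) * base
```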
When the computational goal was Infomax, the stochastic units were combined with the data within a single RF in order to provide a fairer comparison with Coherent Infomax. The network was run with the objective function set to Infomax (φ1 = 1) and to the three-way mutual information (φ1 = 0), but in each case the weights were initialized from the same random numbers. This procedure was performed five times, each time using a different seed of the random number generator. The results were as follows. In Infomax mode (completely unsupervised learning) the output probabilities of the patterns saturated and so divided the inputs into two groups. Comparing these results with the knowledge about the true classes, the number of misclassification errors was 7, 7, 8, 8, 8 on the five runs, and the average misclassification rate was 24.5 percent. On the other hand, when contextual guidance was applied, using the three-way mutual information as the objective function, the same five patterns were misclassified on all runs, giving an average misclassification rate of 16.1 percent. Hence this experiment demonstrates that the provision of additional incomplete information about the appropriate classification of the RF input patterns, in the form of stochastic supervision via contextual guidance, can provide a more appropriate grouping of the RF input patterns. Furthermore, this may be achieved in practice using the methodology introduced in this article.

5.3 Conclusions. These studies show that local processors can discover functions of the RF inputs that are predictably related to the context in which they occur and can use those predictions to guide processing without confusing them with the information transmitted about the RF. They also demonstrate that the methodology introduced here possesses the required computational capabilities. The methodology has recently been extended to deal with the incorporation of multiple layers and local processors having multiple output units. The capabilities of these more complex networks are currently under investigation and will be reported.

Acknowledgments

The contribution of W. A. Phillips to this work was funded in part by a Human Capital and Mobility Network grant from the European Community, contract number CHRX-CT93-0097.

Appendix

For further details of these derivations, see Kay (1994). It is convenient to rewrite the objective function (see equation 3.2) in the following form,

$$F = H(X) - \psi_1 H(X \mid R) - \psi_2 H(X \mid C) - \psi_3 H(X \mid R, C), \tag{A.1}$$
where ψ1 = 1 − φ2, ψ2 = 1 − φ1, and ψ3 = φ1 + φ2 − φ3 − 1. (These relations follow from equation 3.2 on writing I(X; R; C) = I(X; R) − I(X; R|C), I(X; R) = H(X) − H(X|R), I(X; R|C) = H(X|C) − H(X|R, C), and I(X; C|R) = H(X|R) − H(X|R, C), and collecting the entropy terms.) Biases are accommodated by the usual practice of introducing additional inputs clamped at −1. We require the partial derivatives of F taken with respect to w and v. Recall from equation 1.1 that the output unit is bipolar and that the conditional probability that the output is +1 is given by a logistic nonlinearity. It follows that the conditional output entropy is given by

$$H(X \mid R, C) = -\langle p \log p + (1 - p)\log(1 - p) \rangle_{R,C}, \tag{A.2}$$
where the output probability $p = 1/(1 + \exp(-A(s_r, s_c)))$ and the activation function A is that defined in equation 2.1. The notation $\langle \cdots \rangle_{R,C}$ denotes the operation of taking the average with respect to the joint distribution of R and C. It follows that

$$\frac{\partial H(X \mid R, C)}{\partial \mathbf{w}} = -\left\langle \left(\log \frac{p}{1 - p}\right) p(1 - p)\, \frac{\partial A}{\partial s_r}\, \mathbf{R} \right\rangle_{R,C} \tag{A.3}$$

$$\frac{\partial H(X \mid R, C)}{\partial \mathbf{v}} = -\left\langle \left(\log \frac{p}{1 - p}\right) p(1 - p)\, \frac{\partial A}{\partial s_c}\, \mathbf{C} \right\rangle_{R,C} \tag{A.4}$$
In order to deal with the other entropy terms in equation A.1, we require expressions for the marginal probability that X = 1 as well as the conditional probabilities Pr(X = 1 | R = r) and Pr(X = 1 | C = c). Given the simplicity of the binary output case, these are the averages of p taken, respectively, over the joint distribution of R and C, the conditional distribution of C given that R = r, and the conditional distribution of R given that C = c. These terms are denoted, respectively, by E, ER, and EC. Now using result A.2, applying differentiation to yield results similar to equations A.3 and A.4, and collecting terms gives equations 4.1 and 4.2.

References

Aitchison, J., & Dunsmore, I. R. (1975). Statistical prediction analysis. Cambridge: Cambridge University Press.
Artola, A., Brocher, S., & Singer, W. (1990). Different voltage-dependent thresholds for the induction of long-term depression and long-term potentiation in slices of the rat visual cortex. Nature (London), 347, 69–72.
Atick, J. J., & Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Comp., 5, 45–60.
Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Becker, S., & Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London), 355, 161–163.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuronal selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Brown, J. J., Fraser, R., Lever, A. F., & Robertson, J. I. S. (1971). Abstracts of World Medicine, 45, 549–644.
Eckhorn, R., Reitboeck, H. J., Arndt, M., & Dicke, P. (1990). Feature linking among distributed assemblies: Simulations and results from cat visual cortex. Neural Comp., 2, 293–306.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15, 218–226.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Hancock, P. J. B., Smith, L. S., & Phillips, W. A. (1991). Neural Comp., 3, 201–212.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11, 1800–1809.
Hirsch, J. A., & Gilbert, C. D. (1993). Long-term changes in synaptic strength along specific intrinsic pathways in the cat visual cortex. J. Physiol., 461, 247–262.
Hotelling, H. (1936). Relation between two sets of variates. Biometrika, 28, 321–377.
Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.
Kay, J. (1992). Feature discovery under contextual supervision using mutual information. In Proceedings of the 1992 International Joint Conference on Neural Networks (Baltimore) (Vol. 4, pp. 79–84).
Kay, J. (1994). Information-theoretic neural networks for the contextual guidance of learning and processing: Mathematical and statistical considerations (Tech. Rep.). Aberdeen, UK: Biomathematics and Statistics Scotland, Macaulay Land Use Research Institute.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Schmidhuber, J., & Prelinger, D. (1993). Discovering predictable classifications. Neural Comp., 5, 625–635.
Singer, W. (1990). Search for coherence: A basic principle of cortical self-organization. Concepts in Neuroscience, 1, 1–26.
Received April 21, 1994; accepted August 2, 1996.
Communicated by Peter Dayan
Marr's Theory of the Neocortex as a Self-Organizing Neural Network

David Willshaw
Centre for Cognitive Science, University of Edinburgh, Scotland, UK
John Hallam
Department of Artificial Intelligence, University of Edinburgh, Scotland, UK

Sarah Gingell
Centre for Cognitive Science, University of Edinburgh, Edinburgh, Scotland, UK

Soo Leng Lau
Department of Artificial Intelligence, University of Edinburgh, Edinburgh, Scotland, UK
Marr’s proposal for the functioning of the neocortex (Marr, 1970) is the least known of his various theories for specific neural circuitries. He suggested that the neocortex learns by self-organization to extract the structure from the patterns of activity incident upon it. He proposed a feedforward neural network in which the connections to the output cells (identified with the pyramidal cells of the neocortex) are modified by a mechanism of competitive learning. It was intended that each output cell comes to be selective for the input patterns from a different class and is able to respond to new patterns from the same class that have not been seen before. The learning rule that Marr proposed was underspecified, but a logical extension of the basic idea results in a synaptic learning rule in which the total amount of synaptic strength of the connections from each input (“presynaptic”) cell is kept at a constant level. In contrast, conventional competitive learning involves rules of the “postsynaptic” type. The network learns by exploiting the structure that Marr assumed to exist within the ensemble of input patterns. For this case, analysis is possible that extends that carried out by Marr, which was restricted to the binary classification task. This analysis is presented here, together with results from computer simulations of different types of competitive learning mechanisms. The presynaptic mechanism is best known in the computational neuroscience literature. In neural network applications, it may be a more suitable mechanism of competitive learning than those normally considered.
Neural Computation 9, 911–936 (1997)
© 1997 Massachusetts Institute of Technology
1 Introduction

In the early 1970s, two interlinked theories for the role of the neocortex and the hippocampus in memory storage and retrieval were proposed by Marr (1970, 1971). He suggested that the hippocampus is the site for temporary storage of the raw information received by the brain, over a time scale of typically one day. Periodically this information is transferred to the neocortex, where it undergoes refinement before being stored there in a permanent form. As well as being among the first theories linking the function of these two structures with their underlying anatomy, Marr's ideas contain several proposals for training neural networks that remain relevant. However, most of his computational neuroscience work has not been in the mainstream of either brain theory or the theory of neural network devices. Previous work (Willshaw & Buckingham, 1990) explicated and extended his theory of the hippocampus (Marr, 1971). In this article we examine and extend his theory of the neocortex, which was concerned with how the cells in the neocortex learn to self-organize their responses to the structured information presented to it (Marr, 1970). Marr proposed that the pyramidal cells of the neocortex extract the structure contained in the patterns of activity impinging on the sense organs by being trained to identify the natural groupings, or classes, into which these patterns fall. For this task, he proposed a feedforward neural network with three layers of cells: input cells, which project to codon cells, which in turn project to an output layer of pyramidal cells. The codon cells are intended to extract the salient features from the input patterns, each active codon cell signaling the presence of a particular feature in an input pattern. Training the network involves presenting a succession of input patterns. At each presentation, the strengths of the connections between codon cells and output cells are modified according to a learning rule that enables the output cell responses to self-organize. Ultimately each output cell comes to be identified with one of the classes of input patterns. Once training is complete, presentation of a new input pattern (which may not have been seen before) from one of the classes causes the output cell that has become identified with that class to respond. How the codon cells are wired up to the input cells so that each responds to the presence of a different feature in an input pattern depends on the precise structure of the input information. Marr assumed that the input-codon wiring is carried out during a pretraining phase. Once the wiring is set up, it remains fixed, and it is the contacts between codon cells and output cells that are modified during training. The codon layer is found to be essential for several reasons. The number of possible input patterns is assumed to be so large that it is unlikely that any input pattern will be presented more than once. The codon representation allows measures of similarity between patterns to be calculated that can then be used to assign a
new, unknown pattern to the correct class. Furthermore, the codon layer can be used to infer the presence of features that are missing when incomplete patterns are presented. In addition, the recoding performed by the codon cells makes the individual classes of input patterns more distinguishable than without a recoding. Marr proposed a self-organizing synaptic learning rule, which causes the weight of each synapse between a codon cell and an output cell to come to record the probability that if the pattern presented contains the feature associated with the codon cell, then the pattern is of the class represented by the output cell. Thereafter, on presentation of an input pattern, the mean strength of the synapses measured over the active codon cells is the best estimate (in information theory terms) of the probability that the pattern belongs to the class identified with the output cell. The value of this quantity determines whether the output cell should fire.

1.1 Scope of This Article. Marr addressed separately the problems of how to set the fixed input-to-codon connections and the variable codon-to-output connections. Here we present our analysis of the second problem. We first show that for the sole case that he considered, that of a single output cell learning to make a binary classification, his proposed scheme has several shortcomings. His specification of the learning rule is unsatisfactory, and some of the neural wiring that he introduces to justify his scheme is both unjustifiable and unnecessary. We describe our proposed solutions to these difficulties. We then describe our extension of the basic scheme, to deal with more complicated classification tasks and to the case of competitive learning among a number of output cells. Preliminary results are contained in Lau (1992) and Gingell (1994). The problem of codon formation will be dealt with in a subsequent publication.

2 Formalism

Marr supposed that there is a potentially infinite number of input patterns, or events, which are grouped into a number of different classes. He used as an illustration of a class the set of events that represent various types of poodle. Each single event that represents a different occurrence of a poodle contains certain, but not all, of the features of the poodle class. The assignment of features to events in the same class is highly structured. Two events that represent different poodles have many features in common, many more than the number of features shared by an event representing a poodle with one that does not. Marr referred to such clusters of events over the set of input fibers as mountains. The simplest problem that can be considered is where the network has a single output cell and it has to learn to distinguish between the inputs from two mountains, occurring equiprobably. The result of successful training would be that the output cell would respond to all the members from one mountain and to none from the other. Marr called
this the spatial recognizer effect because the success of the training regime depends on the fact that, within the space of all events, events in the same class are, on average, closer together (i.e., share more features) than events from different classes. In order to be able to attack the spatial recognizer effect on its own, Marr assumed an idealized input-codon transformation, and he supposed that the input patterns define the following distribution of activity in the codon layer, containing N cells (see Figure 1). Each input event is defined by activity in a selection of K out of M cells in the codon layer. K is supposed to be very much smaller than M so that the number of possible input patterns, $\binom{M}{K}$, is very large. There are two mountains, and they are distinguishable from one another because the K active codon cells for an event from mountain 1 are chosen from the first M cells in the codon layer, labeled from 1 to M, and the K cells of mountain 2 are chosen from the last M cells, labeled from N − M + 1 to N. M is greater than N − M + 1, and so some codon cells are in both mountains. This was intended to be a simple way of defining structure within potentially large ensembles of events.

3 The Spatial Recognizer Effect

Designating the binary-valued activity of codon cell i as Ci and the strength of its synapse with the output cell as Wi, the output cell is supposed to fire if the sum $(1/K)\sum_i C_i W_i$ exceeds some threshold value, Θ. The problem is to find values for the weights and the threshold such that the output cell responds to events from one mountain only. If each event belongs to one mountain only, the problem is soluble in principle because the two ensembles of events are linearly separable. However, the solution is not straightforward, for two reasons: each event that occurs may never reoccur, and so it is difficult to collect statistics about which event belongs to which class; and in pattern classification terms, these data are supplied unlabeled, in that information about the identity of the mountain to which each event belongs is not available explicitly. This prevents standard supervised training algorithms, such as the Perceptron (Minsky & Papert, 1969), from being used.

3.1 Collecting the Evidence. If the classes into which the events fall can be identified, Marr points out that a common method for assigning a newly presented event E to one of the classes Ωj is to choose the Ωj that maximizes the probability P(E|Ωj) of E belonging to one of the classes {Ωj}. This is the maximum likelihood solution (Kingman & Taylor, 1966), and it works best when the Ωj can be regarded as random variables and the conditional probabilities P(E|Ωj) are known. However, here an individual event E may occur only once, and so the appropriate values of P(E|Ωj) may not be available.
Figure 1: How N codon cells are assigned to mountains, each containing M cells. (A) Two-mountain case. (B) Four-mountain case with cyclical boundary conditions.
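Anticipating the general circular construction detailed in section 5, the assignment of codon cells to mountains and the sampling of events can be sketched as follows. This is our code; the names and example numbers are illustrative.

```python
import numpy as np

def mountain_codons(q, N, M, Q):
    """Indices of the M codon cells for mountain q (0-based) on a circle of N
    cells: a block of M consecutive cells starting at q * N / Q, with cyclical
    boundary conditions as in Figure 1B."""
    return (q * N // Q + np.arange(M)) % N

def sample_event(q, N, M, Q, K, rng):
    """A binary codon vector for one event of mountain q: K of its M cells active."""
    C = np.zeros(N)
    C[rng.choice(mountain_codons(q, N, M, Q), size=K, replace=False)] = 1.0
    return C

# Example with illustrative numbers (those used in the simulations of section 5):
rng = np.random.default_rng(0)
event = sample_event(0, N=90, M=54, Q=5, K=40, rng=rng)
```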
Because of the structure in the data, the probability that a hitherto unknown event E belongs to a particular class will depend on its similarity with other events that belong to that class. This requires a computation that is different from that used in maximum likelihood estimation. Similarity between events is defined in terms of the number of features that they have in common. Marr suggested that the purpose of training the network is to record the values of P(Ωj |Ci = 1), the probability that the event that contains feature i is of class j, in the weight between the codon cell identified with feature i and the output cell identified with class j. When a new event E is presented for classification, certain codon cells become active, giving K estimates P(Ω|Ci = 1) (written as P(Ω|Ci)) that the event is of type Ω. In Marr's terminology, this is the evidence that the event is of type Ω, and the problem is how to combine these K estimates. He uses an information theory argument to show that using any combination other than their arithmetic mean increases the information inefficiency (Kullback-Leibler distance) between the true and estimated probability distributions (Sibson, 1969). Therefore, the best estimate of the probability that the presented event E is of class Ω is

$$\sum_{i=1}^{N} C_i P(\Omega \mid C_i) \Bigg/ \sum_{i=1}^{N} C_i,$$

where $\sum_{i=1}^{N} C_i$ is the number of estimates to be combined. Another way of looking at this is that each estimate P(Ω|Ci) records the probability that an event, drawn from the set of events that activates codon cell i, belongs to Ω. If an event E is in the set of events that activates exactly those codons concurrently, the best estimate of the probability of that event being in Ω is the minimal entropy combination of the various estimates of P(Ω|Ci). The expression Marr derived thus represents the probability that an event from the set that jointly activates the codons concerned is in Ω. He writes this probability, perhaps confusingly, as P(Ω|E). In effect, Marr is attempting to estimate the probability P(Ω|C1, C2, C3, . . .) from the individual probabilities P(Ω|C1), P(Ω|C2), P(Ω|C3), . . . . This is an underconstrained problem, and his use of information theory arguments could be regarded as a way of imposing extra constraints. Other possible formulations of this estimation problem are discussed in section 8.

3.2 Labeling the Data. These calculations can be carried out only if output cells have been assigned to classes so that it is known which output cell is to be activated by any given input pattern. Marr proposed that this output cell selection problem can be solved by the anatomy of the cortex; for example, for the two-mountain case, one of the codon cells that is active for patterns from just one of the two mountains should be able to drive the output cell directly (in a similar way to a climbing fiber driving a cerebellar Purkinje cell; Marr, 1969). In this way, the output cell would be biased from the start to respond to a proportion K/M of the events from this mountain.
Marr speculated that the Martinotti cells of the cerebral cortex carry out this function.

3.3 Making the Calculation. If the weight of synapse i holds the value P(Ω|Ci), a measure of P(Ω|E) can be obtained by summing up the weights on the active codon cells and dividing by the number of active input cells, K. The output cell is then placed in the active state if P(Ω|E) exceeds some threshold value Θ, which represents the minimum convincing evidence for E being in Ω; or when

$$\frac{1}{K}\sum_{i=1}^{N} P(\Omega \mid C_i) > \Theta.$$
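In code, the decision rule is simply the following (a sketch; the names are ours):

```python
import numpy as np

def fires(W, C, theta):
    """Marr's decision rule: the output cell fires when the mean synaptic weight
    over the active codon cells exceeds the threshold theta."""
    return W[C > 0].mean() > theta
```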
Given that output cells have been assigned to classes, to realize this procedure it is required to (1) store the conditional probabilities P(Ω|Ci) in the weights of the codon-output cell synapses, (2) calculate the number, K, of active codon cells, and (3) set the threshold on the output cell. To store conditional probabilities, it is required to count the number of occasions, N1, on which codon cell i is active and the number of occasions, N2, on which it and the output cell representing the particular class are jointly active so that the ratio N2/N1 can be formed. Marr suggested (without giving a method) that these numbers can be computed incrementally. He suggested that the threshold should be set so that the output cell fires the correct proportion of times. For the two-mountain case, he showed analytically that the use of these procedures results in the output cell learning to fire in response to all the events belonging to the mountain specified by the climbing fiber cell and to none of the events from the other mountain.

4 Preliminary Investigations of Marr's Scheme

4.1 Batch/Serial Update, Climbing Fiber Present/Absent. In preliminary studies, we looked at two possibilities for updating the values of N1 and N2: after presentation of each pattern (serial update) or after a specified number of events had been presented for classification (batch update). Two output selection strategies were examined. Following Marr, we looked at the case where one of the codon cells acted as a climbing fiber. Initially this codon cell had a contact of unit strength to the output cell, and the values of all other synapses were randomly chosen (climbing fiber present). We compared this with climbing fiber absent, where the initial values of all synapses were chosen at random, implying that the mountain to be identified with the output cell would also be chosen at random. This defines four different learning procedures. The threshold was set so that the output cell fired with the correct frequency. It was set either directly (for batch update) or by using a rule for adjusting the threshold incrementally after each pattern presentation (for serial update).
We carried out simulation runs for the binary classification task. Of the four different cases, the combination of batch update, climbing fiber present was best and serial update, climbing fiber absent was worst. This is unfortunate because it is difficult to imagine that patterns will be presented in batches to the neocortex and that the numbers N1 and N2 are counted separately during training, the ratio N2/N1 then being calculated once training is complete. In addition, climbing fiber present is possible only when the codon cell driving the climbing fiber is unique to just one mountain, which in general cannot be so. We now show how a computationally efficient version of the more biologically plausible method of serial update, climbing fiber absent can be made to function.

4.2 Serial Update, Climbing Fiber Absent. We developed a method for calculating conditional probabilities incrementally that was suggested by Minsky and Papert (1969). They showed that by employing this scheme, the value of the weight between an input cell and an output cell converges to the proportion of times that the output cell also fires when the input cell fires, which is the measure of probability required. Let the codon cell i and the single output cell have binary-valued activities Ci and O, respectively. The rules for changing the weights and the threshold are

$$\Delta W_i = \alpha(-W_i + W_{\max} O) C_i, \tag{4.1}$$

$$\Delta\Theta = \alpha(-\Theta + \Theta_{\max} O), \tag{4.2}$$

where α is a learning constant and Wmax, Θmax set the maximum values of Wi, Θ. The threshold Θ is treated as a weight from an input unit that is always active. The weight update rule has similarities with other schemes (Rumelhart & Zipser, 1985; Grossberg, 1976), as discussed in section 7. The values of α, Wmax, and Θmax have to be specified. Since the weights are intended to record conditional probabilities, it would seem logical to set the value of Wmax to 1. The value of α determines the speed of the process and can be set empirically. Preliminary experiments (Gingell, 1994) showed that using these rules for updating the weights and the threshold, the two-mountain problem can be learned easily by the system. However, performance depended on the value of Θmax. Setting it at a low value resulted in the output cell responding to either mountain; with a high value of Θmax, it learned to respond to neither. In order to investigate this situation more thoroughly and to provide analysis of the more general case of many mountains, we developed the following formalism.
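Before turning to that analysis, the serial-update rule itself can be sketched in code (ours; the parameter values are illustrative):

```python
import numpy as np

def train_step(W, theta, C, alpha=0.01, w_max=1.0, theta_max=1.6):
    """One serial update (equations 4.1 and 4.2), climbing fiber absent.

    C is the binary codon-activity vector. Over many presentations, W_i comes to
    record the proportion of occasions on which the output cell fires when codon
    cell i fires (Minsky & Papert, 1969). theta_max = 1.6 is illustrative, inside
    the single-mountain range identified in section 5.
    """
    K = C.sum()
    O = 1.0 if (W @ C) / K > theta else 0.0       # firing decision
    W += alpha * (-W + w_max * O) * C             # equation 4.1
    theta += alpha * (-theta + theta_max * O)     # equation 4.2
    return W, theta, O
```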
5 Analysis of the General Case

There are N codon cells and Q mountains. The events belonging to any particular mountain can cause activity in any of M particular codon cells with equal probability. Within a mountain, the events occur with equal probability. To determine which codon cells are identified with which mountains, imagine that the N codon cells are arranged around a circle. Mountain 1 is identified with codon cells 1, 2, 3, . . . , M, mountain 2 with cells 1 + N/Q, 2 + N/Q, 3 + N/Q, . . . , M + N/Q, and so on. In this way the codon cells for each mountain are arranged regularly around the circle, the block of codon cells for one mountain being shifted along the circle by N/Q codon cells to give the block for the next mountain (see Figure 1). Since each of the Q mountains is identified with M codon cells, a single codon cell will be active for the events from R different mountains, where

$$R = MQ/N. \tag{5.1}$$

The other important parameter is K, the number of codon cells active per input event, which is assumed constant. It is useful to introduce the parameter ε by the relationship

$$K = \frac{(R - 1)N}{Q}(1 + \varepsilon). \tag{5.2}$$

ε is positive and measures the amount by which the parameter K exceeds the minimum value it must have in order for each input pattern to be unique to a mountain. By investigating the conditions under which the events from certain mountains will cause the output cell to fire and those from other mountains prevent it from firing, we show how the value of Θmax determines the number of mountains that activate the output cell.

5.1 A Single Mountain. The analysis establishes the conditions under which events from a certain number of different mountains can activate the output cell once each weight value has come to record the proportion of occasions on which the output cell fires when the appropriate codon cell fires (Minsky & Papert, 1969). Since each codon cell is identified with R mountains occurring equiprobably, if a single mountain causes the output cell to fire, each weight from codon cells in that mountain will have value 1/R, and all other weights will have value 0. Similarly, if only one of the Q mountains activates the output cell, the threshold will have a value Θmax/Q. Therefore, for events from the chosen mountain to fire the output cell,

$$\frac{1}{R} > \frac{\Theta_{\max}}{Q}.$$
Of the other mountains, the one most likely to fire the output cell is the one that shares most codon cells with the given mountain. To prevent events from this other mountain from firing the output cell,

$$(R - 1)\frac{N}{Q}\,\frac{1}{KR} + \left(1 - (R - 1)\frac{N}{QK}\right) \cdot 0 \le \frac{\Theta_{\max}}{Q},$$

$$K\Theta_{\max} \ge \left(\frac{R - 1}{R}\right) N.$$

Putting these two inequalities together and expressing K in terms of ε yields the result

$$\frac{Q}{R(1 + \varepsilon)} \le \Theta_{\max} < \frac{Q}{R}. \tag{5.3}$$
In other words, for small ε, setting Θmax at slightly less than the ratio Q/R will ensure that the events from a single mountain, and only those events, activate the output cell.

5.2 Limits on Parameter Values. The number of mountains, R, in which one codon cell is involved must necessarily be less than the number of mountains to be distinguished, Q. As the value at which the free parameter Θmax must be set varies with 1/R (see equation 5.3), for large R the value of Θmax that enables just one mountain to activate the output cell will not differ much from the value enabling two mountains to excite the output cell (see section 5.3). In any real system, there will be a limit on the maximum value of R (and thus the maximum value of Q), which will depend on the precision to which Θmax can be set. The value of K, the number of codons active per input pattern, must be less than M, the number of codon cells per mountain. It must also exceed a lower limit, to ensure that each event belongs to just one mountain. K can vary within a range of N/Q since

$$M - \frac{N}{Q} < K < M.$$

Again, for large Q (and R) there is limited freedom in the choice of K. Figure 2 shows the results of some illustrative computer simulations for 90 codon cells and Q = 5 mountains. In both training and testing phases, the patterns were chosen from the $Q\binom{M}{K}$ possible patterns. Each run was repeated 100 times from the same initial sets of weights but with different training and testing sequences. The value of Θmax was varied over the range of values for which it is predicted that just a few mountains can excite the output cell. The range over which one mountain can excite the output cell is between 1.55 and 1.65, which is consistent with that predicted from the theory.
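As an arithmetic check (our addition), equation 5.3 can be evaluated for the parameters used in Figure 2; the observed single-mountain range of 1.55 to 1.65 falls inside the predicted window.

```python
N, Q, R, K = 90, 5, 3, 40                 # parameters of Figure 2
M = R * N // Q                            # 54 codon cells per mountain, from R = MQ/N
eps = K * Q / ((R - 1) * N) - 1.0         # epsilon from equation 5.2; here 1/9
lower = Q / (R * (1.0 + eps))             # 1.5
upper = Q / R                             # about 1.667
print(lower, upper)                       # the observed range 1.55-1.65 lies within
```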
Figure 2: A single-output network that was trained on five mountains. N = 90, R = 3, Q = 5, K = 40, α = 0.01. Θmax was varied between 1.0 and 1.75 in steps of 0.05. The initial values of the weights and the threshold, Θ, were chosen at random from the uniform distributions on the intervals [0, 1] and [0, Θmax], respectively. Training and testing each involved the presentation of 1000 different input patterns, repeated 100 times from the same initial weight values but with different random selections of input patterns. At the end of each test run, the mountains were renumbered according to the percentage of times the test events from the various mountains excited the output cell, the most active mountain being renumbered 1. The bar charts show the means and standard deviations of these percentage values for each mountain, calculated over the 100 runs. The three numbers above each plot give the value of Θmax and the mean and standard deviation of the actual value of Θ.
5.3 More Than One Mountain. The conditions under which the output cell can respond to more than one mountain are more complicated because there are edge effects to summate. It turns out that in the limit of small ε, the output cell responds stably to q particular mountains with Θmax bounded from above by

$$\Theta_{\max} < \frac{Q}{R}\left(1 - \frac{q - 1}{2(R - 1)}\right).$$

The larger the value of Θmax, the fewer mountains will excite the output cell. Note that for q = 1, this expression reduces to Θmax < Q/R. For the results shown in Figure 3, there were larger numbers of codon cells (N), mountains (Q), and mountains in which an individual codon cell is involved (R). The dependence of the number of mountains exciting the output cell on the value of Θmax can be seen clearly.

5.4 Learning Rule Dynamics. The analysis presented so far provides necessary conditions under which desirable stable states exist but not sufficient conditions that such states are reachable. To answer this question, an analysis of the dynamic behavior of the learning rule is needed. Equations 4.1 and 4.2 define a recursive stochastic estimation procedure for calculating the weights and threshold. Such systems are studied by control engineers, for whom questions of convergence are important. Using the results of Ljung (1975), it is possible to show that, provided that the Heaviside transfer function of the output cell is approximated by a steep sigmoid, the learning rule's dynamic behavior is equivalent to that of a certain continuous nonlinear dynamical system. The values of weights and threshold reachable by the learning rule are the limit points (if any) of that dynamical system, with probability 1. Further analysis of this equivalent dynamical system, involving demonstrating, for example, the applicability of the Cohen-Grossberg theorem (Cohen & Grossberg, 1983) to this case, is underway.

6 More Than One Output Cell: Competitive Learning

Marr only ever considered the case of a single output cell, although he did discuss in general terms the need to have several output cells that could learn to distinguish several mountains. This has been discussed by a number of authors under the banner of competitive learning (von der Malsburg, 1973; Grossberg, 1976; Rumelhart & Zipser, 1985).

6.1 Single-Winner Competitive Learning. In one simple case, discussed in particular by Rumelhart and Zipser (1985), each input pattern is to produce activity in a single output cell. In this way, different output cells learn to respond to the different types of input pattern. We considered the case where there are as many output cells as mountains, the intention being that
Figure 3: A single-output network trained on 10 mountains. N = 250, Q = 10, R = 5, K = 110. Θmax was varied from 0.6 to 2.1 in steps of 0.1. All other parameter values and labeling conventions are as in Figure 2.
each output cell would specialize onto the patterns from a different mountain. The activations {Aj : j = 1, 2, 3, . . . , N} in each of the Q output cells that result from presentation of an input pattern are calculated as Aj = (1/K)6i Wij Ci − 2j . The cell with the largest activation is then chosen to be active. In Figure 4, competitive learning is illustrated for 90 codon cells responding to patterns from Q = 5 mountains. This is the multiple output cell version of the single-cell system illustrated in Figure 2. The value of 2max has to be set appropriately because low values lead to some output cells responding to patterns from more than one mountain. The competitive learning scheme used by Rumelhart and Zipser (1985), among others, is very similar, with the following weight and threshold update rules, given here for a single output cell: 1Wi = α(−KWi + Wmax Ci )O 12 = α(−2 + 2max ).
(6.1) (6.2)
As before, α and 2max are adjustable parameters, and Wmax is set at 1. The two sets of competitive learning rules (equations 4.1, 4.2, 6.1, and 6.2) differ in having the roles of Ci and O interchanged. This scheme was applied to the problem used for Figure 4, using exactly the same set of parameters, as shown in Figure 5. 6.2 Distributed Outputs. Another possible use of a multiple output cell system is that all events from the same class of input pattern are to produce the same pattern of activity, distributed over the cells of the output layer. This
Figure 4: Facing page. Training a competitive network using the rule developed in this article. All parameter values as for Figure 2 except that there are Q = 5 output neurons and each network was trained for 2000 steps per run. To form each plot, after each run the percentage of times that patterns from a given mountain excited a given output cell was recorded in the appropriate square of a Q × Q, mountain × output cell, matrix. After the 100 different runs made for each value of 2max , the columns of each matrix were permuted to place the column containing the highest individual entry in position 1, the column with the next highest entry in position 2, and so on. Rows were then permuted to bring the row containing the highest entry to position 1, the next highest to position 2, and so on. Each plot is the summation of the results in the 100 matrices so formed. The color of each square in a plot indicates the mean activity on a gray scale, with pure white indicating 100 percent. Each horizontal bar drawn in the square is two standard deviations in length, being scaled by the width of each square that represents unity. The value of 2max used is shown above each plot.
926
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
means that patterns from different mountains may excite the same output cell. However, examination of the single runs used to generate Figures 2 and 3 revealed that the mountains that excite the same output cell lie adjacent to one another. This is not surprising since adjacent mountains share many codon cells, and therefore input patterns from adjacent mountains will share many weight values. This property of the network will prevent a welldistributed pattern of activity from being produced among the output cells. We carried out another set of simulations with a single output cell, in which codon cells were assigned to mountains at random rather than according to their position around the circle. The number of codon cells shared between events from mountains now varies randomly, about a mean value of M2 /N. We repeated the runs shown in Figures 2 and 3 and found that in this case too, the number of mountains exciting the output cell can be varied by adjusting the maximum threshold. When more than one mountain excites the output cell, the codon cells of these mountains are no longer correlated. When the system is given multiple output cells, each class of inputs will excite a different population of output cells, now randomly distributed. This procedure carries out more than just a simple recoding of input information since all the input patterns in a given class, each of which excites a different set of codon cells, are mapped onto the same output pattern. The theory and results for this case will be presented in a subsequent publication. 7 Competition and Competitive Learning There is a straightforward relation between the two competitive learning rules discussed that we shall formulate in terms of the fundamentals of competition. In a competitive learning algorithm, there has to be something for which the elements of the network compete (Prestige & Willshaw, 1975). Usually the competition is brought about by maintaining the total amount of synaptic weight per cell at a constant level. A basic distinction between competitive learning algorithms is that in some cases there is conservation of the sum of the weights of the synapses from each input cell; in other cases, the sum of the weights from each output cell is conserved. To clarify this distinction, we use the formulation of van der Malsburg and Willshaw (1981), developed in a neurobiological context. They separated out the activity-dependent change to be applied to the weights from that ensuring conservation of the summed weights. For input activity Ii and output activity Oj at synapse (i, j) with current weight value Wij , the raw weight change has the familiar Hebbian form (Hebb, 1949), δWij = αIi Oj , where α is a small positive constant. The weights are then normalized to keep their sum (calculated over some
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
Mountain number
1
1.05 5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
Mountain number
1.25
1.3
1.35
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
1.4
Mountain number
1.15
5
1.2
1.45
1.5
1.55
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
1.6
Mountain number
1.1
1.65
1.7
1.75
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
2 4 Output cell number
2 4 Output cell number
2 4 Output cell number
927
2 4 Output cell number
Figure 5: Training a competitive network using the learning rule due to Rumelhart and Zipser (1985). All parameter values as for Figure 4.
928
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
as-yet unspecified population of synapses) at a constant level. The new weight Wij∗ is then Wij∗ = (Wij + δWij )
6Wij . 6(Wij + δWij )
Substituting for δWij and defining 1Wij ≡ Wij∗ − Wij , for α ¿ 1, µ ¶ Wij 6Ii Oj . 1Wij = α Ii Oj − 6Wij
(7.1)
7.1 Postsynaptic Sum Rule. In equation 7.1, take the sum over the synapses to each output cell, and maintain this sum at unity. With K active input cells per pattern, 6i Ii Oj = KOj , and therefore 1Wij = α(−KWij + Ii )Oj . This rule was used by Grossberg (1976) in his competitive learning rule, INSTAR. Rumelhart and Zipser (1985) used it in a later paper, with 2max = 0; variable threshold schemes are also known in neurobiological applications (Bienenstock, Cooper, & Munro, 1982). 7.2 Presynaptic Sum Rule. Alternatively, take the sum over the synapses on each input cell, and maintain this sum at unity. Since only one output cell is ever active, with unit magnitude, 6j Ii Oj = Ii , and the presynaptic rule examined in this article is obtained, 1Wij = α(−Wij + Oj )Ii . This update rule is used in the OUTSTAR network due to Grossberg (1976), which can be thought of as INSTAR being run backward, the roles of inputs and outputs being interchanged. In an OUTSTAR network, only one unit is active per input pattern, so that the update rule becomes equivalent to the delta rule (Widrow & Hoff, 1960). Successful operation of the presynaptic rule in competitive learning requires appropriate setting of the parameter 2max . The problem is to make an inactive output unit respond before the values of its weights and its threshold have all shrunk to zero. Suppose that after training, one of the units fails to respond to any event. Then one of the other output units will respond to events from two mountains. As a result, each weight to this unit will have value 2/R and its threshold will be 22max /Q. In order to activate the inactive output unit, j, the sum (1/K)6i Wij Ci − 2j for it must exceed that for the active unit. This condition is 2/R − 2max (2/Q) < 0 2max > Q/R.
(7.2)
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
929
Therefore, provided that 2max exceeds a certain value, successful competitive learning will result. This value is that which in the single output cell case ensures that the cell is activated by the events from a single mountain only (see equation 5.3; compare Figures 2 and 4). 7.3 Comparison of Rules. The postsynaptic sum rule has been cited widely in the literature as the archetypal competitive learning rule. However, as Rumelhart and Zipser (1985), among others, have pointed out, it has a number of shortcomings. The form of this conservation rule ensures that the weights of an output unit are changed independently of the weights to all other output units. This has two important consequences. First, the weights of inactive units never get modified, and so these units do not take part in the competition at all. Rumelhart and Zipser (1985) described two modifications to the rule to deal with this problem—one called “leaky learning” and one involving a threshold placed on each output unit. Second, the algorithm may approach its final state only very slowly. If initially an output unit responds very infrequently, the changes in its weights will be accumulated gradually, and it may be some time before they are enough to make the unit a frequent winner. The result of these two effects is that initially one or two output cells come to respond to all the input patterns (see Figure 7; 250 steps). Dominance then shifts abruptly between cells (500, 750 steps) until a stable configuration results. The development of the mapping according to the presynaptic rule is much smoother (see Figure 6). Intuitively, the presynaptic rule seems more competitive, since at every learning step the weights of all output units are changed. For successful operation, the correct value of the parameter 2max has to be chosen. A different distinction between weight normalization rules has been drawn by Goodhill and Barrow (1994) and Miller and MacKay (1994), who distinguish between subtractive and divisive normalization. Both the presynaptic and postsynaptic rules discussed here are of the divisive type; in a subtractive rule, the value of the weight change 1Wij does not depend on Wij explicitly. The presynaptic-postsynaptic distinction can be used as a basis of comparison between the topology-preserving mapping (Kohonen, 1982) and elastic net (Durbin & Willshaw, 1987) approaches to combinatorial optimization. Kohonen’s method uses a weight update rule of the postsynaptic type; in the elastic net, the sum of all the attractive forces applied between each fixed (input) point and the mobile points on the elastic net is kept constant—a type of presynaptic rule. Both presynaptic and postsynaptic rules are used extensively in models for the development of connections in the nervous system. Both have natural interpretations in terms of the conservation of some type of resource available to the appropriate cells. Which rule is used depends on which is more suitable computationally and which fits better to the experimental results. For the development of the ordered retinotectal projection in
930
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
Mountain number
1; 0.76, 0.73
250; 0.79, 0.59
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
Mountain number
750; 1.32, 0.38
1000; 1.50, 0.28
1250; 0.49, 0.07
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1500; 0.41, 0.03
Mountain number
500; 1.08, 0.51
1750; 0.40, 0.05
2000; 0.40, 0.04
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
Figure 6: Single run of a “presynaptic” competitive network. Snapshots of a single run of the type used to generate Figure 4. 2max = 2.0; all other parameter values as in Figure 4. The figures above each individual plot specify how many steps in the run had elapsed when the snapshot was taken and the mean and standard deviation of the threshold 2 (calculated over the five output units) at that point.
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
Mountain number
1; 0.76, 0.73
250; 1.59, 0.18 5
5
4
4
4
3
3
3
2
2
2
1
1
1
Mountain number
1000; 2.00, 0.00
1250; 2.00, 0.00
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1500; 2.00, 0.00
Mountain number
500; 1.88, 0.07
5
750; 1.98, 0.02
1750; 2.00, 0.00
2000; 2.00, 0.00
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
931
1 2 3 4 5 Output cell number
Figure 7: Single run of a “postsynaptic” competitive network. The run shown in Figure 6 was repeated using the algorithm of Rumelhart and Zipser (1985), with identical parameter values and sequences of patterns chosen during training and testing.
932
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
vertebrates, rules of both types have been used in separate models (Willshaw & von der Malsburg, 1976, 1979). The model due to Swindale (1980) on the development of ocular dominance stripes has a fixed number of receptors per postsynaptic cell, whereas in the models due to Goodhill and Willshaw (1990) and to Miller, Keller, and Stryker (1989), the total amount of presynaptic strength is conserved. For the elimination of synapses in the developing neuromuscular system, there are models incorporating either a presynaptic sum rule (Willshaw, 1981) or a postsynaptic rule (Gouz´e, Lasry, & Changeux, 1983). Recent theoretical work in this neurobiological application has emphasized the need, on biological grounds, for both rules to act simultaneously (Bennett & Robinson, 1989; Rasmussen & Willshaw, 1993). 8 Discussion Marr’s algorithm is a procedure that learns the discriminations among input patterns rather than one that learns to classify; it computes Probability(Output|Input) rather than Probability(Input|Output). Viewed as a pattern recognition algorithm, it is distinctive in that it works with unlabeled data (i.e., the class membership of each input pattern is unknown), and information about the absence of features in the presented pattern is not exploited. 8.1 Labeled versus Unlabeled Data. As an algorithm working on unlabeled data, it is of the unsupervised type. Because of its close relation to the “postsynaptic” algorithm due to Rumelhart and Zipser (1985), the statistical method that it most resembles is the K-means clustering algorithm (Moody & Darken, 1989). If the data were labeled, maximum likelihood schemes could be used, as discussed in section 3.1. Minsky and Papert (1969) showed that it is possible to construct a single-layer neural network that computes maximum likelihood. Given statistical independence among codon cell activities, it can be shown that computing maximum likelihood corresponds to carrying out the linear decision policy of the type that underlies many computations in neural networks. It turns out that the weight values are functions of the probability that the output is active when a codon cell is active as well as the probability that an output is active when a codon cell is inactive, which the presynaptic rule developed here does not exploit. The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is applicable to both labeled and unlabeled data. It is worth noting that K-means can be regarded as a type of EM optimization of a Gaussian mixture model (Bishop, 1995). 8.2 Other Approaches. Marr concentrates on the problem of estimating the probability P(Output = 1|E) that the output cell should be active on presentation of a pattern of input activity representing the event E. Within this framework of estimation, where information about the states
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
933
of individual input cells is available, there are four pieces of information that could be used: P(Output = 1|Input = 1), P(Output = 1|Input = 0), P(Output = 0|Input = 1), and P(Output = 0|Input = 0). In Marr’s scheme, only the first of these pieces of information is recorded. One reason for doing this may have been that if each input cell is intended to represent a neuron, recording additional information would imply that a neuron has more than one information-bearing state. One can construct more complex neural models by, for example, duplicating input units so that one member of the pair would look after the active state of an input unit and the other member the inactive state. Willshaw (1972) did this in his Inductive Net, and there are interesting mathematical consequences. However, analysis of this radically different type of system would take us too far away from the focus of this article. Even if a scheme is adopted that uses information about these four quantities, the value of P(Output = 1|E) is still not fully determined. In the broader context, this problem could be solved by supposing that the information about the relative frequencies of occurrence of combinations of active input cells can be exploited (Minsky & Papert, 1969). Marr may have been thinking of this possibility because in his complete model, the codon cells are intended to act as hidden units that respond to the salient features of the input patterns rather than to the raw input data. However, his analysis was restricted to the simpler model without hidden units. 8.3 Structured Input Spaces: Mountains. Marr looked at one particular way of imposing structure within a set of input patterns, to define a set of “mountains.” He examined the special case of designing a network that can be trained to distinguish between the patterns from two mountains only, where his proposed scheme of using the equivalent of cerebellar climbing fibers to drive particular output cells does work. However, this is a special and somewhat trivial case, and our analysis generalizes to the cases of more than two mountains, where each codon cell is involved in more than two mountains, and more than one output cell. There are many ways of imposing structure on the input patterns. We have restricted ourselves to looking at Marr’s original prescription and the most straightforward generalizations of it. 9 Conclusions Our most important result is that a neurobiologically plausible implementation of the spatial recognizer effect due to David Marr has been found and can be made to function extremely reliably. The learning rule is plausible for two reasons: It is specified in terms of simple rules for the adjustment of weights on an incremental basis using only information that is locally available; and for its operation, it does not presuppose the existence of intricate circuitry, such as the equivalent of cerebellar climbing fibers, for which there
934
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
is no real evidence. Our analysis confirms our simulation results that the number of mountains that will fire the output cell can be set by adjusting the maximum value of the threshold. The more mountains that are required, the lower the maximum threshold must be. The competitive learning algorithm that we developed for the case of multiple output cells has strong relationships with existing competitive learning rules and is better known in computational neuroscience modeling circles than it is in the field of neural networks. In a forthcoming publication we shall examine more closely the relationship between presynaptic and postsynaptic competitive learning rules, and we shall also show how the learning algorithms that we have developed can be applied to classification tasks where other types of structure within ensembles of input patterns are considered. Two cases of interest are when the number of input units active per event, K, varies over the ensemble of events (mentioned briefly in section 6.2) and when the events from the different mountains can occur with different probabilities.
Acknowledgments We thank Peter Dayan, Martin Simmen, and And Turken for very helpful discussions and our referees for their critical comments. Financial support is acknowledged to the MRC (Programme Grant to D.W.) and to SERC (M.Sc. studentship for S.G.). Facilities were provided by the University of Edinburgh.
References Bennett, M. R., & Robinson, J. (1989). Growth and elimination of nerve terminals at synaptic sites during polyneuronal innervation of muscle cells: A trophic hypothesis. Proceedings of the Royal Society of London B, 235, 299–320. Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48. Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press. Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, 13, 815–826. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
935
Gingell, S. (1994). Extension of Lau’s 1992 simulation of Marr’s theory of the neocortex. Unpublished master’s thesis, University of Edinburgh. Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Computation, 6, 255–269. Goodhill, G. J., & Willshaw, D. J. (1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network, 1, 41–59. Gouz´e, J. L., Lasry, J. M., & Changeux, J. P. (1983). Selective stabilization of muscle innervation during development: A mathematical model. Biological Cybernetics, 46, 207–215. Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134. Hebb, D. (1949). The organization of behavior. New York: Wiley. Kingman, J. F. C., & Taylor, S. J. (1966). Introduction to measure and probability. Cambridge: Cambridge University Press. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Lau, S. L. (1992). Analysis of Marr’s model of the neocortex. Unpublished master’s thesis, University of Edinburgh. Ljung, L. (1975). Theorems for the asymptotic analysis of recursive stochastic algorithms (Tech. Rep. No. 7522). Lund, Sweden: Lund Institute of Technology. Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology (London), 202, 437–470. Marr, D. (1970). A theory for cerebral neocortex. Proceedings of the Royal Society of London B, 176, 161–234. Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London B, 262, 23–81. Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615. Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126. Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press. Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294. Prestige, M. C., & Willshaw, D. J. (1975). On a role for competition in the formation of patterned neural connexions. Proceedings of the Royal Society of London B, 190, 77–98. Rasmussen, C. E., & Willshaw, D. J. (1993). Presynaptic and postsynaptic competition in models for the development of neuromuscular connections. Biological Cybernetics, 68, 409–419. Rumelhart, D., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75–112. Sibson, R. (1969). Information radius. Wahrscheinlichkeitstheorie, 14, 149–160. Swindale, N. V. (1980). A model for the formation of ocular dominance stripes. Proceedings of the Royal Society of London B, 208, 243–264. von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
936
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
von der Malsburg, C., & Willshaw, D. J. (1981). Differential equations for the development of topological nerve fibre projections. SIAM-AMS Proceedings, 13, 39–47. Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record (Vol. 4, pp. 96–104). New York: IRE. Willshaw, D. J. (1972). A simple network capable of inductive generalisation. Proceedings of the Royal Society of London B, 182, 233–247. Willshaw, D. J. (1981). The establishment and the subsequent elimination of polyneural innervation of developing muscle: theoretical considerations. Proceedings of the Royal Society of London B, 212, 233–252. Willshaw, D. J., & Buckingham, J. T. (1990). An assessment of Marr’s theory of the hippocampus as a temporary memory store. Proceedings of the Royal Society of London B, 329, 205–215. Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431–445. Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society of London B, 287, 203–243.
Received November 10, 1995; accepted July 11, 1996.
ARTICLE
Communicated by Terrence Sanger
Hybrid Learning of Mapping and Its Jacobian in Multilayer Neural Networks Jeong-Woo Lee Machine Control Laboratory, Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Taejeon, Republic of Korea
Jun-Ho Oh Department of Mechanical Engineering, KAIST, Taejeon, Republic of Korea
There are some learning problems for which a priori information, such as the Jacobian of mapping, is available in addition to input-output examples. This kind of information can be beneficial in neural network learning if it can be embedded into the network. This article is concerned with the method for learning the mapping and available Jacobian simultaneously. The basic idea is to minimize the cost function, which is composed of the mapping error and the Jacobian error. Prior to developing the Jacobian learning rule, we develop an explicit and general method for computing the Jacobian of the neural network using the parameters computed in error backpropagation. Then the Jacobian learning rule is derived. The method is simpler and more general than tangent propagation (Simard, Victorri, Le Cun, & Denker, 1992). Through hybridization of the error backpropagation and the Jacobian learning, the hybrid learning algorithm is presented. The method shows good performance in accelerating the learning speed and improving generalization. And through computer experiments, it is shown that using the Jacobian synthesized from noise-corrupted data can accelerate learning speed. 1 Introduction Multilayer feedforward neural networks can be trained to learn a continuous function with desired accuracy (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989). Since the publication of Rumelhart, Hinton, & Williams (1986), many improved learning algorithms have been reported in the literatures, and various learning methods have been reviewed (Ergezinger & Thomsen, 1995; Jin, Nikiforuk, & Gupta, 1995; Baba, Mogami, Shiraishi, & Yamashita, 1995). The multilayer neural network, with various squashing functions, can approximate even first-order derivatives (Gallant & White, 1992; Hornik et al., 1990; Cardaliaguet & Euvrard, 1992). The method of learning first-order derivatives has not been clearly explored, however, and among the various types of learning methods, the error backpropagation Neural Computation 9, 937–958 (1997)
c 1997 Massachusetts Institute of Technology °
938
Jeong-Woo Lee and Jun-Ho Oh
learning rule (delta learning rule) and its variants are widely accepted for their simplicity and relatively good performance in training multilayer neural networks. Consider the following mapping: 9 : X 7→ Y
X ∈ RK and Y ∈ RJ .
(1.1)
Assume that we are to train the neural network with a sufficient number of mapping examples taken from 9. Suppose further that we have additional a priori information on the Jacobian of 9 even for a small number of the mapping examples: 8 : X 7→
∂Y ∂X
X ∈ RK and
∂Y ∂X
∈ RK×J .
(1.2)
In other words, suppose that we have some Jacobian at some data points, which is very important information on 9. To see the effect of Jacobian learning on neural network properties, consider single input–single output case. Suppose that sufficient learning data (xi , 9(xi )) and validation data (xi + δx, 9(xi + δx)) are provided (i = 1, 2, . . . , n). Then the learning error(JL ) and the validation error(JV ) can be expressed as: JL =
n 1X ˆ i ))2 (9(xi ) − 9(x n i=1
(1.3)
JV =
n 1X ˆ i + δx))2 (9(xi + δx) − 9(x n i=1
(1.4)
ˆ where 9(x) denotes neural network realization of 9(x). In this case, if the neural network learned the plant with best generalization, then it is natural that we expect JL ' JV . Therefore, although various measures are suggested and investigated (Niyogi & Girosi, 1996; Freeman & Saad, 1995), as candidates for generalization measures, we suggest using JG = n1 (JL − JV )2 , which can be expanded as n 1X ˆ i ))2 − (9(xi + δx) − 9(x ˆ i + δx))2 )2 ((9(xi ) − 9(x n i=1 !2 à n ˆ i )2 ¯¯ 1X ∂(9(x) − 9(x ' δx ¯ x=xi n i=1 ∂x à !2 n ¯ ˆ 4X ∂ 9(x) ∂9(x) ¯¯ ¯ 2 ˆ i )) = (9(xi ) − 9(x − δx2 . ¯ ¯ {z } n i=1 | ∂x x=xi ∂x x=xi term−A | {z }
JG =
term−B
(1.5)
Mapping and Its Jacobian
939
In this equation, term-A is related to the mapping error and term-B to the Jacobian error. If term-A is small but term-B remains large after learning, we cannot expect a good generalization because JG remains large. Although this analysis is rough, we can expect that better generalization can be obtained through additional Jacobian learning, which minimizes term-B in equation 1.5. Furthermore, because Jacobian represents the approximate inclination of mapping near a given point, we can expect that Jacobian learning can have the effect of learning several examples near the point, and thus we can expect fast learning speed through Jacobian learning. This article focuses on the following subjects: • A method of computing the Jacobian in a multilayer neural network. Computing the Jacobian in an efficient and explicit way is important because it is used to derive the Jacobian learning rule. • A method of learning the Jacobian in multilayer neural networks to use the available Jacobian for neural network learning. • An investigation of the effect of learning the Jacobian through computer experiments. In section 2, we derive the method of computing the Jacobian in terms of neural network parameters, which becomes very efficient if we use the error backpropagation learning rule. The derived explicit equation is used to derive the Jacobian learning rule. In section 3, we derive the Jacobian learning rule that minimizes the Jacobian error between the Jacobian of the neural network and the Jacobian provided. In section 4, we develop a hybrid learning method that minimizes the cost of the mapping error and the Jacobian error. In section 5, we present some computer experiments to demonstrate the effect of Jacobian learning. And finally, in section 6, we draw some conclusions and discuss some related topics. 2 Computing the Jacobian of a Multilayer Neural Network In this section, we derive the method for computing the Jacobian of a multilayer neural network while learning the mapping by error backpropagation. There has been some evidence that the Jacobian of a multilayer neural network can be calculated using the delta and other related quantities (Jordan & Rumelhart, 1992; White & Jordan, 1992; Iwata & Kitamura, 1994). We will summarize and arrange the method in a formal and explicit way. The equations for computing the Jacobian will be used for deriving the Jacobian learning rule. We begin by explaining the notations we use throughout the article. See Figure 1 for a simplified structure of the neural network. Let ˆ and Y denote the neural network input vector, the neural network X, Y, output vector, and the output vector of the underlying mapping from X,
940
Jeong-Woo Lee and Jun-Ho Oh
Figure 1: Simplified structure of the multilayer neural network. All layers with two nodes are assumed.
respectively. We will use the following notations: X ≡ [x1 , x2 , . . . , xK ]t ¤t £ Y ≡ y1 , y2 , . . . , yJ £ ¤t Yˆ ≡ yˆ 1 , yˆ 2 , . . . , yˆ J
(2.1)
yˆ i =
(2.4)
Op ≡ p oj p
=
fj ≡
ϕiL ( fiL h
+
θiL )
(2.2) (2.3)
≡ oLi i p t
p p o1 , o2 , . . . , ohp p p p ϕj ( fj + θj )
X
p−1
ok
(2.5) (2.6)
p
wjk
(2.7)
k
ˆ respectively. (L − where K and J denote the dimension of X and Y (or Y), 1), hp , ϕji , fji , θji , and oji denote the number of hidden layers, number of nodes in the pth layer, the sigmoidal function of node j in layer i, input to the sigmoidal function, and bias and output of node j in layer i, respectively. Superscript t means the transpose of a vector or a matrix. Let 2k and W k denote the bias input vector at the kth layer and the weight matrix in the kth layer, respectively. (The 0th layer denotes the input layer and the Lth layer denotes the output layer in the L layered neural network.) h it 2k ≡ θ1k , θ2k , . . . , θhkk Wk ≡
wk11 wk21 .. . wkhi 1
wk12 wk22 .. . ···
wk13 wk23 .. . ···
(2.8) ··· ··· .. . ···
wk1hi−1 wk2hi−1 .. . wkhi hi−1
(2.9)
Mapping and Its Jacobian
941
where wjik denotes the weight from neuron i in the (k − 1)th layer to neuron j in the kth layer. The weight in the first layer implicitly means the weights from the input to the first hidden layer. Define the input vector to the hth layer node as h i Fh ≡ f1h , f2h , . . . , fhhh = W h Oh−1 ,
(2.10)
and define the normalized error as e¯q ≡ colq (IJ×J )
(2.11)
where colq (A) denotes the qth column vector taken from the matrix A, and IJ×J denotes the J × J identity matrix. Define the normalized delta as "
0
ϕji e¯q 0 P ϕji m δ¯q(i+1)m wi+1 mj h it i i1 i2 ih ¯ q ≡ δ¯q δ¯q · · · δ¯q i 1
ij δ¯q
=
when layer i is output when layer i is hidden
(2.12) (2.13)
ij where, in δ¯q , i and j denote the layer number and the node number, respecij tively. The subscript q in δ¯q denotes δ i (defined in equation 4.5) calculated j
ij by the delta learning rule when the error is set to e¯q . The terms e¯q and δ¯q are introduced to take the multiple input, multiple output (MIMO) system into account. Now we can state the first theorem, which provides an easy way of computing Jacobian from neural network parameters.
Theorem 1 Computing the Jacobian of a Multilayer Neural Network. Let ˆ defined in equation 2.3, denote the input vector X, defined in equation 2.1, and Y, and the output vector of a multilayer neural network. Then the following relation holds, ∂ Yˆ = ∂X
¯ 1 )t W 1 (1 1 ¯ (112 )t W 1 .. . ¯ 1 )t W 1 (1 J
(2.14)
where the notations are given above. (See the appendix of this article for the proof.) This theorem is known and used by some authors (Jordan & Rumelhart, 1992; White & Jordan, 1992; Iwata & Kitamura, 1994). Although they address the single-output case, their method can easily be extended to the MIMO
942
Jeong-Woo Lee and Jun-Ho Oh
case. We summarize it to express the formula in an explicit way, because an explicit formula is indispensable for deriving the Jacobian learning rule. It is clear from equations 2.12 and 2.14 that the Jacobian matrix should be computed row by row by setting the corresponding normalized error, which involves setting corresponding output to 1 at a time. If we compute the Jacobian under the online computation of the error backpropagation 0 learning rule, ϕji and ϕji are computed to adjust the weights. Therefore, it is not so burdensome to compute the Jacobian under the online learning by error backpropagation. If the mapping to be learned is the multipleˆ which can be input, single-output (MISO), then δ¯ is identical with δ/y − y, computed at the final stage of error learning. In this case, the theorem can be stated in a simpler way: ∂ yˆ = (11 )t W 1 . ∂X
(2.15)
(In this case, since the output is one, the subscript in 1 is omitted.) The computation of the Jacobian can be simplified to one multiplication of the ¯ 1. weight matrix W 1 and the vector 1 3 Derivation of the Jacobian Learning Rule Simard et al. (1992) showed that for MISO cases, gradient learning combined with error learning can improve learning speed. Their method requires computing tangent propagation. The method cannot handle missing data in Jacobian, because it requires gradient forward propagation and more computation time. In this section, we derive an efficient Jacobian learning rule that is applicable to MIMO cases and does not require gradient forward propagation in MISO cases. The learning rule can be derived by minimizing the following cost function: Jg ≡
1X (gkl − gˆ kl )2 2 kl
where
gkl =
∂yl ∂xk
gˆ kl =
∂ yˆ l . (3.1) ∂xk
To express the Jacobian learning rule in a compact recursive form, define the following quantities: 1 when i = 1 ipq w1lp when i = 2 (3.2) µkl = P (i−1)pq i−1 (i−2)0 wlr ϕr when i ≥ 3 r µlr X ik (i+1)m i+1 ¯ dq = δq wmk (3.3) m
In equations 3.2 and 3.3, the subscripts p and q denote that we are working to learn ∂yq /∂xp .
Mapping and Its Jacobian
943
To derive the Jacobian learning rule, we use the steepest descent method, which is also used for backpropagation: 1 ∂(gpq − gˆpq )2 ηg = − 1g wm kl 2 ∂wm kl = ηg
∂ gˆpq (gpq − gˆpq ) . ∂wm kl
(3.4)
m We need only to compute ∂ gˆpq /∂wm kl to derive 1 g wkl . For the case of m = 1 (first layer),
∂ gˆpq ∂w1kp
P
=
∂(
¯1i 1 i δq wip ) ∂w1kp
= δ¯q1k +
∂ δ¯q1k ∂w1kp
w1kp
00 = δ¯q1k + xp ϕk1 w1kp d1k q .
(3.5)
For the case of m = 2, similarly, ∂ gˆpq ∂w2kl
P
= =
¯1i 1 i δq wip ) ∂w2kl P 1 10 P 2r 2 ¯ ∂( i wip ϕi r δq wri ) ∂w2kl ∂(
0 = w1lp ϕl1 δ¯q2k + 0 = w1lp ϕl1 δ¯q2k +
X
0
ϕi1 w1ip
i 1 200 2k ol ϕk dq
∂ δ¯q2k
w2 ∂w2kl ki X 0 ϕi1 w1ip w2ki .
(3.6)
X 0 2 2 (ϕm wmi w3km ).
(3.7)
i
Similarly, for the case of m = 3, ∂ gˆpq ∂w3kl
0 = δ¯q3k ϕl2
X i 00
0
ϕi1 w1ip w2li
+o2l ϕk3 d3k q
X i
0
ϕi1 w1ip
m
We can obtain the weight adaptation rule for the Jacobian learning by substituting equations 3.5 through 3.7 into equation 3.4. For the biases, a similar adaptation rule for Jacobian learning can be obtained. In fact, the Jacobian
944
Jeong-Woo Lee and Jun-Ho Oh
leaning rule, including m = 1, 2, and 3, can be generalized as follows: m−1 m00 mk (m+1)pq ¯mk m0 mpq 1g wm ϕk dq µkl )(gpq − gˆpq ). kl = η g (δq ϕl µkl + ol | {z } | {z }
1g θkm
=
term1 m00 mk (m+1)pq ηg ϕk dq µkl (gpq
(3.8)
term2
− gˆpq ).
(3.9)
In equation 3.8, o0l means the inputs of neural network, xl . The derivation of the Jacobian learning rule for the general case in equations 3.8 and 3.9 can be found in the appendix. In equation 3.8, term1 represents the effect of the weight variation to the Jacobian of the neural network in forward direction. And term2 represents the effect of variation of δ¯ in the mth layer neuron. Because δ¯ affects the Jacobian of the neural network, term2 cannot be neglected. As can be seen in the above Jacobian learning rule, the ith layer of µ in equation 3.2 is computed using their value at the (i−1)th layer, so the direction of Jacobian learning is forward. The terms needed for computing the Jacobian, that is, δ¯ in equations 4.5 and 2.13, are computed in a backward direction. This is the reverse direction to that of error learning. 4 Hybrid Learning of Mapping and Its Jacobian If we set the cost function as follows, we can derive the weight adaptation rule, which minimizes the mapping error and the Jacobian error: J ≡ λe Je + λg Jg 1X Je ≡ (yk − yˆ k )2 . 2 k
where
(4.1) (4.2)
In equation 4.1, λe and λg denote the weighting coefficients for the mapping error and the Jacobian error, which may vary with iteration of learning. Weight correction, 1wm kl , to the direction that minimizes J is assumed to be decomposed as m m 1wm kl = λe 1e wkl + λ g 1 g wkl ,
(4.3)
where 1e wm kl denotes the weight correction to the direction that minimizes Je , and 1g wm kl denotes the weight correction to the direction that minimizes is calculated according to error backpropagation (Rumelhart et al., Jg . 1e wm kl 1986), 1e wm kl = −ηe
∂ Je ∂wm kl
= ηe δkm om−1 , l
(4.4)
Mapping and Its Jacobian
945
Figure 2: Direction of the weight correction. The two-dimensional weights are assumed.
where δk is defined as · δkm =
0
ek ϕkL 0 P ϕkm l δlm+1 wm+1 lk
when neuron k is output, ek = yk − yˆ k (4.5) when neuron k is hidden.
The following steps are required for the hybrid learning of the mapping and its Jacobian: 1. Compute neural network outputs(yˆj ) and mapping errors in the forward direction. 2. Compute δ¯m and 1e w in the backward direction. k
3. Compute Jacobian errors (Eg ) at the end of step 2. 4. Correct weights according to equation 4.3 while computing δ¯ijm and 1g w in the forward direction. The weight corrections λe 1e W and λg 1g W can conflict with each other. As shown in Figure 2, if the angle between λe 1e W and λg 1g W is relatively large and their magnitudes are approximately same, then the resultant weight correction, 1W, can be the direction that can minimize neither Je nor Jg . In this case, tuning the weighting coefficients λe and λg is recommended. If one is to weight the error, then it is better to increase λe relative to λg . But in many cases, the weighting coefficients can be set to constant because, after some iteration of learning, the two directions of the weight correction can approximately coincide with each other. If λe is very small compared with λg , then almost all learning is done for Jacobian learning. In this case, the learned neural network will have a function shape very similar to the underlying mapping, but the mapping error may be large. Clearly the tuning of the two weighting coefficients is important.
946
Jeong-Woo Lee and Jun-Ho Oh
5 Computer Experiments 5.1 Experiment 1: Comparison of Learning Speed. In this experiment, we choose static mapping with one relatively sharp corner as an underlying mapping. The mapping to be learned by the neural network is generated by µ
(x − 0.05)2 y = −13(x − 0.5) + 8 exp − 0.01 2
¶ .
(5.1)
The shape of the mapping to be learned is shown in Figure 3. The neural networks are constructed as follows: • One hidden layer with 60 neurons; bias inputs to hidden and output nodes. • Squashing function: Hyperbolic tangent for hidden layers, linear output node. • Error learning rate, ηe = 0.01; Jacobian learning rate, ηg = 0.001. • Number of iterations for learning: 2000 epochs. For each epoch, 80 samples are presented. • Initial weights: Random number between −1 and 1. • Learning examples: Chosen randomly within the range of −0.3 ≤ x ≤ 1.3. • Weighting coefficients(λe and λg ): Set to unity. No additional techniques are used to accelerate the learning speed. To compare the differences in the learning properties easily, we set the topology of the neural network and the learning rate identical for error backpropagation and hybrid learning. The only difference is whether Jacobian learning is added. The neural network mapping error(Je ) is shown in Figure 4. The top figure is the result of conventional error backpropagation, and the bottom figure is that of hybrid learning. The middle figure is the sum of the squared error(Je ) by assuming that the Jacobian is available on 50 percent of learning data. Careful investigation of the results leads to the following conclusions: • Even we take the computation time of hybrid learning (almost twice that of error backpropagation in this experiment) into consideration, learning is far more accelerated with the additional Jacobian learning. • The result of the hybrid learning improves as more Jacobian information is available. 5.2 Experiment 2: Use of the Jacobian Synthesized from Noise-Corrupted Data. In this experiment, we investigate the possibility of using the
Mapping and Its Jacobian
947
6
4
2
y
0
−2
−4
−6
−8
−10 −0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
x
Figure 3: Shape of the underlying mapping used in experiments 1 and 2.
Jacobian synthesized from the noise-corrupted data, a similar experiment with a real-world problem. In real data sets, the Jacobian may not be provided, and even the data set may be corrupted by measurement noise. Using the data set in experiment 1, 80 noise-corrupted samples are generated by the following equation: ¶ µ xi − 0.05)2 2 yi = −13(xi − 0.5) + 8 exp − 0.01 + 0.5 × randn i = 1, 2, . . . , 80 xi = −0.3 + 0.2i
i = 1, 2, . . . , 80
where randn denotes normal gaussian noise. The underlying mapping and noise-corrupted data are shown in Figure 5. To extract the rough Jacobian, we use the following simple formula: µ ¶ 1 yi − yi−1 yi+1 − yi = + . (5.2) gi 2 xi − xi−1 xi+1 − xi Because the Jacobian obtained in this way is very noisy, we use the following simple averaging filter: gi | f iltered =
gi−1 + 2gi + gi+1 . 4
(5.3)
Figure 6 shows the added noise (top), the synthesized Jacobian with equation 5.2 (middle panel,) and the filtered Jacobian (bottom panel). Because the
948
Jeong-Woo Lee and Jun-Ho Oh
Network Error(0%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
1400
1600
1800
2000
1400
1600
1800
2000
1400
1600
1800
2000
Network Error(50%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
Network Error(100%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
Figure 4: Neural network mapping error (Je ) during 2000 epochs of learning in experiment 1. Top: µe = 0.01, µg = 0. No Jacobian learning. Middle: µe = 0.01, µg = 0.001. Jacobian learning for 50 percent of the examples. Bottom: µe = 0.01, µg = 0.001. Jacobian learning for all examples.
Mapping and Its Jacobian
949
6
4
2
0
−2
−4
−6
−8
−10 −0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Figure 5: Plot of the underlying mapping (smooth curve) and the noisecorrupted data set in experiment 2.
synthesized Jacobian is not the exact Jacobian of the underlying mapping, learning of filtered Jacobian at the fine-tuning stage is undesirable and may lead to an inferior model, so we tune the learning rate as follows:
0.001 decreased linearly ηg = from 0.001 to 0. 0
0.01 ηe = decreased linearly from 0.001 to 0.
0 < epoch ≤ 500 500 < epoch ≤ 1000 1000 < epoch ≤ 2000
(5.4)
0 < epoch ≤ 1000 1000 < epoch ≤ 2000.
(5.5)
All the other conditions except the learning rate are the same as in experiment 1. The neural network mapping error (Je ) is shown in Figure 7. Although we believe that hybrid learning requires two passes of weight updates, the result shows that the learning is much faster, by three to four times, compared with that of error backpropagation. This experiment indicates that the use of the filtered Jacobian synthesized from the raw data can be effective in accelerating learning; however, the structure of the filter should vary according to the characteristics of the given data set. 5.3 Experiment 3: Suppression of Overfitting. In this experiment, we investigate the ability to prevent overfitting by additional Jacobian learning.
950
Jeong-Woo Lee and Jun-Ho Oh (a) Data Noise 2 1 0 −1 −0.5
0 0.5 1 (b) Extracted Jacobian and Jacobian of underlying mapping
1.5
0 0.5 1 (c) Filtered extracted Jacobian and Jacobian of underlying mapping
1.5
100
0
−100 −0.5 100
0
−100 −0.5
0
0.5
1
1.5
Figure 6: Experiment 2. (a) Noise added to the underlying mapping. (b) Jacobian synthesized from the noise-corrupted data by using equation 5.2. The smooth curve in the figure represents the Jacobian of underlying mapping (see equation 5.1). (c) Filtered Jacobian shown in (b) with simple averaging filter by using equation 5.3. The smooth curves in the figure represent the Jacobian of the underlying mapping (see equation 5.1).
For this purpose, the experiment is designed to fit a straight line with a small number of examples. Experimental conditions were as follows: • For the neural network, there was one hidden layer with 100 neurons of hyperbolic tangent with bias and one linear output node with bias. The error learning rate was 0.0005, and the Jacobian learning rate was 0.0005. There were 20,000 iterations for learning. • Seven sample data and seven validation data, which were noise corrupted, were generated by the following equations: xi = 0.4i + rand
i = 1, 2, . . . , 7
yi = 0.5xi + rand
i = 1, 2, . . . , 7
where rand denotes the random number between −1 and 1. • Learning examples were shown in a random sequence.
Mapping and Its Jacobian
951 Network Error
4
10
Sum−Squared Error
3
10
2
10
1
10
0
100
200
300
400
500 Epoch
600
700
800
900
1000
700
800
900
1000
Network Error
4
10
Sum−Squared Error
3
10
2
10
1
10
0
100
200
300
400
500 Epoch
600
Figure 7: Neural network mapping error (Je ) in experiment 2. Top: Result of conventional error backpropagation. Bottom: Result of the proposed method by using the filtered Jacobian (see equation 5.3) synthesized from the noisecorrupted data.
• For each learning of the sample points in Jacobian learning, the noisecorrupted Jacobian was provided by the following equation: g = 0.5 + randn × 0.3 where randn denotes normal gaussian noise. • The weighting coefficients were set to unity.
(5.6)
952
Jeong-Woo Lee and Jun-Ho Oh
No other modifications were attempted so that we could see the effects of the Jacobian learning more clearly. As in the previous experiment, the same topology of the neural network was used, and the only difference was whether the Jacobian learning was added. The learning error(JL ) and validation error (JV ) versus learning epoch are shown in Figures 8 and 9. The result of neural network learning and the average of the squared distance between the neural network and underlying mapping (y = 0.5x) is also shown. The learning error (JL ) and validation error (JV ) are computed by summing the square of error at the points. The averages of JL and JV in the last 100 iterations for error learning are 0.51 and 4.41, respectively. Those for hybrid learning are 1.88 and 3.38, respectively. The difference between JL and JV in hybrid learning (1.50) is smaller than that in error learning (3.90). A comparison of the two sets of figures shows that the mean distance between the neural network and underlying mapping in error backpropagation is larger than that in hybrid learning. Although the learning error for learning samples in hybrid learning is larger than that in error backpropagation, there is better generalization. These observations support the hypothesis set out in section 1 regarding the effect of Jacobian learning on generalization. 6 Conclusions We have proposed an algorithm that learns the mapping and its Jacobian. For this purpose, we derived the explicit form of computing the Jacobian using the neural network parameters, and we derive the hybrid learning algorithm that minimizes the cost function consisting of mapping error and Jacobian error. The proposed Jacobian learning method is similar to that of the error backpropagation learning rule but in the opposite direction of calculation. Computer experiments proved that the proposed method is faster than conventional error backpropagation, because the proposed hybrid learning algorithm learns to minimize both the mapping error and the Jacobian error simultaneously. We show as well the possibility that using the rough Jacobian synthesized from noise-corrupted raw data set can accelerate learning speed and that Jacobian learning improves generalization through computer experiment. Appendix A.1 Proof of Theorem 1. Proof. From the definition of the feedforward neural network (at output layer), yˆj = ϕjL ( fjL + θjL )
Mapping and Its Jacobian
953
3
1
10
2.5
Mean Squared Distance
2
1.5
y
1
0.5
0
10
−1
10
0
-0.5
-1 -2
−2
-1
0
1
2
3
4
5
6
10
7
0
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
x
2
2
10
10
1
1
10
Learning Error
Validation Error
10
0
0
10
10
−1
10
0
−1
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
10
0
Figure 8: Experiment 3: Error backpropagation learning. Upper left: Learned result of neural network. Dotted line: underlying mapping, o: validation data, *: sample learning data. Upper right: Mean squared distance between neural network and underlying mapping (−0.5 < x < 6.5). Lower left: Sum of squared error for learning data. Lower right: Sum of squared error for validation data.
à =
X
ϕjL
X
= ϕjL
L L−1 wjk ϕk
k
à =
+
θjL
k
Ã
ϕjL
! L L−1 wjk ok
X
³
fkL−1
+
θkL−1
´
! + θjL
ÃÃ L L−1 wjk ϕk
k
X i
! L−2 wL−1 ki oi
! +
θkL−1
! +
θjL
.
to get We differentiate the outputs directly by oL−2 i ∂ yˆj ∂oL−2 i
= ϕjL
0
X L L−10 L−1 (wjk ϕk wki ) ≡ gˆ L−2 ij . k
(A.1)
954
Jeong-Woo Lee and Jun-Ho Oh
3
1
10
2.5
Mean Squared Distance
2
1.5
y
1
0.5
0
10
−1
10
0
-0.5
-1 -2
−2
-1
0
1
2
3
4
5
6
10
7
0
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
x
2
2
10
10
1
1
10
Learning Error
Validation Error
10
0
0
10
10
−1
10
0
−1
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
10
0
Figure 9: Experiment 3: Hybrid learning. Upper left: Learned result of neural network. Dotted line: underlying mapping; o: validation data; *: sample learning data. Upper right: Mean squared distance between neural network and underlying mapping (−0.5 < x < 6.5). Lower left: Sum of squared error for learning data. Lower right: Sum of squared error for validation data.
At the output layer, normalized delta is expressed as h it L0 ¯ Lm ≡ 0, . . . , 0, ϕm , 0, . . . , 0 , 1
(A.2)
0
L is the first-order derivative of the activation function of the mth in which ϕm element and the superscript L denotes the output layers. From equation 2.12,
(L−1)k = ϕk(L−1) δ¯m
0
X
Lj L δ¯m wjk
j 0
L L = ϕk(L−1) ϕm wmk 0
(A.3)
Mapping and Its Jacobian
955
and h i (L−1)hL−1 t (L−1)1 ¯ (L−1)2 ¯ L−1 ≡ δ¯m , δm , . . . , δ¯m 1 m h it 0 0 0 L0 L L0 L L0 L = ϕ1(L−1) ϕm wm1 , ϕ2(L−1) ϕm wm2 , . . . , ϕh(L−1) ϕ w , m mh L−1 L−1
(A.4)
where hL−1 is the number of nodes in the (L − 1)th hidden layer. If we apply the theorem, assuming (L − 1) to be the only hidden layer and (L − 2) the input layer, t ¯ L−1 ((W L−1 )t 1 m ) L−1 w11 ··· wL−1 1hL−2 wL−1 · · · · ·· 21 t ¯ L−1 = (1 .. .. .. m ) . . . L−1 · · · w wL−1 h(L−1) 1 hL−1 hL−2 h i ˆ L−2 ˆ L−2 = gˆ L−2 ≡ Gˆ L−2 m 1m , g 2m , . . . , g hL−2 m .
(A.5)
Equation A.5 denotes the gradient of yˆ m with respect to OL−2 . If we assume that the only hidden layer is (L − 1) and layer (L − 2) is the input layer, equation A.5 is equivalent to the Jacobian of the output vector with respect in equation A.5 using equation A.4, to the input vector. If we rewrite gˆ L−2 ij à gˆ L−2 ij
=
0 ϕjL
X
! L L−10 wL−1 ki wjk ϕk
.
(A.6)
k
Equations A.1 and A.6 are identical for all possible choice of j. This proves the theorem when only one hidden layer exists in the neural network. Next, we prove that if ∂ yˆj ∂Ol
¯ l+1 )t W l+1 = (1 j
(A.7)
is true, then ∂ yˆj ∂Ol−1
¯ l )t W l = (1 j
(A.8)
also holds. This means that if the theorem holds for (L − l − 1) hidden layers, from it holds for (L−l) hidden layers. If we extract one component ∂ yˆj /∂ol−1 i equation A.8, ∂ yˆj ∂ol−1 i
=
X k
δ¯jlk wlki
956
Jeong-Woo Lee and Jun-Ho Oh
=
X
à 0 ϕkl wlki
X m
k
! δ¯j(l+1)m wl+1 . mk
(A.9)
And if we apply the chain rule and use equations 2.5 through 2.7, ∂ yˆj ∂ol−1 i
=
∂ yˆj ∂Ol ∂Ol ∂ol−1 i
(using equation A.7)
¯ l+1 )t W l+1 = (1 j
∂Ol ∂ol−1 i
.
(A.10)
Furthermore, the following two equations hold: P
l+1 ¯ (l+1)m m wm1 δj P l+1 ¯ (l+1)m m wm2 δj
= P
¯ l+1 )t W l+1 (1 j
.. .
l+1 ¯ (l+1)m m wmhl δj
.
h 0 it 0 0 = ϕ1l wl1i , ϕ2l wl2i , . . . , ϕhl l wlhl i .
∂Ol ∂ol−1 i
(A.11)
(A.12)
If we substitute equations A.11 and A.12 into equation A.10, then we can easily see the result is identical with equation (A.9). Thus, by induction, the theorem is proved for any number of hidden layers. A.2 Derivation of Jacobian Learning Rule for General Case. In the following derivation, we use the notation that, in subscript im , m is a variable. For example, if m = 3, subscript im−1 denotes subscript i2 . For any m, we can rewrite gˆpq as X δ¯q1i1 w1i1 p gˆpq = i1
A.2 Derivation of Jacobian Learning Rule for General Case. In the following derivation, we use the notation that, in subscript $i_m$, $m$ is a variable. For example, if $m = 3$, subscript $i_{m-1}$ denotes subscript $i_2$. For any $m$, we can rewrite $\hat{g}_{pq}$ as

$$\hat{g}_{pq} = \sum_{i_1} \bar{\delta}_q^{1 i_1} w_{i_1 p}^1 = \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \bar{\delta}_q^{2 i_2} w_{i_2 i_1}^2 = \cdots = \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} \sum_{i_m} \bar{\delta}_q^{m i_m} w_{i_m i_{m-1}}^m. \tag{A.13}$$
And from equation 2.12,

$$\frac{\partial \bar{\delta}_q^{ij}}{\partial w_{jl}^i} = o_l^{i-1} \varphi_j^{i''} \sum_t \bar{\delta}_q^{(i+1)t} w_{tj}^{i+1}. \tag{A.14}$$
Then we can write:

$$\begin{aligned}
\frac{\partial \hat{g}_{pq}}{\partial w_{kl}^m} &= \bar{\delta}_q^{mk} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-2}} \varphi_{i_{m-2}}^{(m-2)'} w_{i_{m-2} i_{m-3}}^{m-2} \, \varphi_l^{(m-1)'} w_{l i_{m-2}}^{m-1} \\
&\quad + \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \, \frac{\partial \bar{\delta}_q^{mk}}{\partial w_{kl}^m} \\
&= \bar{\delta}_q^{mk} \varphi_l^{(m-1)'} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-2}} \varphi_{i_{m-2}}^{(m-2)'} w_{i_{m-2} i_{m-3}}^{m-2} w_{l i_{m-2}}^{m-1} \\
&\quad + o_l^{m-1} \varphi_k^{m''} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \sum_t \bar{\delta}_q^{(m+1)t} w_{tk}^{m+1} \\
&= \bar{\delta}_q^{mk} \varphi_l^{(m-1)'} \mu_{kl}^{m p q} + o_l^{m-1} \varphi_k^{m''} d_q^{mk} \mu_{kl}^{(m+1) p q}.
\end{aligned} \tag{A.15}$$
If we substitute equation A.15 into equation 3.4, we can obtain equation 3.8. Similar derivations can be made for equation 3.9. From equation 2.12,

$$\frac{\partial \bar{\delta}_q^{ij}}{\partial \theta_j^i} = \varphi_j^{i''} \sum_t \bar{\delta}_q^{(i+1)t} w_{tj}^{i+1}. \tag{A.16}$$
Using equations A.13 and A.16, we can write:

$$\begin{aligned}
\frac{\partial \hat{g}_{pq}}{\partial \theta_k^m} &= \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} \, \frac{\partial \left( \sum_{i_m} \bar{\delta}_q^{m i_m} w_{i_m i_{m-1}}^m \right)}{\partial \theta_k^m} \\
&= \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \, \varphi_k^{m''} \sum_t \bar{\delta}_q^{(m+1)t} w_{tk}^{m+1} \\
&= \varphi_k^{m''} d_q^{mk} \mu_{kl}^{(m+1) p q}.
\end{aligned} \tag{A.17}$$
If we substitute equation A.17 into equation 3.4, we can obtain equation 3.9.
Received February 6, 1996; accepted September 10, 1996.
NOTE
Communicated by Ken Miller
The Joint Development of Orientation and Ocular Dominance: Role of Constraints Christian Piepenbrock Department of Computer Science, Technical University of Berlin, Berlin, Germany
Helge Ritter Faculty of Technology, University of Bielefeld, Bielefeld, Germany
Klaus Obermayer Department of Computer Science, Technical University of Berlin, Berlin, Germany
Correlation-based learning (CBL) has been suggested as the mechanism that underlies the development of simple-cell receptive fields in the primary visual cortex of cats, including orientation preference (OR) and ocular dominance (OD) (Linsker, 1986; Miller, Keller, & Stryker, 1989). CBL has been applied successfully to the development of OR and OD individually (Miller, Keller, & Stryker, 1989; Miller, 1994; Miyashita & Tanaka, 1991; Erwin, Obermayer, & Schulten, 1995), but the conditions for their joint development have not been studied (but see Erwin & Miller, 1995, for independent work on the same question), in contrast to competitive Hebbian models (Obermayer, Blasdel, & Schulten, 1992). In this article, we provide insight into why this has been the case: OR and OD decouple in symmetric CBL models, and a joint development of OR and OD is possible only in a parameter regime that depends on nonlinear mechanisms.

1 The Correlation-Based Learning Model

Following the CBL approach (Linsker, 1986; Miller, 1990), we consider four populations $i = 1, 2, 3, 4$ of geniculate "input" cells, which we call left-eye-ON, left-eye-OFF, right-eye-ON, and right-eye-OFF, respectively, and one type of cortical cell receiving the afferents. Cells are modeled as connectionist-type neurons and are arranged in two-dimensional layers, where $\vec{\alpha}, \vec{\beta}$ denote geniculate locations and $\vec{x}, \vec{y}$ specify cortical positions (in the same coordinate system). The geniculo-cortical projection is described by synaptic strengths $S^i_{\vec{\alpha},\vec{x}}$ for the connections between geniculate cells $(i, \vec{\alpha})$ and cortical cells $\vec{x}$. The activity-driven development of $S^i_{\vec{\alpha},\vec{x}}$ is generally described by a Hebbian model (Miller, 1990),

$$\frac{d}{dt} S^i_{\vec{\alpha},\vec{x}}(t) = A_{\vec{\alpha},\vec{x}} \sum_{j,\vec{\beta},\vec{y}} I_{\vec{x},\vec{y}} \, C^{ij}_{\vec{\alpha},\vec{\beta}} \, S^j_{\vec{\beta},\vec{y}}(t) - \epsilon A_{\vec{\alpha},\vec{x}} - \gamma S^i_{\vec{\alpha},\vec{x}}(t), \tag{1.1}$$
where:

• $A_{\vec{\alpha},\vec{x}}$ is the localized arbor function, which ensures topography and is established by an intrinsic process before activity-driven processes set in. $A_{\vec{\alpha},\vec{x}}$ represents the synaptic density between a geniculate neuron at $\vec{\alpha}$ (independent of layer $i$) and a cortical neuron at $\vec{x}$.

• $I_{\vec{x},\vec{y}}$ denotes the effective intracortical interactions between neurons at locations $\vec{x}$ and $\vec{y}$. This function couples the receptive field development of cortical neurons and thus induces correlations between the receptive field properties of neighboring cells (i.e., cortical map development).

• $C^{ij}_{\vec{\alpha},\vec{\beta}}$ are the two-point correlation functions of geniculate activities at $(i, \vec{\alpha})$ and $(j, \vec{\beta})$. These functions determine the structural properties of the developing receptive fields—whether ON/OFF subfields or eye dominance emerge.

• The last two terms are necessary to constrain the growth of synaptic weights in Hebbian models. Weights may be normalized by multiplicatively scaling them or by subtracting a constant. Multiplicative (M) constraints are implemented by a nonzero time-dependent $\gamma$; subtractive (S) constraints require a nonzero time-dependent $\epsilon$ (Miller & MacKay, 1994).

2 Analysis of the Model

We make two biologically motivated assumptions to characterize analytically the conditions under which OR and OD develop in this model framework. The first generally used assumption is symmetry of the activity correlations of ON and OFF cells and of the left and right eye. Symmetry leaves only four independent components of the correlation matrix $\{C^{ij}\}$:

$$\begin{aligned}
C^1 &\equiv C^{11} = C^{22} = C^{33} = C^{44} && \text{same eye, same cell type} \\
C^2 &\equiv C^{12} = C^{21} = C^{34} = C^{43} && \text{same eye, different cell type} \\
C^3 &\equiv C^{13} = C^{24} = C^{31} = C^{42} && \text{opposite eye, same cell type} \\
C^4 &\equiv C^{14} = C^{23} = C^{32} = C^{41} && \text{opposite eye, different cell type.}
\end{aligned} \tag{2.1}$$

Under this assumption, equation 1.1 decouples into four independent differential equations for the development of the modes shown in Table 1. We conclude that if development proceeds purely linearly without bound, the ratio of OR and OD becomes arbitrarily large or small. Thus, either ON/OFF segregation or OD segregation ultimately dominates. Erwin and Miller (1995) independently obtained this linear system decomposition.
Table 1: Modes of Development.

$S^{SUM}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} + S^2_{\vec{\alpha}\vec{x}} + S^3_{\vec{\alpha}\vec{x}} + S^4_{\vec{\alpha}\vec{x}}$;  competition: none;  property: unspecific
$S^{OD}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} + S^2_{\vec{\alpha}\vec{x}} - S^3_{\vec{\alpha}\vec{x}} - S^4_{\vec{\alpha}\vec{x}}$;  competition: left eye/right eye;  property: OD
$S^{OR+}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} - S^2_{\vec{\alpha}\vec{x}} + S^3_{\vec{\alpha}\vec{x}} - S^4_{\vec{\alpha}\vec{x}}$;  competition: ON/OFF;  property: OR+
$S^{OR-}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} - S^2_{\vec{\alpha}\vec{x}} - S^3_{\vec{\alpha}\vec{x}} + S^4_{\vec{\alpha}\vec{x}}$;  competition: ON/OFF;  property: OR−

Note: The superscripts 1 through 4 order the afferents as left-eye-ON, left-eye-OFF, right-eye-ON, and right-eye-OFF. The images in the original table show typical cortical receptive fields (arbor diameter $A_\varnothing = 13$ grid points) for each mode, for the left and right eye. For unspecific and OD, the sum of the ON and OFF afferents is shown, whereas for OR+ and OR− their difference is shown (gray = 0).
The second assumption is translation invariance due to the locally homogeneous structure of the neuronal layers:

$$A_{\vec{\alpha},\vec{x}} = A(\vec{\alpha} - \vec{x}), \qquad I_{\vec{x},\vec{y}} = I(\vec{x} - \vec{y}), \qquad C^{ij}_{\vec{\alpha},\vec{\beta}} = C^{ij}(\vec{\alpha} - \vec{\beta}). \tag{2.2}$$

Fourier transformation of equation 1.1 then yields

$$\frac{d}{dt} \hat{S}^i_{\vec{w},\vec{k}}(t) = \hat{A}_{\vec{w},\vec{k}} * \sum_j \hat{I}_{\vec{k}} \hat{C}^{ij}_{\vec{w}} \hat{S}^j_{\vec{w},\vec{k}}(t) - \gamma \hat{S}^i_{\vec{w},\vec{k}}(t) - \epsilon \hat{A}_{\vec{w},\vec{k}}, \tag{2.3}$$

where $\hat{\ }$ denotes the Fourier transform of the corresponding quantity and $*$ is the convolution operator. Diagonalization of $\hat{I}\hat{C}^{ij}$ yields a system that can easily be solved for its modes in the case of infinite arbors ($A_{\vec{\alpha},\vec{x}} = 1$). We obtain eigenvalues $\lambda^{SUM}_{\vec{w},\vec{k}}$, $\lambda^{OD}_{\vec{w},\vec{k}}$, $\lambda^{OR+}_{\vec{w},\vec{k}}$, and $\lambda^{OR-}_{\vec{w},\vec{k}}$, one corresponding to each mode in the table (Erwin & Miller, 1995; Piepenbrock, Ritter, & Obermayer, 1996). These eigenvalues depend on the input correlation functions $\hat{C}^{ij}_{\vec{w},\vec{k}}$ and the cortical interaction function $\hat{I}_{\vec{k}}$. The modes with the strongest eigenvalues will dominate the development process. The eigenvalue analysis for $A_{\vec{\alpha},\vec{x}} = 1$ gives a good approximation of a realistic system's behavior, since a synaptic density function $A$ with a finite radius mainly results in a smoothing in the spatial frequency domain due to the convolution but does not basically alter the principal mode. Based on these two assumptions, we can now derive conditions on the eigenvalues and, thereby indirectly, on the geniculate correlation functions and activity patterns. Most cortical receptive fields show ocular dominance, but they are unstructured with respect to input signals from the left and right eye (they have no left- or right-eye subfields). Consequently, for OD
development $\lambda^{OD}_{\vec{w},\vec{k}}$ must have its maximum at spatial frequency $\vec{w} = 0$ (Miller et al., 1989). Orientation selectivity in simple cells, on the other hand, is based on ON and OFF subfields that alternate at a typical spatial frequency within the receptive field (Miller, 1994). Therefore we require $\lambda^{OR+}_{\vec{w},\vec{k}}$ to have its peak at this typical spatial frequency $|\vec{w}|$. Hence, we make the ansatz:

$$C^{OD}_{\vec{\alpha}} = C^1_{\vec{\alpha}} + C^2_{\vec{\alpha}} - C^3_{\vec{\alpha}} - C^4_{\vec{\alpha}} = \frac{1}{\varphi^2} e^{-\vec{\alpha}^2/\varphi^2}$$

$$C^{OR+}_{\vec{\alpha}} = C^1_{\vec{\alpha}} - C^2_{\vec{\alpha}} + C^3_{\vec{\alpha}} - C^4_{\vec{\alpha}} = c\,\xi \left( \frac{1}{\sigma^2} e^{-\vec{\alpha}^2/\sigma^2} - \frac{1}{\sigma^2\nu^2} e^{-\vec{\alpha}^2/(\nu\sigma)^2} \right) \tag{2.4}$$

$$C^1_{\vec{\alpha}} = -C^4_{\vec{\alpha}}; \qquad C^2_{\vec{\alpha}} = -C^3_{\vec{\alpha}}$$

and we obtain

$$\lambda^{OD}_{\vec{w},\vec{k}} = \hat{I}_{\vec{k}} \hat{C}^{OD}_{\vec{w}}; \qquad \lambda^{OR+}_{\vec{w},\vec{k}} = \hat{I}_{\vec{k}} \hat{C}^{OR+}_{\vec{w}}; \qquad \lambda^{SUM}_{\vec{w},\vec{k}} = \lambda^{OR-}_{\vec{w},\vec{k}} = 0. \tag{2.5}$$
The parameters $\varphi$, $\sigma$, and $\nu$ denote the widths of the activity correlation functions, $\xi$ denotes the ratio of $\lambda^{OR+}$ and $\lambda^{OD}$, and $c$ is a normalization constant. This choice of $\lambda^{OR+}_{\vec{w},\vec{k}}$ has a maximum at spatial ON/OFF wavelength $\pi\sigma\sqrt{(\nu^2 - 1)/(2\ln\nu)}$ for any orientation. Preference for a particular orientation is then a result of symmetry breaking due to the localized arbor function. OR may develop in our model framework driven by either the OR+ or the OR− mode. For OR− the receptive fields of binocular cortical cells have ON and OFF subfields exchanged in the left and right eye (see Table 1), but the resulting orientation maps are equivalent in both cases. Thus we consider only OR+ development.

In a linear symmetric CBL model, OR and OD develop independently of each other. Equally strong development of OR and OD is not possible except at the infinitely sharp phase boundary. The structure of the phase boundary changes, however, as soon as nonlinearities are incorporated. Erwin and Miller (1995) recently presented simulations using appropriate nonlinearities and showed that a finite parameter regime emerges in which OR and OD codevelop. Let us investigate how this parameter regime varies with the choice of nonlinearities. In the purely linear case, the OR/OD phase transition can be expected for equally strong modes $\lambda^{OD}_{\vec{w},\vec{k}}$ and $\lambda^{OR+}_{\vec{w},\vec{k}}$, and we choose the parameters $c$, $\xi$ in equation 2.4 such that both eigenvalues have equal maxima for $\xi = 1$. This is the case for $c = \nu^{2/(\nu^2-1)}/(1 - \nu^{-2})$; that is, $\xi$ represents the ratio of the maxima of $\lambda^{OR+}$ and $\lambda^{OD}$. This type of correlation
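The stated wavelength follows from the two-dimensional Fourier transform $\int e^{-\vec{\alpha}^2/s^2} e^{-i\vec{w}\cdot\vec{\alpha}}\, d^2\alpha = \pi s^2 e^{-\vec{w}^2 s^2/4}$ applied to the difference of Gaussians in equation 2.4:

$$\hat{C}^{OR+}_{\vec{w}} \propto e^{-\vec{w}^2 \sigma^2/4} - e^{-\vec{w}^2 (\nu\sigma)^2/4}, \qquad \frac{d \hat{C}^{OR+}_{\vec{w}}}{d(\vec{w}^2)} = 0 \;\Rightarrow\; \vec{w}^2 = \frac{8 \ln \nu}{\sigma^2(\nu^2 - 1)},$$

which corresponds to the wavelength $2\pi/|\vec{w}| = \pi\sigma\sqrt{(\nu^2 - 1)/(2 \ln \nu)}$, independent of the orientation of $\vec{w}$.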
structure allows us to analyze the mode mixing of OR and OD in a simple setting. For $\xi \ll 1$ all neurons become monocular, and we expect the phase boundary at $\xi = 1$; for $\xi \gg 1$, binocular orientation-selective receptive fields develop.

We may biologically interpret the meaning of this eigenvalue structure. The development of OD requires positive activity correlations $C^{OD}$ between neurons (averaged over ON and OFF) that decrease with distance within one eye (Miller et al., 1989). OR, however, can develop only if the activities of ON and OFF neurons differ and $C^{OR+} \propto C^1 - C^2$ forms a Mexican hat–like correlation structure (Miller, 1994). Our assumption that $\lambda^{SUM}$ and $\lambda^{OR-}$ equal zero is biologically not very realistic. It means that the activity correlations within the eye ($C^1_{\vec{\alpha}}$ and $C^2_{\vec{\alpha}}$) have to be of the opposite sign of the correlations between the eyes ($C^4_{\vec{\alpha}}$ and $C^3_{\vec{\alpha}}$) (see equation 2.5). Such correlation functions may be introduced by inhibitory connections between the geniculate layers but are unlikely after birth and eye opening, when the left and right eye activities should be correlated. The assumption is not essential for a realistic development of receptive fields, however. Even for nonzero $\lambda^{SUM}$ and $\lambda^{OR-}$, the outcome of the development is basically unchanged as long as these modes do not become the leading modes and govern the development process (Erwin & Miller, 1995; Piepenbrock et al., 1996).

3 Simulations

To study the transition zone between OD and OR development, we start with a reduced model setup and add mechanisms one by one to analyze the effects of different nonlinearities separately. We systematically vary the following components of the model:

• Geniculate (lateral geniculate nucleus, LGN) neurons project to the cortex in topographic order. This projection is enforced by the synaptic density function $A_{\vec{\alpha},\vec{x}}$. In our implementation, it is simply chosen as 1 for $|\vec{x} - \vec{\alpha}| < \frac{1}{2}A_\varnothing$ and 0 else. To verify the approximation from equation 2.3, we also simulated a fully connected $A_{\vec{\alpha},\vec{x}} = 1$ network.

• Cortical neurons interact with each other. When they effectively excite each other at short distances and inhibit each other at larger distances, typical cortical maps develop for OD (with OD bands) and OR (with pinwheels). $I$ is given by

$$I_{\vec{x},\vec{y}} = e^{-(\vec{x} - \vec{y})^2/\mu^2} - \frac{1}{\zeta^2} e^{-(\vec{x} - \vec{y})^2/(\zeta\mu)^2} + \delta_{\vec{x},\vec{y}}. \tag{3.1}$$
We compared this with simple independent cortical neurons ($I_{\vec{x},\vec{y}} = \delta_{\vec{x},\vec{y}}$) to study the influence of cortical interaction on orientation selectivity.
• Synaptic weights should be positive or negative. Weights that become zero in our model have to be clipped by a hard nonlinear constraint. To study the behavior of the linear model, we also consider synaptic weights that can change their sign during development.

• Hebbian learning models require synaptic weight normalization. We use hard constraints that normalize the total weight at every time step and do not consider constraints over the projective field of LGN neurons. Subtractive and multiplicative constraints that keep the total synaptic weight constant are called S1 and M1, respectively. A multiplicative M2 constraint limits the total squared synaptic weight (Miller & MacKay, 1994). We consider two normalization approaches: total synaptic strength is constrained (1) for each cortical neuron separately or (2) over the whole cortex, a biologically less realistic model that (together with a multiplicative constraint) uniformly scales all synaptic weights but does not qualitatively alter the behavior of the unconstrained model.

All simulations of cortical OR and OD development are run with cortical neurons arranged on a 32 × 32 grid with periodic boundary conditions. The random initial weights are identical for all simulations ($A_{\vec{\alpha},\vec{x}}$ ± 20 percent uniformly distributed noise), and parameters are $A_\varnothing = 13$, $\varphi = 7.1$, $\sigma = 2.1$, $\nu = 3$, $\mu = 2$, and $\zeta = 3$. At each iteration step we (1) compute the unconstrained derivatives and (2) apply an S1 constraint (if desired); then we (3) add the derivatives to the synaptic weights $S^i_{\vec{\alpha},\vec{x}}$, (4) clip the weights (if desired), and (5) normalize the weights multiplicatively (using M1 or M2 constraints). The S1 constraint applies only to unclipped synapses and keeps the total synaptic weight constant. Nevertheless, some weights might get clipped in step 4, and a multiplicative M1 renormalization (step 5) becomes necessary (this is the update scheme from Miller, 1994). The update step size is chosen at the first iteration to change the initial weights by 1 percent on average. The simulations run for 1000 update steps, or until 95 percent of the synaptic weights reach the limits. The receptive fields are usually fully developed after 100 to 200 steps (average weight change less than 1/1000 percent relative to initial weights), and test simulations with up to 5000 steps near the phase boundaries did not show any further change in the synaptic strengths.
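This iteration can be summarized in a short sketch (ours, not the authors' code; the unconstrained derivative of step 1 is passed in precomputed, and the array layout is an illustrative assumption):

```python
import numpy as np

def update_step(S, dS, w_max, S_total):
    """One iteration of the constrained update, steps 1-5 of the text.

    S       : synaptic weights, shape (4, n_geniculate, n_cortex)
    dS      : unconstrained derivatives of equation 1.1 (step 1), same shape
    S_total : target total synaptic weight per cortical cell, shape (n_cortex,)
    """
    free = (S > 0.0) & (S < w_max)                   # synapses not saturated at a limit
    for x in range(S.shape[2]):                      # step 2: S1 constraint -- remove the
        f = free[:, :, x]                            # mean growth over unclipped synapses
        if f.any():
            dS[:, :, x][f] -= dS[:, :, x][f].mean()
    S = S + dS                                       # step 3: add the derivatives
    S = np.clip(S, 0.0, w_max)                       # step 4: clip at zero and at w_max
    for x in range(S.shape[2]):                      # step 5: M1 renormalization restores
        S[:, :, x] *= S_total[x] / S[:, :, x].sum()  # each cell's total synaptic weight
    return S
```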
For each map we computed three different values: OD, OR, and OS. The OD value is the root mean square (r.m.s.) eye dominance (Erwin & Miller, 1995), and the OR value is the r.m.s. ON/OFF subfield segregation value of the cortical neurons. Orientation-selective neurons need not only segregated ON and OFF subfields (high OR value) but also an arrangement of those subfields that yields orientation-specific responses (high OS value). The degree of orientation tuning is measured by the mean cortical orientation specificity OS (in the spirit of Miller, 1994):

$$OD = \sqrt{\frac{1}{N} \sum_{\vec{x}} \left( \frac{\sum_{\vec{\alpha}} S^{OD}_{\vec{\alpha},\vec{x}}}{\sum_{\vec{\alpha}} |S^{SUM}_{\vec{\alpha},\vec{x}}|} \right)^2} \tag{3.2}$$

$$OR = \sqrt{\frac{1}{N} \sum_{\vec{x}} \left( \frac{\sum_{\vec{\alpha}} S^{OR+}_{\vec{\alpha},\vec{x}}}{\sum_{\vec{\alpha}} |S^{SUM}_{\vec{\alpha},\vec{x}}|} \right)^2} \tag{3.3}$$

$$OS = \frac{1}{N} \sum_{\vec{x}} \frac{\sqrt{\left( \sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}| \sin 2\varphi_{\vec{w}} \right)^2 + \left( \sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}| \cos 2\varphi_{\vec{w}} \right)^2}}{\sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}|}. \tag{3.4}$$

The OS value represents the sharpness of the neurons' orientation tuning curves, where each neuron is stimulated with wave patterns of different orientations $\varphi_{\vec{w}}$ at different spatial frequencies $\omega$ (which results in wave vectors $\vec{w} = \omega[\cos\varphi_{\vec{w}}, \sin\varphi_{\vec{w}}]^T$). The response to different wave patterns $\vec{w}$ in the linear model is given by the power spectrum of $\hat{S}^{OR+}_{\vec{w},\vec{x}}$, Fourier transformed along the LGN dimensions. OS values may range from 0 for cortical maps without any orientation preference up to OS = 1 for neurons that respond to exactly one orientation. Realistic values are generally 0.05 for no OS and about 0.1 to 0.3 for well-orientation-tuned neurons. The OS value alone is not sufficient as a measure of cortical orientation selectivity because it measures the width of the orientation tuning curve irrespective of its amplitude. For our simulation data, the orientation selectivity may be computed even if the cell's orientation response rate (OR value) is almost zero. This is biologically unrealistic, and therefore we report the OS values only to show that the simulation results with high OR value actually yield orientation-selective cells. In a biologically more realistic definition, OR and OS would have to be determined for each eye separately, weighted by the neurons' eye dominance values. This, however, is not necessary in our simulations because our model yields only cortical maps with identical orientation patterns in both eyes.
[Figure 1 comprises ten phase-diagram panels, plotting OD, OR, and OS against $\xi$ (abscissa from 0.1 to 10), for five model variants in rows: (1) no interaction, weights +/−, infinite arbor, M2 constraint, no weight limit; (2) weights +/−, local arbor, M2 constraint, no weight limit; (3) weights +, local arbor, M1 constraint, no upper weight limit; (4) weights +, local arbor, S1 constraint, upper weight limit; (5) weights +, local arbor, S1 constraint, upper weight limit, weight freezing. The left column shows results without, the right column with, cortical interaction.]
Figure 1: Phase diagrams of cortical OR and OD development. The parameter on the abscissa is $\xi$ (i.e., the ratio of the maxima of $\lambda^{OR+}_{\vec{w},\vec{k}}$ and $\lambda^{OD}_{\vec{w},\vec{k}}$). For infinite arbors the OD/OR phase transition is at $\xi = 1$. Convolving the localized $\hat{A}$ with $\lambda^{OR+}$ and $\lambda^{OD}$ results in a predicted phase transition at $\xi = 0.73$. The left column represents simulation results where neurons develop independently of each other (no cortical interaction), whereas results with cortical interaction (and the constraint enforced separately for each neuron) are shown on the right side.
if $\xi$ is exactly one. In this fully connected network, however, neurons may develop ocular dominance and ON/OFF receptive subfields, but the cells do not become orientation selective (low OS value) because the receptive fields have a large number of ON and OFF subfields in a random pattern. Orientation selectivity requires a certain ratio between the arbor diameter and the strongest-growing spatial frequency mode. In our case, the arbor function localizes synaptic growth and thus smooths the growing modes in the spatial frequency domain. This results in a shift of the phase boundary from $\xi = 1$ to a lower value. The system is still linear, and a joint development of OD and OR does not occur.

Now we introduce nonlinear mechanisms to study conditions for coupled OR and OD development. When we couple the development of neighboring neurons by introducing a cortical interaction function, we do not necessarily make the system nonlinear. If we enforce the constraint over the whole cortex (biologically not very realistic), the system remains linear, and we obtain graphs very similar to the ones shown in the left column (data not shown). Their orientation specificity, however, is slightly lower than in the case of independent cortical neurons (as has been noted by Miller, 1994) because the interactions force some cells to develop receptive fields that are not ideally tuned for orientation (e.g., around pinwheels). An M1 constraint simply scales all the weights, but if we apply it separately over each cortical receptive field (as for all results shown in Figure 1), then introducing an interaction function makes the system nonlinear. As a result we obtain a finite parameter domain in which OR and OD may develop concurrently.

Synaptic weights should not change their sign in a biologically realistic model. Therefore we have studied models where we clip the weights at zero. We compare the results for zero weight clipping with M1 (a scheme we have used in Piepenbrock et al., 1996) to zero and upper weight clipping with S1 constraints (used in Erwin & Miller, 1995). The two growing modes $S^{OD}$ and $S^{OR+}$ are both zero-sum modes; their growth does not alter the summed synaptic weight over each receptive field. Thus, under both M1 and S1 constraints, synaptic growth is initially unconstrained. Only when some synapses have reached their clipping values does the constraint begin to limit growth (Miller & MacKay, 1994). The only non-zero-sum modes are modes of $S^{SUM}$, so only if $S^{SUM}$ modes had nonzero eigenvalues would the constraint influence the development process from the beginning.

The graphs in the third row show simulation results for only positive synaptic weights. We use an M1 constraint that simply scales the weights (like an M2 constraint). The only significant difference between the second and the third row is the clipping of weights at zero. In the absence of intracortical connections (left column), this creates a finite parameter domain for joint OD/OR development. If the weights are allowed to change their sign, an M2 constraint is necessary to limit synaptic growth (second row). For weight clipping, however, an M1 (third row, left) and an M2 (data not
shown) constraint yield virtually identical results in the case of independent cortical neurons. Both constraints scale the synaptic weights but do not qualitatively alter the outcome of the unconstrained development, and the only nonlinearity is the weight clipping. The last two rows present the simulation results for subtractive constraints. In this case the weights have to be clipped at zero as well as at some maximum value (Miller & MacKay, 1994). Without maximum weight clipping, all but one synaptic weight within the receptive field would decay to zero. The size of the final receptive fields can be influenced by the choice of the maximum value, and we chose eight times the mean initial synaptic weight, as in Miller (1994). The difference between the last two rows is the treatment of synaptic weights that have saturated at zero or at the maximum value. If we limit the synaptic weights but allow them to change again later in the development process, we obtain the graph in the second-last row. The last row results if weights are frozen as soon as they reach the minimum or maximum value, as in Erwin and Miller (1995). OR and OD develop about equally strong near the phase boundary, and the amount of “randomness” in the initial synaptic weights determines how long it takes until the stronger mode begins to dominate the development process. If the weights are frozen at their limit values before this happens, we observe a joint development of OR and OD. On the other hand, if they are only clipped at the limits, the weights may change again later, and the influence of the initial conditions is reduced. This effect yields a slightly larger joint development parameter domain for the frozen synaptic weights in the bottom graphs. The simulations in the bottom-right graph use the same mechanisms as Erwin and Miller (1995) and almost identical parameters.1 It is not surprising that the last graph exhibits the largest parameter domain for joint OR and OD development since the simulation includes several nonlinear mechanisms— a minimum and a maximum synaptic weight value, as well as freezing of the weights that reach the limits. In the results shown in Figure 1, we varied the nonlinear mechanism but did not systematically change all the other model parameters. Our simulations (including simulations using the parameters from Erwin & Miller, 1995), however, indicate that the effects of the nonlinear mechanisms are robust against changes in ϕ, σ , ξ , ζ , and µ, and the parameter domains of joint OR and OD development remain qualitatively unchanged (i.e., the sharp phase transition of the purely linear CBL model changes into a parameter region of joint OR/OD development in a nonlinear model). This region is enlarged when more nonlinear mechanisms are added (as in Figure 1).
1 They report better orientation selectivity for smaller parameters $\sigma$ and for a tapered arbor function $A_{\vec{\alpha},\vec{x}}$. The effects of changes in the correlation functions on OR development are discussed in Miller (1994).
5 Summary

The main result of this note is that OR and OD decouple in the linear symmetric model, and a sharp phase boundary divides the parameter domains of OR and OD. Nonlinear mechanisms may create a transition region near the phase boundary where a joint development of OR and OD is possible. We have identified three such mechanisms: (1) normalizing synaptic weights separately for each cortical neuron with M1/M2 constraints, (2) clipping of synaptic weights, and (3) freezing of synaptic weights. The size of the transition region depends on the nonlinearities present in the model, and the transition zone is broadest if the mechanisms are applied simultaneously.

The basic assumption underlying CBL is that the modeled receptive field features develop early, when the developments of $S^{OD}$, $S^{OR+}$, and $S^{OR-}$ are essentially linear. This is true for $\xi \ll 1$ or $\xi \gg 1$, and all our simulations yield almost identical OD or OR maps. Close to the phase boundary, however, where OD and OR develop concurrently, the nonlinear mechanisms we have studied (M constraints, weight clipping, weight freezing) lead to different outcomes for identical initial conditions. Hence, the final receptive field properties are determined late, at a time when the nonlinearities set in to stabilize the receptive fields.

When discussing the biological significance of a model, it is very important to investigate carefully the effects of the different mechanisms involved. CBL models require constraints to limit synaptic growth, and for this reason we tested different types of constraints in our simulations. These constraints are only clumsy approximations of the real biological mechanisms, but the simulations give hints about which aspect of the constraining mechanism may be most important for the joint development of OR and OD. Multiplicative constraints implement synaptic decay that is proportional to the synaptic strength, whereas under subtractive constraints, all synapses decay at the same rate, irrespective of their strength. Furthermore, subtractive constraints require weight clipping not only at zero but also at some maximum value (Miller & MacKay, 1994). It is plausible that such constraints exist in biological neurons (Miller & MacKay, 1994), but their underlying mechanisms have not been identified yet. The simulation results show that the size of the parameter domain for joint OD/OR development depends more on weight clipping or freezing and on separate constraints for each neuron than on the choice of M1 or S1 constraints. Furthermore, this parameter domain is enlarged if different nonlinear mechanisms act together.

Note

After part of this work was completed, we learned that Erwin and Miller (1995) independently derived the diagonalization of equation 1.1. We thank Ken Miller for his review and discussions that helped to improve this paper.
Acknowledgments

This research was funded in part by DFG (grant Ob 102/2-1) and a scholarship from the Boehringer Ingelheim Fonds to C. P. Computing time (CM5) was made available by NCSA and HLRZ Jülich.

References

Erwin, E., & Miller, K. D. (1995). Modeling joint development of ocular dominance and orientation in primary visual cortex. In Proc. of the CNS-95.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation 7:425–468.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation selective cells. Proc. Natl. Acad. Sci. USA 83:8390–8394.
Miller, K. D. (1990). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Computation 2:321–333.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangements of orientation columns through activity-dependent competition between ON- and OFF-center inputs. Journal of Neuroscience 14:409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science 245:605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp. 6:100–126.
Miyashita, M., & Tanaka, S. (1991). A mathematical model for the self-organization of orientation columns in visual cortex. Neuroreport 3:69–72.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45:7568–7589.
Piepenbrock, C., Ritter, H., & Obermayer, K. (1996). Cortical map development driven by spontaneous retinal activity waves. In Artificial Neural Networks—ICANN 96. New York: Springer-Verlag.
Received September 15, 1995; accepted November 26, 1996.
NOTE
Communicated by Michael Shadlen
Physiological Gain Leads to High ISI Variability in a Simple Model of a Cortical Regular Spiking Cell Todd W. Troyer Keck Center for Integrative Neuroscience, Department of Physiology, University of California, San Francisco, San Francisco, CA 94143 USA
Kenneth D. Miller Keck Center for Integrative Neuroscience and Sloan Center for Theoretical Neurobiology, Departments of Physiology and Otolaryngology, University of California, San Francisco, San Francisco, CA 94143 USA
To understand the interspike interval (ISI) variability displayed by visual cortical neurons (Softky & Koch, 1993), it is critical to examine the dynamics of their neuronal integration, as well as the variability in their synaptic input current. Most previous models have focused on the latter factor. We match a simple integrate-and-fire model to the experimentally measured integrative properties of cortical regular spiking cells (McCormick, Connors, Lighthall, & Prince, 1985). After setting RC parameters, the postspike voltage reset is set to match experimental measurements of neuronal gain (obtained from in vitro plots of firing frequency versus injected current). Examination of the resulting model leads to an intuitive picture of neuronal integration that unifies the seemingly contradictory $1/\sqrt{N}$ and random walk pictures that have previously been proposed. When ISIs are dominated by postspike recovery, $1/\sqrt{N}$ arguments hold and spiking is regular; after the "memory" of the last spike becomes negligible, spike threshold crossing is caused by input variance around a steady state and spiking is Poisson. In integrate-and-fire neurons matched to cortical cell physiology, steady-state behavior is predominant, and ISIs are highly variable at all physiological firing rates and for a wide range of inhibitory and excitatory inputs.

1 Introduction

Softky and Koch (1993) have shown that the spike trains of cortical cells in visual areas V1 and MT display a high degree of variability, as measured by their interspike interval (ISI) distributions at fixed mean firing rates. Does this result require revising the classical notion of information transmission in cortical neurons? By "classical notion," we mean the view that neurons integrate many synaptic inputs until reaching some voltage threshold, and at that point produce a spike. Spikes are then followed by a refractory period,
during which the cell is less likely to produce another spike. If inputs are uncorrelated, the mean spike rate of such a model neuron depends on the average frequency of presynaptic events and thus can embody the standard notion of rate coding. Softky and Koch (1993) argued that high ISI variability is inconsistent with a neuron's acting as an EPSP integrator. The integration of many uncorrelated excitatory synaptic potentials (EPSPs) should result in a time to threshold (or ISI) that is very regular, since the neuron averages over many random events. One measure of spike variability is the coefficient of variation (CV) of the ISI distribution, defined as the standard deviation divided by the mean. For an EPSP integrator, this should be approximately $1/\sqrt{N}$, where $N$ is the number of events necessary to reach threshold. In contrast, cortical cells have CVs in the range .5 to 1, more closely resembling a random (i.e., Poisson) process for which CV = 1. As an alternative to simple integration, Softky and Koch (1993) proposed that active dendritic conductances may act to amplify submillisecond coincidences in synaptic input, leading to large random pulses of synaptic current. Thus, they suggested that high ISI variability may be more consistent with the notion (Abeles, 1982) of neurons acting as "coincidence detectors" rather than "rate encoders." Shadlen and Newsome (1994), following Gerstein and Mandelbrot (1964), argued that if uncorrelated synaptic inputs to a cell consist of balanced excitation and inhibition, then the membrane voltage follows a random walk, leading to ISIs that are quite variable. This balanced inhibition model is consistent with the classical notion of synaptic integration and rate coding but requires that excitation and inhibition be tightly balanced in order to achieve ISI variability consistent with the data.

These arguments can be more clearly understood by considering spike production as a two-step process. First, synaptic inputs are integrated by an extensive and complex dendritic tree, resulting in a total synaptic current I(t). Second, the cell produces spikes in response to this synaptic current. Thus, we expect that spike variability should depend on two factors: the sensitivity of the spike mechanism to changes in the synaptic current and the variability in this input current. Previous discussions of ISI variability have largely focused on the second term, that is, under what conditions sufficient input variability can be achieved. Thus, Softky and Koch (1993) argued that input current variability was too low to account for output spiking variability, while Shadlen and Newsome (1994) argued that a balance of excitation and inhibition could yield high input variability. Other models demonstrated that network dynamics can lead to correlations among synaptic inputs sufficient to yield high variability (Usher, Stemmler, Koch, & Olami, 1994; Hansel & Sompolinsky, 1996) (afferent inputs also have significant correlations; Alonso, Usrey, & Reid, 1996). However, these investigations generally did not address the sensitivity of cortical neurons. Neuronal sensitivity has been addressed by Bell, Mainen, Tsodyks, and Sejnowski (1995), who considered some of the
same factors as our work in the context of a Hodgkin-Huxley model.1 Our work differs in considering simple integrate-and-fire dynamics focusing on the roles of gain and refractoriness and, most important, in matching model parameters to experimental data from cortical neurons.

For the purposes of this article, we define neuronal gain as the slope of a plot of firing frequency $f = 1/\mathrm{ISI}$ versus the (constant) level of injected current $I$. Although there is no direct relationship between this measure and neuronal sensitivity under physiological conditions—neuronal gain is measured in the absence of input variability—we will argue that the two should be well correlated. Here we demonstrate that matching a simple integrate-and-fire model to experimentally measured values of neuronal gain can largely solve the conundrum of high ISI variability in visual cortical cells with many random synaptic inputs. A careful examination of this model leads to a relatively simple, intuitive picture of cortical neuronal integration that unifies the seemingly contradictory $1/\sqrt{N}$ and random walk pictures that have been previously proposed. A preliminary report of this study has appeared as an abstract (Troyer & Miller, 1995).

2 The High-Gain Model

In an integrate-and-fire model, the slower, integrative properties of neurons are modeled as passive changes in subthreshold membrane voltage. The fast spiking conductances are lumped into a stereotyped event that is "pasted onto" the voltage trace when the cell reaches threshold. After spiking, there is an absolute refractory period during which the cell cannot spike, followed by a relative refractory period during which the cell's ability to spike is reduced. We will use refractoriness to refer to the combined effect of all processes occurring after a spike that make a cell less likely to respond to a given pattern of input current with a subsequent spike. Refractoriness can be defined quantitatively as a function $r$ of the time $t$ after a given spike as follows. Suppose a just-threshold DC current is applied at or before the spike time, $t = 0$. The refractoriness, $r(t)$, is the magnitude of a brief current pulse that, superimposed on this ongoing DC current at time $t$, is just sufficient to elicit a spike.2 Note that this definition includes all factors that contribute to the cell's refractoriness, including sodium channel inactivation and the effects of other active conductances, as well as spike after-hyperpolarization.3

1 Bell et al. (1995) and Wilbur and Rinzel (1983) also investigated bistability in neuronal dynamics as a mechanism that could contribute to ISI variability.

2 Assume a spike occurs at $t = 0$, and that after an absolute refractory period $t_{refract}$ the voltage is reset to $V_{reset}$. Let $\tau$ be the membrane time constant, $g_{leak}$ the leak conductance, and $\Delta t$ the duration of the brief current pulse. Beginning from $V_{reset}$ at time $t = t_{refract}$, let $I_{t_0}$ be the DC current that leads to a spike at time $t_0$ (so $I_\infty$ is the just-threshold DC current). Let $V_{t_0}(t)$ be the time course of the voltage given $I_{t_0}$. Then $r(t) \equiv \frac{\tau g_{leak}}{\Delta t}\,(V_t(t) - V_\infty(t)) = (I_t - I_\infty)(1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t}$.
In the simplest models, refractory effects are modeled by waiting for a fixed, absolute refractory period after spike onset and then resetting the voltage to a value significantly below threshold. Thus, a single parameter, the depth of the after-spike voltage reset, serves to model all of the elements contributing to the cell's relative refractoriness.4 More realistic models of the complex events following a cortical spike may deepen our understanding of cortical integration, but such models are experimentally poorly constrained and can lose the clarity motivating the use of simple models. We use a simple conductance-based integrate-and-fire model:

$$C \frac{\partial V}{\partial t} = g_{leak}(V_{rest} - V) + I. \tag{2.1}$$
Parameters were matched to slice recordings of regular spiking cells (values from McCormick et al., 1985, noted in parentheses): $V_{rest} = -74$ ($-73.6 \pm 1.5$) mV; $1/g_{leak} = 40$ ($39.9 \pm 21.2$) MΩ; $C = \tau g_{leak}$, where $\tau = 20$ ($20.2 \pm 14.6$) msec.5 The absolute refractory period was taken to be the spike width, $t_{spike} = 1.75$ ($1.74 \pm .41$) msec. Spike threshold $V_{thresh}$ was set 20 mV above $V_{rest}$: $V_{thresh} = -54$ mV. This leaves the after-spike reset voltage, $V_{reset}$, as the only free parameter. This parameter is commonly set equal to $V_{rest}$, but this choice lacks physiological justification. We set $V_{reset}$ to match the model's f-I curve to that observed physiologically, as reported in McCormick et al. (1985) (see Figure 1A). That is, we set $V_{reset}$ to match the physiologically observed neuronal gain.6 This yields $V_{reset} = -60$ mV, 6 mV below threshold.

Using the f-I curve to set $V_{reset}$ points to the strong connection between refractoriness and neuronal gain. Once $\tau$ and $g_{leak}$ are determined from biophysical measurements, neuronal gain is determined by the cell's refractory behavior. A cell that is more refractory requires more input current to spike at any given frequency, resulting in an inverse relationship between refractoriness and neuronal gain.7
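For DC current the model's f-I curve has a closed form: from $V_{reset}$ the voltage relaxes toward $V_\infty = V_{rest} + I/g_{leak}$ with time constant $\tau$, so the ISI is $t_{spike} + \tau \ln[(V_\infty - V_{reset})/(V_\infty - V_{thresh})]$. The following sketch of the fitting logic is ours (the search loop, step sizes, and the 100 Hz operating point are illustrative choices):

```python
import numpy as np

V_rest, V_thresh, tau, t_spike = -74.0, -54.0, 20.0, 1.75   # mV, mV, ms, ms
g_leak = 1.0 / 40.0                                         # 1/(40 MOhm), in uS

def firing_rate(I, V_reset):
    """f-I curve of the integrate-and-fire model for DC current I (nA)."""
    V_inf = V_rest + I / g_leak                             # asymptotic voltage (mV)
    if V_inf <= V_thresh:
        return 0.0                                          # never reaches threshold
    isi = t_spike + tau * np.log((V_inf - V_reset) / (V_inf - V_thresh))
    return 1000.0 / isi                                     # spikes/s (ms -> s)

def gain_near_100hz(V_reset, dI=1e-3):
    I = 0.5
    while firing_rate(I, V_reset) < 100.0:                  # find the ~100 Hz point
        I += 0.01
    return (firing_rate(I + dI, V_reset)
            - firing_rate(I - dI, V_reset)) / (2 * dI)      # slope in Hz/nA

for V_reset in (-74.0, -60.0):                              # deep vs. shallow reset
    print(V_reset, gain_near_100hz(V_reset))                # shallower reset: higher gain
```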
Figure 1: (A) f-I plots for our high-gain model (solid line) and experiment (dashed line). For comparison, the dotted line represents the more common low-gain model, in which Vreset is set to Vrest . Experimental data shown are piecewise linear fits to data for first interspike interval from three cells (McCormick et al. 1985, Fig. 1). (B) ISI histogram from 10 sec of model data: λex = 7015 Hz; λin = 3263 Hz (R = .75); average firing rate = 98.4 Hz; CV = .6. Inset shows ISIs > 20 msec. (C) Voltage trace of first 200 msec of simulation in (B).
Similarly, greater refractoriness implies weaker sensitivity to physiological inputs. This explains why we connect high gain measured with constant currents to high sensitivity under physiological conditions: for a given membrane time constant and input resistance, weaker refractoriness implies both higher gain and higher physiological sensitivity.

Synaptic input took the form of Poisson-distributed delta function conductance changes, $I_{syn}(t) = \sum_k g_{syn}\,\delta(t - t_k)(V_{syn} - V)$, where $t_k$ denotes the arrival time of the $k$th presynaptic spike. $g_{syn}$ represents the total conductance, integrated over the time course of the synaptic event, and thus has units nS msec. $g_{syn}$ was taken from an exponential distribution with mean $\bar{g}_{syn}$; values greater than $4\bar{g}_{syn}$ were then reset to this maximum value.8 For excitatory synapses, $V_{ex} = 0$ mV and $\bar{g}_{ex} = 3.4$ nS msec. This yields EPSPs at rest of mean .49 mV and maximum 2 mV. For inhibitory synapses, $V_{in} = -70$ mV and $\bar{g}_{in} = 22.8$ nS msec. This yields inhibitory postsynaptic potentials (IPSPs) that are twice as large as EPSPs at threshold. The amount of inhibition was expressed as the ratio R of the mean inhibitory current to the mean excitatory current at threshold:

$$R = \frac{\lambda_{in}\,\bar{g}_{in}\,|V_{in} - V_{thresh}|}{\lambda_{ex}\,\bar{g}_{ex}\,|V_{ex} - V_{thresh}|}. \tag{2.2}$$

3 After-hyperpolarization—the voltage state of the cell—is often regarded as affecting integration rather than refractoriness, with "refractoriness" reserved for processes that alter the cell's spike responses at a given voltage, such as spike threshold elevation. When considering a cell's reduced responsiveness to input currents, however, it is most useful to combine all factors contributing to the reduction.

4 Alternatively, one may consider a two- (or three-) parameter model of refractoriness in which instead of (or in addition to) postspike voltage reset, there is a postspike elevation of threshold that decays at some rate (reviewed in Tuckwell, 1988, Table 3.1).

5 The term regular spiking denotes the class of cortical cells that respond to constant current injection in vitro with single-spike firing and spike-rate adaptation (McCormick et al., 1985). These constitute most cortical excitatory cells. The term does not imply regularity of firing in vivo.

6 Matching the slope of the model f-I curve at 100 Hz to the average gain reported by McCormick et al. (1985) for cortical regular firing cells gives $V_{reset} = -59.75$ mV.

7 Using the definitions of note 2, $r(t) = (I_t - I_\infty)(1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t}$. Writing gain $G(f)$ as a function of firing frequency $f$ ($G(f)$ is the slope of the f-I curve at $f$), this yields $r(t) = (1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t} \int_{f=0}^{f=1/t} df / G(f)$. Thus, high gain corresponds to low refractoriness.
Here $\lambda$ expresses the mean rate of Poisson input. Note that R = 1 implies that when voltage is clamped at $V_{thresh}$, the total synaptic current has mean zero; R < 1 corresponds to a surplus of excitatory synaptic current at threshold, while R > 1 corresponds to a surplus of synaptic inhibition. The leak current adds hyperpolarizing current. Estimates for $\lambda_{ex}$ and $\lambda_{in}$ rely on counting arguments only weakly constrained by experimental data (Shadlen & Newsome, 1994). Thus, to examine how model behavior depends on $\lambda_{ex}$ and $\lambda_{in}$, we fix R and then covary $\lambda_{ex}$ and $\lambda_{in}$ to yield varying output firing rates for that R. We have considered values of R ranging from 0 to 1.25; we will take R = .75 to be a reasonable guess for the ratio of inhibition to excitation in cortex.

Balanced inhibition models (Gerstein & Mandelbrot, 1964; Shadlen & Newsome, 1994) commonly assume equally likely positive and negative voltage steps, yielding a random walk in voltage. In a conductance-based model, this translates into a balance of inhibition and excitation sufficient to give a subthreshold mean current (so that there is not a steady drift up to near $V_{thresh}$). We will roughly equate balanced inhibition with R ≥ 1 (although smaller values are conceivable, since even for R = 1 the leak current maintains the mean voltage somewhat below threshold).9

3 Simulation Results

The results of a typical simulation (R = .75) are shown in Figure 1. Postsynaptic average firing rate was 98 Hz for 10 sec of simulated data. We measured variability by taking the CV of the ISI histogram, and found CV = .6, which falls in the physiological range of .5 to 1 reported in Softky and Koch (1993).

8 Thus the mean of the final distribution for $g_{syn}$ is $(1 - e^{-4})\,\bar{g}_{syn}$.

9 The exact distance below threshold depends on the relative magnitudes of the leak and mean synaptic conductances. In the limit of high input rates, the leak is negligible, and R = 1 corresponds to a mean synaptic current just sufficient to elicit depolarization to threshold.
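A minimal Euler-step rendering of this simulation is sketched below (ours, not the authors' code; the time step and seed are arbitrary, the absolute refractory period is approximated by suppressing threshold crossings for $t_{spike}$, and each delta conductance event is applied as an instantaneous jump $g_{syn}(V_{syn} - V)/C$):

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.05, 10_000.0                         # ms; 10 s of simulated time
V_rest, V_thresh, V_reset = -74.0, -54.0, -60.0
tau, t_spike = 20.0, 1.75                      # ms
C = tau * (1.0 / 40.0)                         # tau * g_leak = 0.5 nF
V_ex, V_in = 0.0, -70.0                        # reversal potentials (mV)
g_ex, g_in = 3.4e-3, 22.8e-3                   # mean conductances (uS ms)
lam_ex, lam_in = 7015.0, 3263.0                # input rates (Hz), R = .75

def draw_g(mean, n):                           # clipped exponential distribution
    return np.minimum(rng.exponential(mean, n), 4 * mean)

V, t_last, spikes = V_rest, -1e9, []
for step in range(int(T / dt)):
    t = step * dt
    V += dt / tau * (V_rest - V)               # leak
    for lam, g_mean, V_syn in ((lam_ex, g_ex, V_ex), (lam_in, g_in, V_in)):
        n = rng.poisson(lam * dt * 1e-3)       # synaptic events in this bin
        if n:
            V += draw_g(g_mean, n).sum() * (V_syn - V) / C
    if V >= V_thresh and t - t_last >= t_spike:
        spikes.append(t)                       # spike, then reset
        t_last, V = t, V_reset

isi = np.diff(spikes)
print(len(spikes) / (T * 1e-3), isi.std() / isi.mean())   # rate (Hz) and CV;
                                                          # the text reports 98 Hz, CV = .6
```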
What expectations do we have for CV in these simple models? We expect CV to be bounded above by the CV of a Poisson process with dead time equal to the minimum recovery period. With mean ISI $\mu$ and dead time $t_{dead}$, $CV_{max} = (\mu - t_{dead})/\mu$. For the simulation in Figure 1, $t_{dead} = t_{spike}$ gives $CV_{max} = .83$, while taking $t_{dead}$ equal to the shortest interval observed (2.75 msec) yields $CV_{max} = .73$.10 A rough lower bound can be obtained from the case of pure EPSP integration. In our high-gain model, it takes 16 average-sized EPSPs to reach threshold from reset (recall that EPSPs are reduced by three-fourths near threshold). Again accounting for dead time of $t_{dead} = t_{spike}$, a $1/\sqrt{N}$ argument yields $CV = .83/\sqrt{16} \approx .2$. With variable-sized PSPs, this estimate should be adjusted upward by a factor of roughly $\sqrt{2}$, yielding CV ≈ .28 (Softky & Koch, 1993; Stein, 1967). The reason that $1/\sqrt{N}$ arguments fail to estimate CV accurately will be discussed shortly.

We measured the dependence of CV on postsynaptic firing rate by varying input rates while keeping the inhibition-to-excitation ratio fixed at R = .75 (see Figure 2A). Physiologically reasonable levels of CV are obtained across firing rates. Also, CV decreases at higher firing rates, as observed experimentally (Softky & Koch, 1993, Fig. 3). To test the notion that a balance of excitation and inhibition is required to achieve high variability (Shadlen & Newsome, 1994; Bell et al., 1995), we varied the inhibition ratio R and observed ISI variability at input rates that yield postsynaptic firing rates between 95 and 105 Hz (see Figure 2B). To compare more closely with previous models that have used low gain, we also ran simulations with $V_{reset} = V_{rest}$. Balanced inhibition models are those with large R (e.g., R = 1.25),11 while the low-gain model with excitation only (R = 0, marked "O") is similar to the simple EPSP integrator discussed in Softky and Koch (1993). The model with physiological (high) gain gives physiological CV over a broad range of R. In contrast, balanced inhibition (R > 1) is necessary to achieve CV > .5 in the low-gain model.
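The upper bound is immediate to verify: a Poisson process with dead time has ISIs that are exponential variates shifted by $t_{dead}$, so the standard deviation equals $\mu - t_{dead}$ and CV $= (\mu - t_{dead})/\mu$. A short check (ours; the mean ISI of 10 msec matches the roughly 100 Hz operating point):

```python
import numpy as np

rng = np.random.default_rng(2)
t_dead, mu = 1.75, 10.0                        # dead time and mean ISI (ms)
isi = t_dead + rng.exponential(mu - t_dead, 100_000)
print(isi.std() / isi.mean())                  # ~ (mu - t_dead) / mu = 0.825
```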
10 For a Poisson process with dead time, the shortest interval, the dead time, and the peak of the ISI distribution are the same.

11 Shadlen and Newsome (1994) used $V_{reset} = V_{rest}$ and hence considered a low-gain balanced inhibition model.

4 An Intuitive Picture

An intuitive grasp of spike variability in simple integrate-and-fire models can be obtained by roughly dividing the ISI into three regimes (see Figure 3A; see also Abeles, 1991, Chap. 4; Smith, 1992). In the initial refractory regime, the state of the neuron is dominated by the recovery from the previous spike. In this regime $1/\sqrt{N}$ arguments are valid (Smith, 1992), and spiking is regular. In the final, steady-state regime, the memory of the last spike has decayed and random synaptic variation causes the cell to fluctuate about
Figure 2: (A) CV versus output firing rate, for fixed inhibition ratio R = 0.75. Data shown are mean ± standard deviation for 10 simulations (using different Poisson trains of inputs) of 10 sec duration. λex ranged from 3396 Hz to 20,509 Hz, while λin correspondingly varied from 1274 Hz to 7691 Hz. The thin line shows CV for Poisson process with dead time equal to the shortest ISI observed. With increasing firing rate, the refractory period is a greater fraction of a typical ISI. Thus CV decreases with increasing rate. This effect is also seen in the biological data (Softky & Koch, 1993, Fig. 3). (B) CV versus inhibition ratio R for high gain (solid line; Vreset = −60 mV) and low gain (dashed line; Vreset = Vrest ) models. R = 1.25 corresponds to example balanced inhibition models. Low gain with R = 0 corresponds to the EPSP integrator model (marked “O”). At a given level of R, values of λex and λin were found for which the 95 percent confidence interval for the mean spike rate was within the range from 95 to 105 Hz (10 simulations of 10 sec each). Data shown are from 10 subsequent trials at these input rates. λex ranged from 4448 Hz to 19,584 Hz for high gain and from 7500 Hz to 23,261 Hz for low gain; λin varied from 0 Hz to 12,240 Hz (high gain) and from 0 Hz to 14,538 Hz (low gain). The dotted line shows CV = .5.
some steady-state mean voltage (necessarily subthreshold). Thus, threshold crossings are equally likely to occur in any small interval, and spike statistics are Poisson. In the intermediate regime, the mean voltage is still significantly rising, yet typical voltage fluctuations can be sufficient to cross threshold. Spike variability in this regime should be a mixture of $1/\sqrt{N}$ integration effects and Poisson random crossings.

ISI variability depends on the relative amount of time spent in the regimes. As shown in Figure 3B, integration in the EPSP integrator model is primarily in the refractory regime, and hence spiking is regular. In the low-gain balanced-inhibition model, transient changes in the balance of excitation and inhibition are large enough to dominate the cell's refractoriness
Figure 3: Mean ± standard deviation of the distribution of voltages. (A) Schematic showing three regimes of behavior (see text). (B) Data obtained from repeated (n = 10,000) experiments starting the model (without spiking) from $V_{reset}$. Rates were set to achieve firing rates of 100 (±5) Hz. Solid lines are from the high-gain model ($V_{reset} = -60$ mV; R = .75; $\lambda_{ex} = 8885$ Hz; $\lambda_{in} = 3332$ Hz). Dashed and dotted lines are low-gain models ($V_{reset} = -74$ mV $= V_{rest}$). Dashed lines are from the EPSP integrator (R = 0; $\lambda_{ex} = 7500$ Hz; $\lambda_{in} = 0$ Hz). Dotted lines are from a low-gain, balanced-inhibition model (R = 1.25; $\lambda_{ex} = 23{,}261$ Hz; $\lambda_{in} = 14{,}538$ Hz). The arrow points to the mean ISI.
rather quickly, and the cell spends most of its time in the intermediate and steady-state regimes. Note that this recovery is aided by a reduction of the membrane time constant due to the large synaptic conductance. The small reset of the high-gain model causes the cell to spend most of the time in the intermediate or steady-state regimes for virtually any input combination that yields biologically reasonable firing rates.

Note that any mechanism that acts to reduce the importance of the transient, refractory regime contributes to variable spiking. In our model, $V_{reset}$ is the dominant parameter determining the significance of the refractory regime. However, reducing the membrane time constant $\tau$ shortens transients and can emphasize steady-state behavior. Increasing the variability of the synaptic current results in a more rapid transition to the intermediate regime and hence will also increase spike variability. Thus balanced inhibition, which reduces $\tau$ and increases input variability, and high-gain mechanisms are independent effects that can contribute to variable spiking.

It has been suggested (Bell et al., 1995) that spike variability depends on the cell's "hovering" near threshold, so that the integration of only a few inputs can cause a spike. While this is similar to our explanation of steady-state behavior, there is a significant difference. In the steady-state regime, threshold crossings result from random fluctuations in the input current,
so spiking is always Poisson, regardless of steady-state voltage level. Thus, the nearness to threshold in the steady-state regime affects spike rate but does not directly affect spike variability. 5 Discussion We have shown that matching simple integrate-and-fire neurons to experimental data results in ISI distributions with physiological CV. In additional studies, we have found that high CV is also produced by more realistic integrate-and-fire models incorporating the finite time course of synaptic conductances, realistic combinations of fast and slow excitatory synapses (AMPA and NMDA), and spike rate adaptation. Lengthening the synaptic time course can result in input correlations longer than the shortest ISIs and lead to burstlike behavior that can actually increase CV; a realistic mixture of fast and slow synapses avoids such burstiness while achieving high CV.12 Does an understanding of the intermediate and steady-state regimes have implications for neural coding? In these regimes, both the mean synaptic current and the variance in this current contribute to neuronal spiking. A neuron operating in these regimes is certainly capable of rate coding— coding mean input rates with a mean output rate. When inputs are Poisson, the mean input rate of excitatory and inhibitory events determines both the mean and the variance of the total synaptic current and consequently the mean output rate. However, such a neuron is also capable, in principle, of responding to input fluctuations with high temporal resolution. The possible contributions of these different forms of coding will depend on signal-to-noise considerations—how reliably a given statistical attribute of the input can be transmitted and distinguished from random input fluctuations—which will in turn depend on the details of the statistics of the neuronal input. Thus, the fact of high CV alone does not serve to distinguish between “rate coding” and “coincidence detection” or “spike timing” models of neural encoding. One difficulty with high-gain models is that they display low dynamic range; relatively small fluctuations in input current can lead to saturated outputs. In other studies (Troyer & Miller, 1995), we have found that spike rate adaptation helps to solve this problem by providing negative feedback that reduces gain on longer time scales, while preserving sensitivity to small input fluctuations on the time scale of typical interspike intervals. This fits well with the empirical observations that all excitatory cortical neurons show spike rate adaptation, while most inhibitory cortical neurons do not adapt but can sustain high rates of firing without saturating (McCormick et al., 1985).
12 The burstlike behavior gives an unrealistically low CV2, a measure of variability that compares only adjacent interspike intervals (Holt, Softky, Koch, & Douglas, 1996). The mix of fast and slow synapses gives a realistically high CV2.
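CV2 is straightforward to compute. A minimal sketch following the definition in Holt et al. (1996), CV2 = 2|ISIn+1 − ISIn|/(ISIn+1 + ISIn) averaged over adjacent interval pairs (the example trains below are ours, for illustration only):

    import numpy as np

    def cv2(isi):
        # Mean CV2: adjacent-interval variability, insensitive to slow rate drift.
        isi = np.asarray(isi, dtype=float)
        return np.mean(2.0 * np.abs(np.diff(isi)) / (isi[1:] + isi[:-1]))

    print(cv2([10, 10, 10, 10]))       # regular train -> 0.0
    print(cv2([2, 30, 3, 25, 2, 28]))  # burstlike alternation -> about 1.7 (max is 2)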
Our high-gain model results in ISI variability that falls in the lower half of the range 0.5 to 1 reported in Softky and Koch (1993). If one includes the refractory period, even a Poisson model cannot account for the upper reaches of these data. The most likely explanation is that cortical ISI variability results from a combination of mechanisms (including high gain) that have generally been explored in isolation. However, ISI variability alone is inadequate to distinguish the relative contributions of these mechanisms to cortical integration. Stronger constraints may be found from experiments designed to probe the biophysical basis of integration and spiking in cortical cells and from further statistical studies. The latter might examine, in vivo, the statistics of synaptically driven voltage and current fluctuations as well as spike train statistics beyond CV, and in vitro, the dependence of firing rate and CV on both mean and variance of white noise input current. One recent technique that may shed light on the contribution of cortical inhibition is the ability to block inhibitory input to a single cell pharmacologically (Nelson, Toth, Sheth, & Sur, 1994). To understand cortical ISI variability, classical notions of neuronal integration must be modified, not by abandoning the integrate-and-fire neuron but by deepening our understanding of the dynamics of such model neurons. In particular, the common intuition of dynamics dominated by recovery, with 1/√N integration behavior, must be supplemented by an understanding of a regime in which spike behavior is Poisson. Such a regime has been well studied in voltage-based integrate-and-fire models using a variety of statistical approaches (Gerstein & Mandelbrot, 1964; Stein, 1967; Smith, 1992; Tuckwell, 1988, Chap. 9); in certain limits the regime can be described as an unbiased random walk of voltage with decay. The importance of such a random, steady-state regime for cortical processing has been discussed by several previous authors (Abeles, 1991; Shadlen & Newsome, 1994; Bell et al., 1995). Our contributions are to make specific links between refractoriness and neuronal gain and to demonstrate that an integrate-and-fire model fitted to known cortical physiology will operate in the intermediate and steady-state regimes for physiologically reasonable firing rates and a wide range of ratios of inhibition to excitation. In these regimes, variable spike statistics dominate, and hence the model produces physiological CV. Thus, in contrast to the intuition presented in Softky and Koch (1993), one should expect a high degree of ISI variability from cortical neurons. Of course, a full biophysical understanding of cortical integration may yet require abandonment of this simple view of cortical processing, but the fact of cortical ISI variability alone does not. Acknowledgments We thank Larry Abbott, Paul Bush, Mark Kvale, Anton Krukowski, Neal Hessler, and Svilen Tzonev for helpful comments on this article. This work
was supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (T.W.T.) and by a Whitaker Foundation Biomedical Engineering Research Grant, the Searle Scholars Program, and the Lucille P. Markey Charitable Trust (K.D.M.).

References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Bell, A. J., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. J. (1995). Balanced conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego, CA: Institute for Neural Computation, University of California at San Diego.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). A comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75(5), 1806–1814.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–805.
Nelson, S., Toth, L., Sheth, B., & Sur, M. (1994). Orientation selectivity of cortical neurons during intracellular blockade of inhibition. Science, 265, 774–777.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Smith, C. E. (1992). A heuristic approach to stochastic models of single neurons. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 561–588). San Diego, CA: Academic Press.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophys. J., 7, 797–826.
Troyer, T. W., & Miller, K. D. (1995). A simple model of cortical excitatory cells linking NMDA-mediated currents, ISI variability, spike repolarization and slow AHPs. Soc. Neuro. Abs., 21, 1653.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. J. Theor. Biol., 105, 345–368.
Received January 12, 1996; accepted July 29, 1996.
Communicated by Anthony Bell
Role of Temporal Integration and Fluctuation Detection in the Highly Irregular Firing of a Leaky Integrator Neuron Model with Partial Reset Guido Bugmann School of Computing, University of Plymouth, Plymouth PL4 8AA, UK
Chris Christodoulou Department of Electronic and Electrical Engineering, King’s College London, Strand, London WC2R 2LS, UK
John G. Taylor Department of Mathematics, King’s College London, Strand, London WC2R 2LS, UK Institut für Medizin, KFA Jülich, Jülich, Germany
Partial reset is a simple and powerful tool for controlling the irregularity of spike trains fired by a leaky integrator neuron model with random inputs. In particular, a single neuron model with a realistic membrane time constant of 10 ms can reproduce the highly irregular firing of cortical neurons reported by Softky and Koch (1993). In this article, the mechanisms by which partial reset affects the firing pattern are investigated. It is shown theoretically that partial reset is equivalent to the use of a time-dependent threshold, similar to a technique proposed by Wilbur and Rinzel (1983) to produce high irregularity. This equivalent model allows establishing that temporal integration and fluctuation detection can coexist and cooperate to cause highly irregular firing. This study also reveals that reverse correlation curves cannot be used reliably to assess the causes of firing. For instance, they do not reveal temporal integration when it takes place. Further, the peak near time zero does not always indicate coincidence detection. An alternative qualitative method is proposed here for that latter purpose. Finally, it is noted that as the reset becomes weaker, the firing pattern shows a progressive transition from regular firing, to random, to temporally clustered, and eventually to bursting firing. Concurrently, the slope of the transfer function increases. Thus, simulations suggest a correlation between high gain and highly irregular firing. 1 Introduction It has been reported by Softky and Koch (1992, 1993) and Bugmann (1990) that the highly irregular firing of cortical neurons cannot be reproduced by a single neuron performing the temporal integration of excitatory
Neural Computation 9, 985–1000 (1997)
© 1997 Massachusetts Institute of Technology
postsynaptic potentials (EPSPs) generated by independent stochastic input spike trains. They found that only simulations with biologically unrealistic short membrane integration time constants of RC < 1 ms allowed reproduction of the observed irregularity. While irregular firing at low frequency is commonly obtainable with long membrane time constants (see, e.g., Bugmann, 1991a), the specific problem posed by the data of Softky and Koch (1992, 1993) was the high irregularity at high frequency. This result has triggered investigations into alternative ways of producing irregular spike trains, for example, by exploiting dynamic properties of networks of spiking neurons (Usher, Stemmler, Koch, & Olami, 1994; Zipser, Kehoe, Littlewort, & Fuster, 1993) or exploiting balanced excitatory and inhibitory inputs (Shadlen & Newsome, 1994; Bell, Mainen, Tsodyks, & Sejnowski, 1995). Recently Bugmann and Taylor (1994) and Bell et al. (1995) have provided reminders that incomplete repolarization (or partial reset) of the membrane causes highly irregular firing (for older references to that approach, see Lansky & Musila, 1991). The combination of balanced excitation and inhibition and partial reset was advocated by Tsodyks and Sejnowski (1995). So far, however, it has not been shown that any of these mechanisms reproduces the dependence of irregularity on firing rate shown by Softky and Koch (1993). This article concentrates on the partial reset technique and demonstrates that it enables a single neuron with a realistic membrane time constant of 10 ms to reproduce the highly irregular firing of cortical neurons. It is shown theoretically that partial reset is equivalent to the use of a time-dependent threshold, similar to a technique proposed by Wilbur and Rinzel (1983) for producing high irregularity. This equivalent model allows establishing that temporal integration and fluctuation detection can coexist and cause irregular firing. It is shown that, on the one hand, the potential of the membrane results from the progressive accumulation of input currents, while on the other hand, current peaks are the events causing firing, because the firing threshold remains slightly above the rising potential. Reverse correlation curves are analyzed for various strengths of reset in an attempt to establish the causes of firing. However, it is found that they do not reveal temporal integration, and peaks near time zero are not reliable indicators of coincidence detection. An alternative qualitative method is proposed for that latter purpose. 2 Partial Reset and the Control of the Firing Irregularity Partial reset is a mechanism by which an output spike does not completely reset the membrane potential of a neuron model (Shigematsu, Akiyama, & Matsumoto, 1992). In our simulations of a leaky integrate-and-fire (LIF) neuron (see a description of that model in Softky & Koch, 1993, and in appendix A to this article), an output spike fired at time t resets the potential of the capacitor to V(t) = βVth, where Vth is the firing threshold and β is
the reset parameter, with a value between 0 and 1. By comparing the results of Rospars and Lansky (1993) and of Christodoulou, Clarkson, Bugmann, and Taylor (1994), who found CV > 1 when no resetting was used, with the results of Softky and Koch (1993), who found CV < 1 for total reset (i.e., β = 0), it was inferred that partial reset may allow a fine control of the irregularity of the spike trains. This is confirmed in Figure 1, which illustrates that CV can be varied over a wide range by varying the reset parameter β. Partial reset in a LIF neuron most likely corresponds to partial repolarization of the somatic membrane in more detailed models (Bell et al., 1995), but it could also reflect a true but incomplete electrical decoupling between the axonal trigger zone and the neuronal soma (Bras, Gogan, & Tyc-Dumont, 1988) or between soma and dendrites (Rospars & Lansky, 1993), or be merely a computationally simple way to reproduce the effects of more complex dynamical properties of the membrane (see, e.g., Carpenter, 1979; Lansky & Musila, 1991). 3 Equivalence Between Partial Reset and Time-Varying Threshold A LIF neuron with constant threshold and partial reset is equivalent to a LIF neuron with total reset and a time-varying threshold. This is briefly demonstrated analytically in appendix A and is illustrated in Figure 2. After a spike is produced, partial reset initially sets the potential V(t) of the capacitor to βVth. If there is no subsequent input, the potential decays according to the curve Vo(t). If there are input spikes, the potential increase due to the inputs builds up on top of the decaying Vo(t) potential. The resulting total potential V(t) is overall relatively constant, and firing appears to be controlled by maxima of potential fluctuations, usually attributed to coincidences of input spikes. Another picture emerges when the potential Vo(t) is subtracted from V(t) and from the threshold, an operation that, according to the theory, results in an equivalent model. In the equivalent model, the potential V′(t) of the soma now shows the usual integration-and-reset cycle seen in LIF models with total reset. In contrast to the constant threshold used in the classical LIF model, the threshold now increases with time, according to Vth(t) = Vth − Vo(t) = Vth(1 − β exp(−t/RC)). In Figure 2, this threshold stays closely above the membrane potential while both increase almost in parallel. 4 Determinants of the Firing Time How is the firing time determined in a LIF model with partial reset? Of special interest is a case of high irregularity and high frequency, for instance, the point T = 10 ms and CV = 0.87 on the curve for β = 0.91 in Figure 1, which is characteristic of the biological data and could not be reproduced in simulations by Softky and Koch (1993). For comparison, points with T =
Figure 1: Dependence of the coefficients of variation of the interspike intervals CV on the average interspike interval T, for three values of the reset parameter β. The symbols show the effects of partial reset on the irregularity of spike trains produced by an LIF neuron simulated with 1 ms integration time steps. The coefficients of variation obtained with β = 0.91 (square symbols) are very similar to those observed in cortical neurons (Figure 9 in Softky & Koch, 1993). Each point is based on 20,000 simulation time steps. As inputs, we used 50 random spike trains without refractory time. All inputs were excitatory and had the same average firing frequency, increased step by step from 150 Hz to 400 Hz, to vary the average output interspike interval T. The membrane potential was increased stepwise by 0.16 mV by each input spike, assumed to be of zero width. Between spikes, the potential decreased with a decay time constant of 10 ms (13 ms in Softky & Koch, 1993). During the refractory time, integration of inputs continued, but the value of the potential was not compared with the firing threshold Vth = 15 mV, as in Christodoulou et al. (1994). The full line shows the theoretical curve for a random spike train with discrete time steps Δt and a refractory time of Tr time steps: CV(T) = √(σ²(T)/T²) = √(1 − α/(1 + αTr)). This expression was developed assuming that in each time step, except the Tr time steps following a spike, there is a probability α of observing a spike (Bugmann, 1995). For the plotted curve and the simulations, we used Δt = 1 ms and Tr = 1 ms (causing a minimum interspike interval Tr + Δt = 2 ms). The actual average interspike interval T is related to α by T = (1 + αTr)/α.
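The simulation protocol in this caption is easy to reproduce. The sketch below follows the stated conditions (50 excitatory Poisson inputs, 0.16 mV per input spike, 10 ms decay, Vth = 15 mV, reset to βVth, 1 ms steps, and a one-step refractory time during which inputs are integrated but the threshold is not tested); the input rate of 250 Hz per train and the random seed are our choices, so exact values will vary:

    import numpy as np

    rng = np.random.default_rng(0)

    def lif_partial_reset(beta, f_in_hz, n_inputs=50, dt=1e-3, tau=10e-3,
                          v_th=15.0, epsp=0.16, t_ref=1, n_steps=200_000):
        """Spike times (in steps) of an LIF neuron with partial reset."""
        decay = np.exp(-dt / tau)
        v, refrac, spikes = 0.0, 0, []
        for t in range(n_steps):
            v = v * decay + epsp * rng.binomial(n_inputs, f_in_hz * dt)
            if refrac > 0:
                refrac -= 1          # integrate inputs; skip the threshold test
            elif v >= v_th:
                spikes.append(t)
                v = beta * v_th      # partial reset to beta * Vth
                refrac = t_ref
        return np.array(spikes)

    isi = np.diff(lif_partial_reset(beta=0.91, f_in_hz=250.0))  # ISIs in ms
    print(f"mean ISI = {isi.mean():.1f} ms, CV = {isi.std() / isi.mean():.2f}")

Sweeping β from 0 toward 1 at a fixed output rate reproduces the qualitative trend of the figure: CV grows as the reset weakens.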
10 ms on the curves for the two other values of β are also discussed. It can be seen in Figure 2 that the potential V′(t) of the membrane (of the equivalent model) increases progressively, which is, in the commonly accepted sense, the result of the temporal integration of past input spikes. However, due to the small distance between the firing threshold and the potential, potential
Figure 2: Membrane potentials of the LIF model with partial reset (A, gray lines) and its equivalent model with total reset (B, black lines). The curve V(t) is observed in the case β = 0.91 with an average interspike interval of Tf = 10 ms. The simulation is done assuming that a spike has just been fired so that the initial potential is βVth. The output spike train is very irregular and has a CV = 0.87. The curve Vc(t) is obtained with a continuous input current equal to the average input current in the case of the curve V(t). The time taken to reach the threshold is now Tc = 17 ms. This shows that a fluctuating input current leads to higher firing rates. For comparison, in the case β = 0.98 one finds Tc = ∞ but Tf = 10 ms, and in the case β = 0, Tc = Tf = 10 ms. The curves Vo(t) and Vth(t) are explained in the text. The curve V′(t) is calculated from V(t) and Vo(t). The input current corresponds to the total number of input spikes in each time step. The simulation time step is 0.1 ms.
fluctuations are more likely to cause early firing and can thus become the actual causes of firing. The importance of potential fluctuations for the firing behavior of the neuron is reflected in the fact that the average interspike interval with a fluctuating input, Tf , is shorter than the interspike interval Tc observed with a continuous input current of the same average value.1 For instance, in the case β = 0, the fluctuations seem to play no role, as Tf = Tc = 10 ms. In the case β = 0.98, when Tf = 10 ms, the potential cannot reach the firing threshold with the continuous input current (Tc = ∞); therefore, fluctuations are essential for firing. In the case β = 0.91, the fluctuations play a significant role, as most spikes are produced early (Tc = 17 ms and Tf = 10 ms). Therefore, although the potential of the neuron is the result of a standard temporal integration process, the fluctuations of that potential determine the time of firing due to the partial reset mechanism or, equivalently, the use of a time-dependent threshold. 5 What Do Reverse Correlation Graphs Tell Us? What is the time scale over which input current fluctuations (fluctuations of the density of input spikes along the time axis) are integrated to result in potential fluctuations that cause firing? To answer such a question, one usually turns to reverse correlation graphs, which can “indicate the length of stimulus history that is relevant” (Mainen & Sejnowski, 1995). This interpretation of reverse correlation curves requires some caution (see also Shadlen & Newsome, 1995). For instance, in Figure 3, the curve for β = 0 is almost perfectly flat for times earlier than −0.5 ms, although it is established that in this case, every input spike arriving during an average of 10 ms before firing is relevant, as firing results from temporal integration of small EPSPs (Softky & Koch, 1992, 1993). An indication of the length of relevant stimulus history is provided only if the temporal organization of the inputs is important for firing. In the case β = 0, the timing is not important, and the graph does not show any feature, even if spikes are important for the firing process. In the curve for β = 0, the peak at time zero of amplitude 611 Hz is due to the fact that a LIF neuron can fire only when the instantaneous input current is above current threshold (Bugmann, 1991b). In models with EPSPs extended in time, for example, modeled with alpha functions (Bugmann, 1995), such a current level results from inputs having occurred some time in the past. However, in the model used here, where spikes cause an instantaneous increase in potential (of 0.16 mV), at least one spike must be present during a time step for firing to be possible during that time step. Therefore, the average input firing rate measured by the reverse correlation contains
1 Each input spike deposits a fixed electric charge ΔQ = V1 ∗ C (see appendix A) in the capacitor, so that a spike train of frequency f “carries” an average current IAV = f ∗ ΔQ.
Figure 3: Reverse correlation graphs for the three indicated values of the reset parameter β. The average interspike interval is T = 10 ms in all three cases. The graphs represent input firing frequencies based on number of input spikes observed in 0.1-ms time steps, at the indicated delay before and after a spike is produced. The frequencies stay at the same levels on both sides up to +15 ms and −15 ms. The fine lines indicate the level of the average input frequencies. Lower average input firing rates are needed as β increases (295 Hz for β = 0, 189 Hz for β = 0.91, and 178 Hz for β = 0.98) because partial reset removes fewer charges after each spike than total reset does. Amplitude of the peak at time zero: 611 Hz for β = 0, 546 Hz for β = 0.91, and 455 Hz for β = 0.98. Data are averaged over 10,000 output spikes (100 sec simulated operation). Decay time constant RC = 10 ms.
only time steps with one or more spikes for the bin at time zero. This results in large values of the input firing rate. For instance, if one uses only time steps with one or more spikes, the calculated frequency is 380 Hz, while the average input frequency is 295 Hz. If only time steps with two or more spikes are used, the calculations give 673 Hz. Thus, the amplitude of 611 Hz of the central peak indicates that firing mostly occurs during time steps with two or more input spikes. This sampling effect results in even higher peaks at time zero, when smaller simulation time steps are used. This effect, added to the fact that the central peak is observed for all three values of β, strongly suggests that the central peak does not have any computational meaning. For instance, for β = 0, temporal integration determines the time of firing, while in the case β = 0.98, potential fluctuations determine the time of firing.
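This sampling effect can be checked directly. Assuming the pooled input count per 0.1 ms bin is Poisson (a simplification; the article uses 50 discrete spike trains), conditioning on bins that contain at least one spike inflates the apparent per-input rate roughly as quoted; the k = 2 value depends on the details of the measurement and need not match the 673 Hz reported:

    import numpy as np

    rng = np.random.default_rng(1)

    def apparent_rate(rate_hz, n_inputs=50, dt=1e-4, k=1, n_bins=1_000_000):
        """Mean per-input rate computed only from bins with at least k spikes."""
        counts = rng.poisson(rate_hz * n_inputs * dt, n_bins)
        return counts[counts >= k].mean() / (n_inputs * dt)

    print("true mean rate: 295 Hz")
    for k in (1, 2):
        print(f"bins with >= {k} spike(s): {apparent_rate(295.0, k=k):.0f} Hz")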
In the curve for β = 0, in the three to four time steps before zero, there is a small accumulation of spikes, which indicates that, on average, a spike is produced just after a small peak in input current (due to a higher-than-average arrival rate of input spikes) lasting approximately 0.5 ms. In the case β = 0, such an accumulation cannot be taken as evidence that the neuron is technically a coincidence detector in the sense that only coincidences within a small time window of 0.5 ms cause firing. It merely reflects the fact that in the case of a fluctuating input current, the potential of an LIF neuron usually crosses the threshold during a positive fluctuation. This fluctuation does not need to be large in the case β = 0, where the saturation potential for an equivalent continuous input current is 24 mV and the potential is rising steeply when crossing the firing threshold. Actually, the comparison of Tc with Tf done earlier shows that fluctuations are not needed at all, because the neuron would fire at the same rate when fed with an average input current. In the case β = 0.91, the saturation potential is slightly above the threshold, but at time T = 10 ms, the average potential is still below threshold at 14.4 mV (starting with a reset potential βVth = 13.65 mV). Thus potential fluctuations of sufficient amplitude are required to cause the observed early firing. This is reflected in the reverse correlation curve, where the accumulation before time zero is of larger amplitude and longer duration than in the case β = 0, indicating that firing is usually preceded by current peaks lasting up to 2 ms.2 In this case, where the input spikes arrive at an average rate of 0.95 per bin (0.1 ms), current peaks are not characterized by a clean current plateau of given duration. They are probably best described as clusters of spikes, two of which can be seen in Figure 2 before each of the two output spikes. In the case of partial reset, individual spikes have a potentially higher chance of causing firing because the membrane potential is close to the firing threshold. However, the width of the reverse correlation curve indicates that clusters of various durations contribute to the firing. Thus, partial reset is not equivalent to the use of the giant EPSPs of over 10 mV proposed in Softky and Koch (1993). The earlier analysis of the firing times indicates that this neuron probably has a coincidence detection
function, where a coincidence is any cluster containing a sufficient number of spikes in a given time window. However, the reverse correlation curve cannot confirm this hypothesis. As seen in the case β = 0, the fact that firing usually occurs after a peak of above average input current cannot be taken as evidence that the neuron is “detecting” this type of peak. In the case β = 0, many other high current peaks are occurring at random times before firing, do not cause firing, and do not appear in the reverse correlation curve.
6 Proving Coincidence Detection To prove formally that the neuron operates as a coincidence detector, it has to be shown that most current peaks of the same size as those seen before a spike actually cause a spike. A theoretical proof is a complex task beyond the scope of this article. However, it can be checked by simulation whether current peaks systematically cause firing. For that purpose, simulations were performed with two output neurons receiving the same input spikes, the first being the LIF neuron with partial reset and the second acting as a peak detector. The second LIF neuron has a very short decay time constant of 2 ms (this duration is taken from the reverse correlation curve for β = 0.91), and its potential is not reset after a spike. Due to the fast decay, potential peaks reflect the above-average rate of arrival of input spikes during a preceding time window of up to 2 ms. By placing the firing threshold of this neuron appropriately (3.8 mV), its spikes become indicators of peaks in input current. Figure 4 shows that the two neurons often fire spikes at the same time or within 1–2 ms of each other. Although this is not a formal proof, it strongly suggests that the LIF neuron with partial reset β = 0.91 operates mainly as a detector of peaks of input current of duration less than 2 ms. Because such events are random, this property is the cause of values of CV close to one. This small time window of 2 ms and the similarity between the fluctuating portions of the two potential curves shown as full lines (see Figure 4) suggest that the neuron may operate with a small effective time constant. This is, however, not the case. There is no inhibitory leak in this model that could have such an effect (Softky, 1995). Further, it is shown in appendix B that when the potential is maintained close to its saturation value by a continuous input current, a small increase in potential decays toward the saturation potential with a time constant of 10 ms. Two factors cause the similarity between the two curves: (1) The model uses a step increase of the potential for each input spike, and so there is no difference in the rising phases in both neurons; and (2) the reset cuts through the maxima in the potential of the capacitor with RC = 10 ms and moves the decreasing parts of the potential below the threshold. These are of similar amplitude in both curves, indicating a similar leak current in both neurons. This is due to the difference in average potentials, despite different decay time constants.
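A sketch of this two-neuron comparison follows. The shared drive (50 inputs at an illustrative 200 Hz each), the absence of a refractory period, and the registering of a detector event at each upward threshold crossing (since the detector is never reset) are our assumptions; the thresholds and time constants are those given in the text:

    import numpy as np

    rng = np.random.default_rng(2)

    dt, n_steps = 1e-4, 300_000                 # 0.1 ms steps, 30 s
    tau1, tau2 = 10e-3, 2e-3                    # integrator and peak detector
    vth1, vth2, beta = 15.0, 3.8, 0.91
    d1, d2 = np.exp(-dt / tau1), np.exp(-dt / tau2)
    drive = 0.16 * rng.binomial(50, 200.0 * dt, n_steps)  # shared EPSPs (mV)

    v1 = v2 = 0.0
    above = False
    s1, s2 = [], []
    for t in range(n_steps):
        v1 = v1 * d1 + drive[t]
        v2 = v2 * d2 + drive[t]
        if v1 >= vth1:
            s1.append(t)
            v1 = beta * vth1                    # partial reset
        crossing = v2 >= vth2
        if crossing and not above:
            s2.append(t)                        # detector event: upward crossing
        above = crossing

    s1, s2 = np.array(s1), np.array(s2)
    idx = np.clip(np.searchsorted(s2, s1), 1, len(s2) - 1)
    nearest = np.minimum(np.abs(s1 - s2[idx - 1]), np.abs(s1 - s2[idx]))
    print(f"{np.mean(nearest <= 20):.0%} of spikes have a detector event within 2 ms")

With these settings, most partial-reset spikes should be accompanied by a detector event, consistent with Figure 4.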
Figure 4: Membrane potentials and firing times of two neurons with the same input spike trains, simulated during 300 ms. The upper row of spike times and the upper membrane potential (in the full line) belong to a LIF neuron with partial reset (β = 0.91) and a membrane time constant of RC = 10 ms. The firing threshold is Vth = 15 mV. The dotted line represents the free evolution of the potential with neither spiking nor resetting. The lower row of spike times and the lower membrane potential belong to a LIF neuron with a membrane decay time constant of RC = 2 ms and a firing threshold Vth = 3.8 mV. This neuron is not reset after a spike and is used to detect peaks of input current or, equivalently, peaks of spike input rate. Further simulation conditions are as in Figure 2.
7 Temporally Clustered Firing and Neuronal Gain Values of CV larger than one are caused by a temporal clustering process (e.g., bursting) that results from a smooth transition from the peak detection regime described above into a regime where multiple spikes are generated in response to peaks (single responses to peaks can produce values of CV only up to 1). As an indicator of clustering, there is an interesting peak at approximately −2 ms in the reverse correlation curve for β = 0.98 (see Figure 3). The interpretation of this peak requires several logical steps. To start, it is not sensible to take this peak as evidence that an input current peak occurs frequently 2 ms before a spike is fired. More plausible is the assumption that the peak has caused a first spike followed by a second one 2 ms later. There is no current peak at t = 2 ms in the reverse correlation curve, indicating that the first spike is generally not followed by an increased input current 2 ms later. Thus the accumulation of current just before time zero and the accumulation preceding the spike at −2 ms do not occur on the same trials. The picture that emerges is that either a first spike is followed by a second spike 2 ms later, which is not caused by a peak in input current, or an isolated spike is produced by a current peak. The simplest interpretation is that a current peak produces either a single spike or a pair of spikes
separated by the minimum interspike interval (for β = 0.98, the reverse correlation curve shows only a faint indication of triplets, which becomes much stronger for β = 0.995 [not shown]). Two effects favor clustered firing: (1) the small reset enables long-lasting current peaks to cause repeated firing, and (2) a decreasing average potential favors a bimodal firing pattern. For instance, in the case β = 0.98, the reset potential is initially 0.3 mV below threshold, but then the potential decreases toward the saturation potential, which is 0.8 mV below threshold. Thus, it is just after a spike that the firing conditions are the most favorable; however, the more the firing is delayed after that initial period, the higher the current peaks required and the longer the average time to the next peak.3 While the first effect generally causes clear bursts of spikes, the second causes a more discrete clustered firing. Interspike interval (ISI) distribution curves show a transition from regular, to random, to clustered, and eventually to bursting firing (Bugmann, 1995) when the value of β increases. This transition can be described as a progressive replacement of single spikes with regular interspike intervals by clusters with a larger and larger number of spikes per cluster, a process that enhances the ISIs at small intervals and depletes them at large intervals. Such a process increases the value of CV (Bugmann, 1995). Clusters can also be seen as multiple responses to favorable input events, which increase the gain of the neuron, as illustrated in Figure 5.4 The higher gain observed with high values of β can also be inferred from the small change in average potential that is required to cause a large change in firing rate (see, e.g., Figure 4). Thus, simulations suggest a correlation between high gain and highly irregular firing. 8 Summary The effects of the strength of reset on the irregularity of firing can be summarized as follows: 1. With total reset, for a spike to be produced after 10 ms, the average potential must rise steeply during the interspike interval. This favors the late production of spikes with relatively regular interspike intervals. When the potential crosses the threshold, fluctuations play a minor role in determining the exact time of firing. This explains a small nonzero CV for the model with total reset.5
3 Firing is possible only after the refractory time (2 ms in this case), so the average potential at that time is relevant for fluctuation detection. 4 This is consistent with the observation that in bursting neurons, information is carried by the rate of bursts rather than the rate of spikes (Bair, Koch, Newsome, & Britten, 1994). 5 When the average potential approaches the threshold very slowly, fluctuations become the main cause of firing. Therefore, neuron models with total reset also show high values of CV at low frequency.
Figure 5: Transfer function of the LIF neuron model with various values of the reset parameter β. With β = 0 our model is equivalent to the one used in Softky and Koch (1993). The theoretical current threshold is 1.5, but due to the long integration time steps used (1 ms) the model underestimates leaks and shows a smaller threshold of 1.41. Firing below that threshold is due to fluctuations. Note that partial reset modifies the gain and not the current threshold. Simulation conditions as in Figure 1.
2. With partial reset and a reset parameter β = 0.91, the average potential remains relatively constant and close to the threshold, so that random firing results from the detection of current peaks. The amplitude of reset can be seen as a way of selecting the amplitude of the fluctuations that are able to cause firing. This also determines the average interspike interval. 3. With partial reset and a reset parameter β = 0.98, the potential is very close to the threshold after firing and repetitive firing occurs during periods of maintained high input current. Because each detected current peak can produce multiple spikes, to achieve an average interval of 10 ms, the number of detected current peaks must be small. For that purpose, a lower average potential is used, which causes long, silent periods between bursts. This bimodal firing pattern results in values of CV > 1. For a given average interspike interval, with strong reset, the duration of the interspike interval is determined by the time taken to integrate the input current. With weak reset, the random occurrence of current peaks determines the firing time. Currently, reverse correlation graphs do not
allow quantifying the contribution of each of these mechanisms in intermediate cases. While reverse correlation graphs show that current peaks of given sizes usually precede spikes, they do not indicate whether the neuron was integrating current before the peak or whether it was waiting for a peak of sufficient amplitude. Further analytical work is needed to determine whether it is possible to extract this information. Probably a reverse correlation graph of the membrane potential would be more informative. Alternatively, useful indications can be obtained by comparing the firing times caused by a fluctuating current and by an equivalent continuous current, as done earlier in this article. In conclusion, highly irregular firing similar to that observed in biological neurons can be produced with a simple leaky integrator model provided with a partial reset mechanism. This mechanism is a simple way to improve the realism of spike trains produced by simulated neurons. It is also a powerful parameter to control the gain of a neuron. Note that none of the mechanisms suggested by Softky and Koch (1993) is at work here: strong inhibitory leak, individual input spikes with high amplitude or strong dendritic amplification, or nonrandom synchronized inputs. By using partial reset, the temporal integration of random input spikes is exploited to maintain the average potential of the neuron at a small distance from the threshold during the entire integration time, allowing input current fluctuations (made of small EPSPs) to cause firing at random times. Appendix A: Equivalence Between a Model with Partial Reset and a Model with Time-Dependent Threshold The demonstration has two steps. First, it is shown that the potential of the membrane in the case of partial reset is the sum of a decaying reset potential Vo(t) and a potential Vs(t) due to the integration of input spikes in a neuron with total reset. Let us assume that the input current I(t) is made of very short impulses of duration Δt, each carrying a charge V1 ∗ C. It is then straightforward to show that the membrane potential V(t), which is a solution of the leaky integrator equation A.1,

dV(t)/dt = −(1/RC) V(t) + (1/C) I(t),    (A.1)
has the following form:

V(t) = βVth exp(−t/RC) + V1 Σj exp(−(t − tj)/RC) = Vo(t) + Vs(t),    (A.2)
where tj is the time of arrival of an individual spike. When the reset is total, that is, β = 0, then V(t) = Vs(t), which shows that Vs(t) has the desired property (i.e., is the potential of a neuron with total reset). The second step is to show that the firing time is the same for the model with partial reset and constant threshold Vth and the model with total reset but varying threshold Vth(t). The neuron with partial reset fires an interval T after the last spike if the firing condition is satisfied:

V(T) = Vth = Vo(T) + Vs(T).    (A.3)
This is equivalent to the firing condition describing the firing time in a model with total reset and a time-dependent threshold Vth(t):

Vth(T) = Vth − Vo(T) = Vs(T).    (A.4)
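The equivalence can also be verified numerically: driven by the same input, a partial-reset neuron with fixed threshold Vth and a total-reset neuron with threshold Vth(t) = Vth(1 − β exp(−t/RC)) fire at exactly the same steps. A minimal sketch (input parameters are illustrative; both neurons start as if a spike has just been fired):

    import numpy as np

    rng = np.random.default_rng(3)

    dt, tau, vth, beta = 1e-4, 10e-3, 15.0, 0.91
    decay = np.exp(-dt / tau)
    drive = 0.16 * rng.binomial(50, 200.0 * dt, 200_000)  # shared EPSPs (mV)

    v1, v2, since = beta * vth, 0.0, 0      # as if a spike was just fired
    sp1, sp2 = [], []
    for t in range(len(drive)):
        since += 1
        v1 = v1 * decay + drive[t]          # partial reset, constant threshold
        v2 = v2 * decay + drive[t]          # total reset, rising threshold
        vth_t = vth - beta * vth * decay ** since   # Vth(t) = Vth - Vo(t)
        if v1 >= vth:
            sp1.append(t)
            v1 = beta * vth
        if v2 >= vth_t:
            sp2.append(t)
            v2, since = 0.0, 0
    print("identical spike times:", sp1 == sp2)   # prints True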
This shows that, with the same input conditions, both models fire at the same time and are therefore equivalent. In the model of Wilbur and Rinzel (1983), the threshold increases with a function of shape tanh(t), while partial reset results in an exponential increase (see equations A.2 and A.4). Appendix B: Decay Time Constant for Fluctuations Let us assume that the input current is made of a continuous background Io to which a fluctuating component with average zero is added. The continuous component brings the potential of the neuron up to a saturation value Vo. A positive current fluctuation has brought the potential to Vo + ΔV. At time t = 0, the current becomes equal to Io, and the potential decays. At time zero, the following equation describes the decay of the potential from a value Vo + ΔV:

dV/dt = −(Vo + ΔV)/RC + Io/C.    (B.1)
As Vo = Io ∗ R, equation B.1 simplifies to

dV/dt = −ΔV/RC.    (B.2)
Equation B.2 describes the discharge of a capacitor with time constant RC and no external input current. If the continuous input current Io were also switched off at time zero, the capacitor would discharge at the larger rate (Vo + ΔV)/RC. Acknowledgments We are thankful to Christian Lehmann, Sue McCabe, Mike Denham, and Jack Boitano for helpful discussions and comments on earlier versions of
this article. We are also very grateful to anonymous referees for their stimulating criticisms and pointers to the literature. One of us, C.C., gratefully acknowledges the support of the Leverhulme Trust for Grant F.40Z awarded to King’s College London.
References

Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. Journal of Neuroscience, 14, 2870–2892.
Bell, A. J., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. J. (1995). Balancing conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego, CA: Institute for Neural Computation, University of California at San Diego.
Bras, H., Gogan, P., & Tyc-Dumont, S. (1988). The mammalian central neuron is a complex computing device. Proc. IEEE 1st Intl. Conf. on Neural Networks, San Diego, CA, 4, 123–126.
Bugmann, G. (1990). Irregularity of natural spike trains simulated by an integrate-and-fire neuron. In Extended Abstracts, 3rd International Symposium on Bioelectronics and Molecular Electronic Devices (pp. 105–106). Toranomon, Japan: FED.
Bugmann, G. (1991a). Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network: Computation in Neural Systems, 2, 489–509.
Bugmann, G. (1991b). Neural information carried by one spike. Proc. 2nd Australian Conf. on Neural Networks (ACNN’91), Sydney, pp. 235–238.
Bugmann, G. (1995). Controlling the irregularity of spike trains (Research Rep. NRG-95-04). Plymouth, UK: School of Computing, University of Plymouth.
Bugmann, G., & Taylor, J. G. (1994). A top-down model for neuronal synchronisation (Research Rep. NRG-94-02). Plymouth, UK: School of Computing, University of Plymouth.
Carpenter, G. A. (1979). Bursting phenomena in excitable membranes. SIAM J. Appl. Math., 36, 334–372.
Christodoulou, C., Clarkson, T., Bugmann, G., & Taylor, J. G. (1994). Modelling of the high firing variability of real cortical neurons with the temporal noisy-leaky integrator neuron model. Proc. IEEE Int. Conf. on Neural Networks (ICNN’94), World Congress on Computational Intelligence (WCCI’94), Orlando, FL, pp. 2239–2244.
Lansky, P., & Musila, M. (1991). Variable initial depolarisation in Stein’s neuronal model with synaptic reversal potentials. Biological Cybernetics, 64, 285–291.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Rospars, J. P., & Lansky, P. (1993). Stochastic neuron model without resetting of dendritic potential: Application to the olfactory system. Biological Cybernetics, 69, 283–294.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in noise? Current Opinion in Neurobiology, 5, 248–250.
Shigematsu, Y., Akiyama, S., & Matsumoto, G. (1992). A spike-firing neural cell (SAM). Extended Abstracts, 4th Int. Symp. on Bioelectronic and Molecular Electronic Devices, Miyazaki, pp. 13–14.
Softky, W. R. (1995). Simple code versus efficient code. Current Opinion in Neurobiology, 5, 239–247.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4, 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13, 334–350.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local-field potentials. Neural Computation, 6, 795–836.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. Journal of Theoretical Biology, 105, 345–368.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. J. Neuroscience, 13, 3406–3420.
Received July 14, 1995; accepted July 30, 1996.
Communicated by Anthony Zador
Shunting Inhibition Does Not Have a Divisive Effect on Firing Rates∗ Gary R. Holt Christof Koch Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Shunting inhibition, a conductance increase with a reversal potential close to the resting potential of the cell, has been shown to have a divisive effect on subthreshold excitatory postsynaptic potential amplitudes. It has therefore been assumed to have the same divisive effect on firing rates. We show that shunting inhibition actually has a subtractive effect on the firing rate in most circumstances. Averaged over several interspike intervals, the spiking mechanism effectively clamps the somatic membrane potential to a value significantly above the resting potential, so that the current through the shunting conductance is approximately independent of the firing rate. This leads to a subtractive rather than a divisive effect. In addition, at distal synapses, shunting inhibition will also have an approximately subtractive effect if the excitatory conductance is not small compared to the inhibitory conductance. Therefore regulating a cell’s passive membrane conductance—for instance, via massive feedback—is not an adequate mechanism for normalizing or scaling its output. 1 Introduction Many neuronal models treat the output of a neuron as an analog value coded by the firing rate of a neuron. Often the analog value is thought of as what the somatic voltage would be if spikes were pharmacologically disabled (sometimes called a generator potential). This has led to a class of simplified single-compartment models where the steady-state above-threshold membrane potential is computed as

V = Isyn / G,
∗ It has come to our attention that a previous study also pointed out that shunting synapses close to the spike mechanism have a subtractive effect (F. Gabbiani, J. Midtgaard, and T. Knöpfel, Synaptic Integration in a Model of Cerebellar Granule Cells, J. Neurophysiol., 72: 999–1009, 1994).
Neural Computation 9, 1001–1013 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: A comparison of divisive and subtractive inhibition. (A) Divisive inhibition changes the slope of the input-output relationship. In this case, f = g(V) was a linear function of V and G was varied from 10 to 70 nS in equal steps. (B) Subtractive inhibition shifts the curves by subtracting a current. Here Iinh varies from 0.08 to 0.058 nA in equal steps.
where Isyn is the synaptic current and G is the input conductance. A firing rate is then computed directly from this above-threshold membrane potential: f = g(V), where g is some monotonic function—for example, fout ∝ V² in Carandini and Heeger (1994) or fout = tanh(V) in Hopfield (1984). Varying G—for instance, via activation of inhibitory input with a reversal potential close or equal to the cell’s resting potential (also known as silent or shunting inhibition)—will directly affect the generator potential V in a divisive manner. A recent and quite popular model (Carandini & Heeger, 1994; Nelson, 1994) has suggested that changing G by shunting inhibition would be a useful way to control the gain of a cell; when the inhibitory input rate increases, the slope of the input-output relationship decreases (see Figure 1A), but the threshold does not change much. On the other hand, inhibition that is not of the shunting variety should have a subtractive effect on the input-output relationship. If the reversal potential of the inhibition is far from the spiking threshold, then the inhibitory synapse will act more like a current source; the cell’s conductance is not changed much, but a hyperpolarizing current is injected. This current simply shifts the input-output relationship by changing Isyn to Isyn − Iinh (see Figure 1B) where Iinh is the inhibitory current. Simplified models based on a generator potential ignore the effect of the spiking mechanism and assume that the behavior of the neuron above threshold is adequately described by the subthreshold equations. We show that because of the spiking mechanism, changing the membrane leak con-
ductance by shunting inhibition does not have a divisive effect on firing rate, casting doubt on the hypothesis that such a mechanism serves to normalize a cell’s response. A similar conclusion has been reached independently for certain classes of neuron models (Payne & Nelson, 1996). 2 Methods Compartmental simulations were done using the model described by Bernander, Douglas, Martin, and Koch (1991); Bernander, Koch, and Douglas (1994); Bernander (1993); and Koch, Bernander, and Douglas (1995). The geometry for the compartmental models was derived from a large layer V pyramidal cell and a much smaller layer IV spiny stellate cell stained during in vivo experiments in cat (Douglas, Martin, & Whitteridge, 1991) and reconstructed. Geometries of both cells are shown as insets in Figure 2. Each model has the same eight active conductances at the soma, including an A current and adaptation currents (see Koch et al., 1995, for details). The somatic conductance values were different for each cell, but the same conductance per unit area was used for each type of channel. Dendrites were passive. Simulations were performed with the program Neuron (Hines, 1989, 1993). To study the effect of GABAA synapses and shunting inhibition, we did not explicitly model each synapse; we set the membrane leak conductance and reversal potential at each location in the dendritic tree to be the time-averaged values expected from excitatory and inhibitory synaptic bombardment at presynaptic input firing rates fE and fI (as described in Bernander et al., 1991). To compute the time-averaged values, synapses were treated as alpha functions with a given time constant and maximum conductance (g(t) = gmax (t/τ) exp(1 − t/τ)). The area density of synapses at a given location on a dendrite was a function of the length of dendrite that separated the area from the soma (see Table 1). Two different sets of densities (“near” and “far”) were used, depending on whether inhibitory synapses were near the soma or far from the soma. The “near” configuration is identical to the distribution used by Bernander et al. (1991) and reflects the anatomical observation that inhibitory synapses are mostly located near the soma in cortical pyramidal cells. The “far” configuration is not intended to be anatomically realistic. For simplicity, we used the same number of synapses for both the spiny stellate cell and the layer V pyramidal cell models. Our integrate-and-fire model is described by

C dV/dt = −V gleak + Isyn    for V < Vth,    (2.1)
where gleak is the input conductance, C is the capacitance, and Isyn is the synaptic input. When the voltage V exceeds a threshold Vth , the cell emits a spike and resets its voltage back to 0. We used C = 1 nF, gleak = 16 nS, Vth = 16.4 mV, which matches the adapted current-discharge curve of the
Table 1: Parameters of Synapses in the Compartmental Models

Type     Reversal Potential   Number   gmax     τ        Area Density, Near           Area Density, Far
AMPA     0 mV                 4000     1 nS     0.5 ms   1 + tanh((x − 40)/22.73)     1 + tanh((x − 40)/22.73)
GABAA    −70 mV               500      1 nS     5 ms     e^(−x/50)                    1 + tanh((x − 100)/22.73)
GABAB    −95 mV               500      0.1 nS   40 ms    x e^(−x/50)                  1 + tanh((x − 100)/22.73)
Note: x is the length of dendrite in µm that separates this compartment from the soma. The “near” area density was used for Figure 2; the “far” area density was used for Figure 4. The normalization for the area density is not included in the expression because it depends on the geometry; different neurons have different fractions of their membrane at a given distance from the soma.
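For a synapse activated at rate f, the time-averaged conductance follows directly from the alpha function: ⟨g⟩ = f ∫ g(t) dt = f gmax e τ. A sketch of the resulting pooled conductance and effective reversal potential, using the parameter pairings as reconstructed in Table 1 above and illustrative input rates (the full model additionally weights synapses by the area densities and adds the resting membrane leak):

    import numpy as np

    def mean_g_ns(f_hz, gmax_ns, tau_s):
        # <g> = f * integral of gmax*(t/tau)*exp(1 - t/tau) dt = f * gmax * e * tau
        return f_hz * gmax_ns * tau_s * np.e

    f_e, f_i = 2.0, 4.0                      # illustrative presynaptic rates (Hz)
    g_ampa  = 4000 * mean_g_ns(f_e, 1.0, 0.5e-3)
    g_gabaa =  500 * mean_g_ns(f_i, 1.0, 5e-3)
    g_gabab =  500 * mean_g_ns(f_i, 0.1, 40e-3)
    g_syn = g_ampa + g_gabaa + g_gabab
    e_rev = (g_ampa * 0.0 - g_gabaa * 70.0 - g_gabab * 95.0) / g_syn
    print(f"synaptic conductance = {g_syn:.1f} nS, "
          f"effective reversal = {e_rev:.1f} mV")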
layer V pyramidal cell model quite well (not shown). Results from this article are not changed if a refractory period or an adaptation conductance is added to the integrate-and-fire model. 3 Results We first describe our results and reasoning for an “anatomically correct” distribution of synaptic inhibition in relatively close proximity to the cell body. Since we fail to observe silent inhibition acting divisively but would like to know under which circumstances it can do so, we next investigate the anatomically less realistic situation of more distal inhibition. 3.1 Proximal Inhibition. Changing gleak does not change the slope of the current-discharge curve for integrate-and-fire cells (see Figure 2A); it primarily shifts the curves. It therefore has a subtractive rather than a divisive effect. The compartmental models’ behavior is very similar to that of the integrate-and-fire unit. For two different geometries (a layer V pyramid and a layer IV spiny stellate cell), we computed the fully adapted firing rate as a function of the excitatory synaptic input rate for various rates of inhibitory input to synapses with GABAA receptors (see Figure 2B). The slope of the input-output relationship does not change when the GABAA input amplitude is changed; the entire curve shifts. The same effect can be observed when considering the current-discharge curves (not shown). This effect is most easily understood in the integrate-and-fire model. In the absence of any spiking threshold, V would rise until V = Isyn/gleak (see Figure 3). Under these conditions, the steady-state leak current is propor-
Figure 2: Changing gleak has a subtractive rather than a divisive effect on firing rates. (A) Current-discharge curves for the integrate-and-fire model, with gleak varying from 10 to 70 nS (from left to right) in steps of 10 nS. (B) Fully adapted firing rates of the two cells as a function of excitatory input rate for different inhibitory input rates. From left to right, the curves correspond to a GABAA inhibitory rate of 0.5, 2, 4, and 6 Hz. In all of these cases, the curve shifts rather than changes slope. In this case, inhibitory synapses were near the soma (“near” configuration in Table 1), as found in cortical cells.
However, if there is a spiking threshold, V never rises above V_th. Therefore, no matter how large the input current is, the leak current can never be larger than V_th g_leak. We can replace the leak conductance by a current whose value is equal to the time-averaged value of the current through the leak conductance (⟨I_leak⟩ = g_leak ⟨V⟩), and simplify the leaky integrate-and-fire unit to a perfect integrator. (Now, however, ⟨I_leak⟩ will be a function of I_syn.) If the current is suprathreshold, the cell will still fire at exactly the same rate because the same charge ∫_0^T (I_syn − I_leak) dt is deposited on the capacitor during one interspike interval T, although for the leaky integrator the deposition rate is not constant. For constant, just-suprathreshold inputs, ⟨V⟩ will be close to V_th, and ⟨I_leak⟩ will be large. For larger synaptic input currents, the time-averaged membrane potential becomes less and less (since V has to charge up from the reset point), and therefore the time-averaged leak current decreases for increasing inputs (compare Figures 3A and 3B). It can be shown that
⟨I_leak⟩ = I_syn                                                              if I_syn < V_th g_leak,
⟨I_leak⟩ = I_syn ( 1 + V_th g_leak / [ I_syn log(1 − V_th g_leak/I_syn) ] )   otherwise.   (3.1)
For large I_syn, and even for quite moderate levels of I_syn just above V_th g_leak, the lower expression is approximately equal to g_leak V_th/2, independent of I_syn (see Figure 3C). Therefore, it is a good approximation to replace the leak conductance by a constant offset current. The current-discharge curve for
the resulting perfect integrate-and-fire neuron is simply¹

f(I_syn) = (I_syn − ⟨I_leak⟩) / (C V_th) = I_syn/(C V_th) − g_leak/(2C).   (3.2)
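The subtractive shift predicted here is easy to verify numerically. The following sketch is our illustration, not the authors' simulation code: it integrates equation 2.1 by forward Euler using the parameters quoted in Section 2 (C = 1 nF, V_th = 16.4 mV) and two illustrative values of g_leak, and compares the simulated rates with the perfect-integrator approximation of equation 3.2.

```python
import numpy as np

C = 1e-9        # capacitance (F), from Section 2
V_th = 16.4e-3  # spike threshold (V), from Section 2

def lif_rate(I_syn, g_leak, T=2.0, dt=1e-5):
    """Forward-Euler simulation of equation 2.1; returns the rate in Hz."""
    V, spikes = 0.0, 0
    for _ in range(int(T / dt)):
        V += dt * (-V * g_leak + I_syn) / C
        if V >= V_th:      # threshold crossing: count a spike, reset to 0
            spikes += 1
            V = 0.0
    return spikes / T

for g_leak in (16e-9, 32e-9):          # two illustrative leak conductances
    for I_syn in (0.8e-9, 1.2e-9, 1.6e-9):
        f_sim = lif_rate(I_syn, g_leak)
        f_eq = I_syn / (C * V_th) - g_leak / (2 * C)   # equation 3.2
        print(f"g_leak = {g_leak * 1e9:.0f} nS, I = {I_syn * 1e9:.1f} nA: "
              f"simulated {f_sim:.1f} Hz, equation 3.2 gives {f_eq:.1f} Hz")
# Doubling g_leak lowers every rate by about delta_g/(2C) = 8 Hz: a shift
# of the current-discharge curve, not a change in its slope.
```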
Shunting inhibition (varying g_leak) above threshold acts as a constant, hyperpolarizing current source, quite distinct from its subthreshold behavior.

A similar mechanism explains the result for the compartmental models. Here, the voltage rises above the firing threshold, but the spiking mechanism acts as a kind of voltage clamp on a long time scale (Koch et al., 1995) so that the time-averaged voltage remains approximately constant. Furthermore, spiking conductances are so large that during a spike, any proximal synaptic conductances will be ignored.

3.2 Distal Inhibition. Distal synapses are not so tightly coupled electrically to the soma, so one might expect that distal GABAA inhibition might act divisively. Since the spiking mechanism can be thought of as a kind of voltage clamp (Koch et al., 1995), one can study the neuron's response by examining the current into the soma when it is clamped at the time-averaged voltage V_s (Abbott, 1991). For analysis, we simplify the dendritic tree into a single finite cable that has an excitatory synapse (conductance g_E, reversal potential E_E) and an inhibitory synapse (g_I and E_I) located at the other end. The cable has length l, radius r, specific membrane conductance G_leak, intracellular resistivity R_i, and a length constant λ = sqrt(r/(2 R_i G_leak)). Using the cable equation, one can show that at steady state, the current flowing into the soma from the cable is

I_soma = −g_∞ V_s + 2 g_∞ e^(−L) [ g_E (E_E − V_s e^(−L)) + g_I (E_I − V_s e^(−L)) + g_∞ V_s e^(−L) ] / [ (g_E + g_I)(1 − e^(−2L)) + g_∞ (1 + e^(−2L)) ],   (3.3)

where L = l/λ is the electrotonic length of the cable and g_∞ = π r^(3/2) sqrt(2 G_leak/R_i) is the input conductance of a cylinder of the same radius but with infinite length. (In this equation, all voltages are relative to the leak reversal potential, not to ground.) Despite the simplification involved in equation 3.3, it qualitatively describes the response of the compartmental model. First, in the absence of any cable (L = 0), this equation becomes linear in both g_E and g_I, and inhibition acts to subtract a constant amount from I_soma, as we have shown above.

¹ This expression can also be derived from the Laurent expansion of the current-discharge curve for a leaky integrator, f(I_syn) = −g_leak/[C log(1 − V_th g_leak/I_syn)], in terms of 1/I_syn around I_syn = ∞ (Stein, 1967).
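Equation 3.3 can be evaluated directly to see how much of the inhibitory effect is subtractive. The sketch below is our illustration; the values L = 1, g_inf = 1, V_s = 20 mV, E_E = 70 mV, and E_I = 0 (all relative to the leak reversal, loosely following the −70 mV/−50 mV figures discussed later) are arbitrary illustrative choices, not the paper's fitted parameters.

```python
import numpy as np

def I_soma(gE, gI, L=1.0, g_inf=1.0, Vs=20.0, EE=70.0, EI=0.0):
    """Steady-state somatic current from equation 3.3. Voltages are in mV
    relative to the leak reversal potential, so EI = 0 places the GABA_A
    reversal at the leak reversal (about -70 mV to ground) and Vs = 20
    corresponds to a soma clamped near -50 mV. Conductances are in units
    of g_inf."""
    eL, e2L = np.exp(-L), np.exp(-2.0 * L)
    num = gE * (EE - Vs * eL) + gI * (EI - Vs * eL) + g_inf * Vs * eL
    den = (gE + gI) * (1.0 - e2L) + g_inf * (1.0 + e2L)
    return -g_inf * Vs + 2.0 * g_inf * eL * num / den

for gE in (0.5, 1.0, 2.0, 4.0):
    shift = I_soma(gE, 0.5) - I_soma(gE, 1.5)
    print(f"gE = {gE:.1f}: I_soma = {I_soma(gE, 0.5):6.2f}, "
          f"stronger inhibition removes {shift:5.2f}")
# The amount removed by the stronger inhibition varies only weakly with
# gE, while I_soma itself changes sign and grows: mostly a subtraction.
```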
Figure 3: Why shunting inhibition has a subtractive rather than a divisive effect on an integrate-and-fire unit. (A) The time-dependent current across the leak conductance, I_leak(t) = V(t) g_leak (in nA), in response to a constant 0.5 nA current into a leaky integrate-and-fire unit with (solid line) and without (dashed line) a voltage threshold, V_th. The sharp drops in I_leak occur when the cell fires, since the voltage is reset. (B) Same for a 1 nA current. Note that I_leak with a voltage threshold has a maximum value well below I_leak without a voltage threshold. (C) Time-averaged leak current ⟨I_leak⟩ in nA as a function of input current, computed from the analytic expression. Below threshold, the spikeless model and the integrate-and-fire model have the same I_leak, but above threshold ⟨I_leak⟩ drops for the integrate-and-fire model because of the voltage threshold. For I_syn just greater than threshold, the cell spends most of its time with V ≈ V_th, so ⟨I_leak⟩ is high (panel A; at threshold, I_leak = V_th g_leak). For high I_syn, the voltage increases approximately linearly with time, so V has a sawtooth waveform as shown in panel B. This means that ⟨I_leak⟩ = (max I_leak)/2 = V_th g_leak/2.
When L ≠ 0, some divisive effect is expected since g_I appears in the denominator. However, a subtractive effect will persist due to the term containing g_I in the numerator. The reversal potential of GABAA synapses (increasing a membrane conductance to chloride ions) is in the neighborhood of −70 mV relative to ground, while the time-averaged voltage when the model neuron is spiking is around −50 mV (Koch et al., 1995). When a cell is spiking, therefore, a nonzero driving potential exists for GABAA inhibition. In the pyramidal cell model, the subtractive effect turns out to be much more prominent than the divisive effect even for quite distant inhibition (see Figure 4A).
Figure 4: The effect of more distant inhibition on firing rates for the layer V pyramidal cell. (A) The effect on the input-output relationship when inhibitory synapses are more distant from the soma ("far" configuration in Table 1). Inhibitory rates are 0.5 Hz (solid) and 8 Hz (dashed). In these simulations, E_I, the GABAA reversal potential, had its usual value of −70 mV. Top: The voltage clamp current when the soma is clamped to −50 mV, close to the spiking threshold of the cell. Bottom: The adapted firing rate when the soma is not clamped. Very little divisive effect is visible on the firing rate; there is a slope change for firing rates less than 20 Hz, but this is too small to have a significant effect. (B) Same as A, except that E_I and the reversal potential of all leak conductances were changed to −50 mV so there is no driving force behind the GABAA synapses or the membrane passive conductance. A clear change in slope for low firing rates is evident. However, even for this rather unphysiological parameter manipulation, subtraction prevails at moderate and high input rates.
Both the inhibitory and excitatory synapses have been moved to more than 100 µm (which is more than 1 λ) away from the soma. To a good approximation, inhibition still subtracts a constant from both the current delivered to the soma and the firing rate of the cell. Equation 3.3 predicts that if the term containing g_I is removed from the numerator, then a divisive effect might be visible. When we changed the reversal potential E_I of the GABAA synapses and the leak reversal potential E_leak to −50 mV, we found that there is indeed an observable change in slope at a low firing rate (see Figure 4B). For higher firing rates, however, inhibition still acts approximately subtractively. We demonstrate this in the extreme case of moving both E_I and the reversal potential associated with the leak conductance to −50 mV (also the value to which the somatic terminal of the cable is clamped).
Figure 5: When gE is not small compared to gI , then inhibition acts more subtractively than divisively even when the IPSP reversal potential is equal to the somatic voltage. The two upper curves are Isoma as a function of gE for two different values of gI such that gI + B changes by a factor of two in equation 3.4. The lower curve is the difference between those two curves. For gE > gI + B, it is approximately constant over a large range. Units of gE are such that gI + B = 1 for the smaller value of gI .
Under these conditions, equation 3.3 simplifies to

I_soma(g_E, g_I) = constant + A g_E/(g_E + g_I + B),   (3.4)
where A and B are independent of g_E and g_I. Clearly if g_E is small, changing g_I simply changes the slope. When g_E is not small, then it turns out that changing g_I has an effect that is more subtractive than divisive (see Figure 5). Since in the "far" model both kinds of synapses have the same distribution, the synaptic conductances are proportional to the input firing rates. With the synaptic parameters we have used (see Table 1), g_E/g_I = 0.8 f_E/f_I; therefore, we expect to see a subtractive effect when f_E > 6 Hz, and this is approximately true (see Figure 4B).

4 Discussion

Divisive normalization of firing rates has become a popular idea in the visual cortex (Albrecht & Geisler, 1991; Heeger, 1992; Heeger, Simoncelli, &
Movshon, 1996). It has been suggested that this is accomplished through shunting inhibitory synapses activated by cortical feedback (Carandini & Heeger, 1994; for a similar suggestion in the electric fish, see Nelson, 1994). Most discussions of shunting inhibition have assumed that the voltage at the location of the shunt is not constrained and may rise as high as necessary (e.g., Blomfield, 1974; Torre & Poggio, 1978; Koch, Poggio, & Torre, 1982, 1983). However, when the shunt is located close to the soma, the voltage at the site of the shunt cannot rise above the spike threshold. Therefore, the current that can flow into the cell through the shunting synapse is limited and at moderate rates becomes a constant offset (see Figure 3C). Under these circumstances, shunting inhibition implements a linear subtractive operation.

Even if the conductance change is not located close to the soma, it may not have a divisive effect (see Figure 4A). First, when the cell is spiking, shunting inhibition is not "silent"; there is a significant driving force behind GABAA inhibition, since the somatic voltage is clamped to approximately −50 mV by the spiking mechanism, and the reversal potential for GABAA inhibition is in the neighborhood of −70 mV. Second, even if the reversal potential for GABAA and the leak reversal potential are set to −50 mV, inhibition acts divisively only if the excitatory synaptic conductance is small compared to the inhibitory conductance (see Figures 4B and 5). Large excitatory conductances are expected when the cell receives significant input (Bernander et al., 1991; Rapp, Yarom, & Segev, 1992), so the subtractive effect of large conductances is relevant physiologically.

Current-discharge curves are affected as predicted by the simple integrate-and-fire model in response to inhibitory postsynaptic potentials and GABA iontophoresis in motoneurons in vivo (Granit, Kernell, & Lamarre, 1966; Kernell, 1969) and cortical cells in vitro (Connors, Malenka, & Silva, 1988; Berman, Douglas, & Martin, 1992). The input-output curves for different amounts of inhibition do not diverge for larger inputs, as would be required for a divisive effect; in fact, they converge at high rates because of the refractory period (Douglas & Martin, 1990). In recordings from Limulus eccentric cells, current-discharge curves show both a slope change and a shift (Fuortes, 1959) because the site of current injection is distant from the site of spike generation.² Rose (1977) showed that iontophoresis of GABAA onto an in vivo cortical network appeared to act divisively. Because shunting inhibition has a subtractive effect on single cells, this could possibly be caused by a network effect (Douglas, Koch, Mahowald, Martin, & Suarez, 1995).

For synapses close to the spike-generating mechanism, as well as for the integrate-and-fire unit, the subtractive effect of conductance changes does
² In this case, only the current-discharge curve was measured; the considerations in Figures 4 and 5 are not relevant.
not depend on the reversal potential of the conductance. Changing the reversal potential is equivalent to adding a constant current source in parallel with the conductance, and in a single-compartment model this will obviously merely shift the current-discharge curve. Therefore, like inhibitory input, proximal excitatory input does not change the gain of other superimposed excitatory input.

Our analysis assumes that synaptic inputs change on a time scale slower than an interspike interval. High temporal frequencies may be present in synaptic input currents for irregularly spiking neurons. Furthermore, our analysis assumes passive dendrites; active dendritic conductances complicate the interaction of synaptic excitation and inhibition. Although we cannot rule out that under some parameter combinations shunting inhibition could act divisively on the firing rates, we have not found such a range for physiological conditions. In combination with our integrate-and-fire and single cable models, we believe that a different mechanism is necessary to account for divisive normalization.

Acknowledgments and Notes

We thank the referees for helpful critical comments on this article. This research was supported by the NIMH and the Sloan Center for Theoretical Neuroscience. The compartmental models and associated programs are available from: ftp://ftp.klab.caltech.edu/pub/holt/holt and koch 1997 normalization.tar.gz.

References

Abbott, L. F. (1991). Realistic synaptic inputs for model neural networks. Network 2:245–258.

Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Vis. Neurosci. 7:531–546.

Berman, N. J., Douglas, R. J., & Martin, K. A. C. (1992). GABA-mediated inhibition in the neural networks of visual cortex. Prog. Brain Res. 90:443–476.

Bernander, Ö. (1993). Synaptic integration and its control in neocortical pyramidal cells. Unpublished doctoral dissertation, California Institute of Technology, Pasadena.

Bernander, Ö., Douglas, R., Martin, K., & Koch, C. (1991). Synaptic background activity determines spatio-temporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA 88:1569–1573.

Bernander, Ö., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic input to cortical pyramidal cells. J. Neurophysiol. 72:2743–2753.

Blomfield, S. (1974). Arithmetical operations performed by nerve cells. Brain Res. 69:114–124.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science 264:1333–1335.

Connors, B. W., Malenka, R. C., & Silva, L. R. (1988). Two inhibitory postsynaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406:443–468.

Douglas, R. J., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science 269:981–985.

Douglas, R. J., & Martin, K. A. C. (1990). Control of neuronal output by inhibition at the axon initial segment. Neural Comp. 2:283–292.

Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurones in cat visual cortex. J. Physiol. 440:659–696.

Fuortes, M. G. F. (1959). Initiation of impulses in visual cells of Limulus. J. Physiol. 148:14–28.

Granit, R., Kernell, D., & Lamarre, Y. (1966). Algebraical summation in synaptic activation of motoneurons firing within the "primary range" to injected currents. J. Physiol. 187:379–399.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9:181–197.

Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proc. Natl. Acad. Sci. USA 93:623–627.

Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comp. 24:55–68.

Hines, M. (1993). The NEURON simulation program. In J. Skrzypek (Ed.), Neural Network Simulation Environments. Boston: Kluwer.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81:3088–3092.

Kernell, D. (1969). Synaptic conductance changes and the repetitive impulse discharge of spinal motoneurons. Brain Res. 15:291–294.

Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci. 2:63–82.

Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Proc. R. Soc. Lond. B 298:227–264.

Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interaction in a dendritic tree: Localization, timing and role in information processing. Proc. Natl. Acad. Sci. USA 80:2799–2802.

Nelson, M. E. (1994). A mechanism for neuronal gain control by descending pathways. Neural Comp. 6:242–254.

Payne, J. R., & Nelson, M. E. (1996). Neuronal gain control by regulation of membrane conductance in two classes of spiking neuron models. Fifth Annual Computational Neuroscience Meeting Abstracts, p. 147.

Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Comp. 4:518–533.

Rose, D. (1977). On the arithmetical operation performed by inhibitory synapses onto the neuronal soma. Exp. Brain Res. 28:221–223.
Stein, R. B. (1967). The frequency of nerve action potentials generated by applied currents. Proc. R. Soc. Lond. B 167:64–86.

Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. Lond. B 202:409–416.
Received August 19, 1996; accepted October 31, 1996.
Communicated by John Rinzel
Reduction of the Hodgkin-Huxley Equations to a Single-Variable Threshold Model

Werner M. Kistler, Wulfram Gerstner, J. Leo van Hemmen
Institut für Theoretische Physik, Physik-Department der TU München, D-85747 Garching bei München, Germany

It is generally believed that a neuron is a threshold element that fires when some variable u reaches a threshold. Here we pursue the question of whether this picture can be justified and study the four-dimensional neuron model of Hodgkin and Huxley as a concrete example. The model is approximated by a response kernel expansion in terms of a single variable, the membrane voltage. The first-order term is linear in the input and its kernel has the typical form of an elementary postsynaptic potential. Higher-order kernels take care of nonlinear interactions between input spikes. In contrast to the standard Volterra expansion, the kernels depend on the firing time of the most recent output spike. In particular, a zero-order kernel that describes the shape of the spike and the typical afterpotential is included. Our model neuron fires if the membrane voltage, given by the truncated response kernel expansion, crosses a threshold. The threshold model is tested on a spike train generated by the Hodgkin-Huxley model with a stochastic input current. We find that the threshold model predicts 90 percent of the spikes correctly. Our results show that, to good approximation, the description of a neuron as a threshold element can indeed be justified.

1 Introduction
Neuronal activity is the result of a highly nonlinear dynamic process that was first described mathematically by Hodgkin and Huxley (1952), who proposed a set of four coupled differential equations. The mathematical analysis of these and other coupled nonlinear equations is known to be a hard problem, and an intuitive understanding of the dynamics is difficult to obtain. Hence, a simplified description of neuronal activity is highly desirable and has been attempted repeatedly, for example, by FitzHugh (1961); Rinzel (1985); Abbott and Kepler (1990); Av-Ron, Parnas, and Segel (1991); Kepler, Abbott, and Marder (1992); and Joeken and Schwegler (1995), to mention only a few. An example of a simple model of neuronal spike dynamics is the integrate-and-fire neuron, which dates back to Lapicque (1907)

Neural Computation 9, 1015–1045 (1997)
© 1997 Massachusetts Institute of Technology
and has become quite popular in neural networks modeling (see, e.g., Usher, Schuster, and Niebur, 1993; Abbott & van Vreeswijk, 1993; Tsodyks, Mitkov, & Sompolinsky, 1993; Hopfield & Herz, 1995). In this article we aim at a generalized, and realistic, version of the integrate-and-fire model.

In order to reduce the four nonlinear equations of Hodgkin and Huxley to a simplified model with a single, scalar variable u, we use the spike response method, which has been introduced previously (Gerstner, 1991, 1995; Gerstner & van Hemmen, 1992). A spike is triggered at a time t^f if the membrane potential u approaches a threshold ϑ from below. For t > t^f, with t^f being the most recent firing time, the membrane voltage u in the reduced description is given by an expansion of the form,
u(t) = η(t − t^f) + ∫_0^∞ ds ε^(1)(t − t^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε^(2)(t − t^f; s, s′) I(t − s) I(t − s′) + · · ·   (1.1)
The following spike is generated if u(t) = ϑ and (d/dt) u(t) > 0, which defines the next firing time t^f, and so on.

The kernels η and ε have a direct neuronal interpretation. The first term η(t − t^f) on the right-hand side of equation 1.1 describes the form of a spike and the typical afterpotential following it. The action potential is triggered at t = t^f, it reaches its maximum after a time δ, and it is followed by a period of hyperpolarization that extends over approximately 15 ms (see Figure 1a). From a different point of view, we can think of η(t − t^f) as the neuron's response to the threshold crossing at t^f. Similarly, the kernel ε describes the response of the membrane potential to an input current I. More precisely, ε^(1)(∞; s) as a function of s is the linear response to a small current pulse at time s = 0 given that the neuron did not fire in the past. In biological terms, ε^(1) describes the form of the postsynaptic potential evoked by an input spike (see Figure 1b). Since the responsiveness of the membrane to input pulses is reduced during or shortly after an action potential, the form of the postsynaptic potential depends also on the time t − t^f that has passed since the last output spike. Higher-order terms in equation 1.1 describe the nonlinear interaction between several input pulses.

We concentrate here on the four-dimensional set of equations proposed by Hodgkin and Huxley to describe the spike activity in the giant axon of the squid (Hodgkin & Huxley, 1952). We consider it as a well-studied standard model of spike activity, even though it does not provide a correct description of spiking in cortical neurons of mammals. The methods introduced below are, however, more general and can also be applied to more detailed neuron models, which often involve tens or hundreds of variables (Yamada, Koch, & Adams, 1989; Wilson, Bhalla, Uhley, & Bower, 1989; Traub, Wong, Miles, & Michelson, 1991; and Ekeberg et al., 1991).
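To make the threshold picture concrete, the following sketch (our illustration, not the authors' code) implements the first-order version of equation 1.1 on a discrete time grid. The kernel shapes, the threshold value, and the input protocol are all placeholder assumptions chosen merely to resemble Figure 1 qualitatively.

```python
import numpy as np

dt = 0.1                                    # time step (ms)
tk = np.arange(0, 40, dt)                   # kernel support (ms)
# Placeholder kernels (illustrative only, not the fitted Hodgkin-Huxley
# kernels): a spike/afterpotential template and a PSP-like linear response.
eta = 100.0 * np.exp(-tk / 0.5) - 10.0 * np.exp(-tk / 8.0)
eps = (tk / 2.0) * np.exp(-tk / 2.0)

def srm_first_order(I, theta=4.0):
    """Equation 1.1 truncated after the linear term:
    u(t) = eta(t - t_f) + integral ds eps(s) I(t - s)."""
    u_lin = dt * np.convolve(I, eps)[:len(I)]   # discretized convolution
    last, spikes, u_prev = -10**9, [], 0.0
    for i in range(len(I)):
        u = u_lin[i] + (eta[i - last] if 0 <= i - last < len(eta) else 0.0)
        # A spike is triggered when u crosses theta from below:
        if u >= theta and u_prev < theta and i - last > int(5.0 / dt):
            last = i
            spikes.append(i * dt)
        u_prev = u
    return spikes

# Fluctuating input: a new Gaussian value every 2 ms (piecewise constant
# here; the paper interpolates linearly between the drawn values).
rng = np.random.default_rng(0)
I = np.repeat(rng.normal(0.0, 3.0, 500), int(2 / dt))
print(f"{len(srm_first_order(I))} spikes in {len(I) * dt:.0f} ms")
```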
Figure 1: Kernels corresponding to the Hodgkin-Huxley equations and obtained by the methods described in this article. (a) The kernel η_1 describes the typical shape of a spike and the afterpotential. The spike has been triggered at time t = 0. (b) The first-order kernel ε_1^(1) describes the form of the postsynaptic potential. The solid line in (b) shows the postsynaptic potential evoked by a short pulse of unit strength that has arrived at time t = 0 while no spikes had been generated in the past. If there was an output spike Δt = 6.5 ms (dotted line) or Δt = 10.5 ms (dashed line) before the arrival of the input pulse, the response is significantly reduced. The solid line corresponds to Δt = ∞.
2 Theoretical Framework

2.1 Standard Volterra Expansion. In order to approximate the membrane voltage u of the Hodgkin-Huxley equations, we start with a Volterra expansion (Volterra, 1959),
u(t) = ∫_0^∞ ds ε_0^(1)(s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_0^(2)(s, s′) I(t − s) I(t − s′) + · · ·   (2.1)
The first-order term ε_0^(1)(s) describes the membrane's linear response to an external input current I(t), the second-order term ε_0^(2)(s, s′) takes into account nonlinear interactions between inputs at two different times, and the following terms take care of higher-order nonlinearities. We would like to stress that equation 2.1 is an ansatz whose usefulness we still have to verify. It is by no means evident that the series converges reasonably quickly.

2.2 Expansion During and After Spikes. When dealing with series expansions, we have to examine their convergence. As we will see, the expansion 2.1 converges as long as the input is so weak that no spikes are generated. If a spike is triggered at time t^f by some strong input, the neuronal dynamics jumps onto a trajectory in phase space that is always (nearly) the same: the action potential. During the spike, the evolution is determined by nonlinear intrinsic processes and is hardly influenced by external input. We therefore replace equation 2.1 by
u(t) = η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_1^(2)(t − t^f; s, s′) I(t − s) I(t − s′) + · · ·   (2.2)
Here η_1(t − t^f) describes the evolution of the membrane potential during the spike and the relaxation process thereafter. In biological terms, η_1(t − t^f) gives the form of a spike and the afterpotential. As in equation 2.1, the term ε_1^(1) describes the membrane's linear response to an input I(t). Since inputs are much less effective during and shortly after a spike, the term ε_1^(1) depends on the time t − t^f that has passed since the spike was triggered
at t^f. The ε's are called response kernels. The formal arguments justifying equation 2.2 are discussed in the appendix. The lower index of ε_1^(1) indicates that we take into account the effects of the most recent spike only. We can systematically increase the accuracy of our description by including more and more spikes that have occurred in the recent past. Taking into account the most recent p spikes, we would have:
u(t) = η_p(t − t_1^f, …, t − t_p^f) + ∫_0^∞ ds ε_p^(1)(t − t_1^f, …, t − t_p^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_p^(2)(t − t_1^f, …, t − t_p^f; s, s′) I(t − s) I(t − s′) + · · ·   (2.3)
In general, η_p(σ_1, …, σ_p) is a complicated function of the p arguments. Since we expect that adaptation is the dominant effect, we make the ansatz

η_p(σ_1, …, σ_p) = Σ_{k=1}^{p} η_1(σ_k).   (2.4)
Here η_1(σ) describes a single spike and its afterpotential. A linear summation of the afterpotentials of several preceding spikes is often sufficient to describe the well-known adaptation effects of neuronal activity (Gerstner & van Hemmen, 1992; Ekeberg et al., 1991; Kernell & Sjöholm, 1973). We will show that for the Hodgkin-Huxley model, excellent results are achieved with equation 2.2, that is, with p = 1. In other words, only the most recent output spike affects the dynamics. This is not too surprising since the Hodgkin-Huxley equations show no pronounced adaptation effects.

2.3 Relation to Integrate-and-Fire Models. In passing we note that the integrate-and-fire model is a special case of equation 2.3 with the response kernels (Gerstner, 1995)
η_1(σ) = u_0 e^(−σ/τ) Θ(σ),   ε_1^(1)(σ; s) = (1/C) e^(−s/τ) Θ(s) Θ(σ − s),   (2.5)

where u_0 is the voltage to which the system is reset after a spike, τ is the membrane time constant, C is the membrane capacitance, and Θ(·) is the Heaviside step function with Θ(s) = 1 for s ≥ 0 and 0 otherwise. Since the standard integrate-and-fire model is linear except at firing, the determination of the kernels is simple. In particular, all terms beyond ε_1^(1) vanish.
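As a quick numerical consistency check (our sketch, not from the paper), one can integrate the standard one-step integrate-and-fire equation C du/dt = −C u/τ + I(t), with reset to u_0, and compare it against the kernel expression above; all parameter values below are arbitrary illustrative choices.

```python
import numpy as np

C, tau, u0 = 1.0, 10.0, -3.0        # capacitance, time constant (ms), reset
dt, T = 0.01, 50.0                  # step and duration (ms)
t = np.arange(0.0, T, dt)
rng = np.random.default_rng(1)
I = np.repeat(rng.normal(0.0, 0.3, int(T / 2)), int(2 / dt))  # input current

# Direct Euler integration of C du/dt = -C u / tau + I(t), starting from a
# reset at t = 0 (no further spikes, so sigma = t throughout):
u_direct = np.empty_like(t)
u = u0
for i in range(len(t)):
    u_direct[i] = u
    u += dt * (-u / tau + I[i] / C)

# Kernel form: u(t) = eta_1(t) + int_0^t (1/C) e^(-s/tau) I(t - s) ds
eta1 = u0 * np.exp(-t / tau)
eps = (1.0 / C) * np.exp(-t / tau)
u_kernel = eta1 + dt * np.array(
    [np.dot(eps[: i + 1], I[i::-1]) for i in range(len(t))])

print("max |direct - kernel| =", np.max(np.abs(u_direct - u_kernel)))
# The two agree up to O(dt) discretization error, consistent with the
# kernels given above.
```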
Equation 2.5 applies to the "one-step" integrate-and-fire model without synaptic or dendritic integration, but generalizations are straightforward. We mention three types of generalization:

1. The synaptic input current is not a δ-pulse but is described by a broad pulse α(s), where s = 0 is the time of presynaptic spike arrival. In this case, the response kernel ε_1^(1) has a finite rise time proportional to the width of the input pulse α.

2. Dendritic integration can be incorporated by a broad input pulse α as well. The effect is the same as in the first generalization.

3. We can include a time-dependent threshold ϑ_0 + ϑ(σ), where σ is the time that has passed since the last output spike.

Points 1 and 2 are not relevant for a comparison with the Hodgkin-Huxley model, since the equations of Hodgkin and Huxley do not describe synaptic or dendritic transmission processes either. With respect to point 3, it is easy to see that a time-dependent threshold is equivalent to changing the form of the η_1 kernel. The firing condition is
η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s) + · · · = ϑ_0 + ϑ(t − t^f).   (2.6)
Subtracting ϑ(σ) on both sides of the equation leads back to equation 1.1 with η(σ) = η_1(σ) − ϑ(σ). In section 4.3, we test the performance of integrate-and-fire models with and without time-dependent threshold on a scenario with a fluctuating input current and compare the results with the Hodgkin-Huxley model.

2.4 Derivation of the Kernels. For complicated neuron models described by several nonlinear differential equations, the kernels can be derived systematically by the following procedure. In order to compute the response kernel ε_0^(1), we linearize the equations around the constant solution u(t) = ū with zero input I(t) = 0. The kernel ε_0^(1) is the voltage component of the Green's function of the linearized equation. Second- and higher-order kernels can be obtained analytically by more involved methods. The mathematical details are in the Appendix.

We have explicitly calculated the kernels ε_0^(1), ε_0^(2), and ε_0^(3) for the Hodgkin-Huxley equations using a computer algebra system. Albeit the length of the expressions grows rather fast (the third-order kernel has about 400 simple terms), we found it more convenient, faster, and less consuming of computer memory to calculate the kernels analytically than using a numerical procedure.

It is instructive, however, to see how the kernels can be obtained numerically. To get the linear kernel, we solve the Hodgkin-Huxley equations
numerically for an input current I(t) = c i_{t0}(t), where i_{t0}(t) is a short current pulse of unit strength at t_0, for example, i_{t0}(t) = r^(−1) Θ(t − t_0 + r/2) Θ(r/2 − t + t_0) with r ≪ 1 ms. This is an approximate δ-function. We denote the voltage component of the solution by u[c i_{t0}](t) and set ε_0^(1)(t) = (u[c i_{t0}](t + t_0) − ū)/c in the limit c → 0 and r → 0, where ū is the steady-state solution for zero input current. In a similar vein, we measure the voltage response to two current pulses so as to find ε_0^(2).

The improved kernel ε_1^(1) can only be found numerically. This is done in a similar way as above. We use an input current made up of two current pulses: I(t) = c_1 i_{t1}(t) + c_2 i_{t2}(t). The first pulse at t_1 is strong enough to generate an output spike at t^f. The membrane potential after the spike serves as a reference trajectory u[c_1 i_{t1}]. Now we add the second pulse at time t_2 and consider the difference,
δu = lim_{c_2 → 0, r → 0} (1/c_2) { u[c_1 i_{t1} + c_2 i_{t2}] − u[c_1 i_{t1}] }.   (2.7)
The difference δu(t) gives the values of the linear response kernel ε_1^(1)(σ; s) on a "diagonal line" with (σ; s) = (s + t_2 − t^f; s),

ε_1^(1)(s + t_2 − t^f; s) = δu(t_2 + s).   (2.8)
The procedure is repeated for variable offsets t_2 − t^f by changing the interval between the first and the second pulse. Since the effect of the spike at t^f decreases for t ≫ t^f, we have

lim_{t_2 − t^f → ∞} ε_1^(1)(s + t_2 − t^f; s) = ε_0^(1)(s),   (2.9)

where ε_0^(1)(s) is the analytically derived kernel introduced above.

Finally, the kernel η_1 is determined by solving the Hodgkin-Huxley equations numerically with input I(t) = c i_{t0}(t). The amplitude c has been chosen sufficiently large so as to evoke a spike. The exact value of c is not important. We set η_1(t − t^f) = lim_{r→0} u[c i_{t0}](t) − ū for t > t_0, where t^f is the time when u[c i_{t0}](t) crosses a formal threshold ϑ and ū is the constant solution with zero input. Once the amplitude c is fixed, the kernel η_1 is uniquely determined as the form of an action potential with, apart from the triggering pulse, zero input current. The only free parameter is the threshold ϑ, which is to be found by an optimization procedure described below.
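The pulse-response procedure can be phrased in a few lines of code. The sketch below is our illustration: a toy one-variable model stands in for the Hodgkin-Huxley integrator, and only the extraction logic (apply a small pulse, divide by its amplitude c, shift the time origin to the pulse arrival) mirrors the procedure described above.

```python
import numpy as np

dt = 0.01                       # integration step (ms)

def simulate(I, T=60.0):
    """Toy stand-in for the Hodgkin-Huxley integrator: a passive membrane
    with a weak quadratic nonlinearity (purely illustrative)."""
    n = int(T / dt)
    v = np.zeros(n)
    for i in range(1, n):
        dv = -v[i - 1] / 5.0 + 0.01 * v[i - 1]**2 + I[i - 1]
        v[i] = v[i - 1] + dt * dv
    return v

def pulse(t0, r=0.05, T=60.0):
    """Approximate delta function: a box of width r and unit area at t0."""
    t = np.arange(0.0, T, dt)
    return np.where(np.abs(t - t0) < r / 2, 1.0 / r, 0.0)

# Linear kernel eps_0^(1): response to a small pulse, divided by its
# amplitude c (the c -> 0 limit is approximated by choosing c small).
c, t0 = 1e-3, 5.0
v = simulate(c * pulse(t0))
eps0 = v / c                          # ubar = 0 for this toy model
eps0 = eps0[int(t0 / dt):]            # shift so that s = 0 is pulse arrival
print("eps_0^(1) at s = 1, 5, 10 ms:",
      eps0[int(1 / dt)], eps0[int(5 / dt)], eps0[int(10 / dt)])
# The improved kernel eps_1^(1) follows the same pattern: add a strong
# first pulse that evokes a spike and divide the *difference* of the two
# responses by the second pulse's amplitude (equation 2.7).
```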
2.5 Comparison with Wiener Expansions. The approach indicated so far shows that all kernels can be found by a systematic and straightforward procedure. The kernels can be derived either analytically, as described in the appendix, or numerically by studying the system's response to input pulses on a small set of examples. This is in contrast to the Wiener theory (Wiener, 1958; Palm & Poggio, 1977), which analyzes the response of a system to gaussian white noise. Since Wiener's approach to the description of nonlinear systems is a stochastic one, the determination of the Wiener kernels requires large sets of input and output data (Palm & Poggio, 1978; Korenberg, 1989). Exploiting the deterministic response of the system to well-designed inputs (short pulses) thus simplifies things significantly. The study of impulse response functions is a well-known approach to linear or weakly nonlinear systems. Here we have extended this approach to highly nonlinear systems under the proviso that the response kernels ε are (almost everywhere) continuous functions of their arguments.

It is important to keep in mind that Volterra and Wiener series cannot fully reproduce the threshold behavior of spike generation, even if higher-order terms are taken into account. The reason is that these expansions can only approximate mappings that are smooth, whereas the mapping from input current I(t) to the time course of the membrane voltage has an apparent discontinuity at the spiking threshold (see Figure 2). Of course, this is not a discontinuity in the mathematical sense, but the output is very sensitive to small changes in the input current (Cole, Guttman, & Bezanilla, 1970; Rinzel & Ermentrout, 1989). We have to correct for this by adding the kernel η each time the series expansion indicates that a spike will occur, say, by crossing an appropriate threshold value. In doing so, we no longer expand around the constant solution u(t) = ū, but around u(t) = η(t − t^f), the time course of a standard action potential. The general framework outlined in this section will now be applied to the Hodgkin-Huxley equations.

3 Application to Hodgkin-Huxley Equations
Applying the theoretical analysis to the Hodgkin-Huxley equations, we begin by specifying the equations and then explain how the reduction to the spike response model is performed.
3.1 Hodgkin-Huxley Spike Trains. According to Hodgkin and Huxley (Hodgkin & Huxley, 1952; Cronin, 1987), the voltage v(t) across a small patch of axonal membrane (e.g., close to the hillock) changes at a rate given by

C dv/dt = −g_Na m^3 h (v − v_Na) − g_K n^4 (v − v_K) − g_L (v − v_L) + I(t),   (3.1)

where I(t) is a time-dependent input current. The constants v_Na, v_K, and v_L are the equilibrium potentials corresponding to sodium, potassium, and "leakage" currents, and the g's are parameters of the respective ion conductances.
Figure 2: Threshold behavior of the Hodgkin-Huxley equations (see equations 3.1 and 3.2). We show the response of the Hodgkin-Huxley equations to a current pulse of 1 ms duration. A current amplitude of 7.0 µA cm⁻² suffices to evoke an action potential (solid line; the maximum at 100 mV is out of scale), whereas a slightly smaller pulse of 6.9 µA cm⁻² fails to produce a spike (dashed line). The time course of the membrane voltage is completely different in both cases. Therefore, the mapping of the input current onto the time course of the membrane voltage is highly sensitive to changes in the input around the spiking threshold. The bar in the upper left indicates the duration of the input pulse.
The variables m, n, and h change according to a differential equation of the form

dx/dt = α_x(v)(1 − x) − β_x(v) x   (3.2)
with x ∈ {m, n, h}. The parameters are given in Table 1.

For a biological neuron that is part of a larger network of neurons, the input current is due to spike input from many presynaptic neurons. Hence the input current is not constant but fluctuates. In our simulations, we therefore use an input current generated by the following procedure. Every 2 ms, a random number is drawn from a gaussian distribution with zero mean and standard deviation σ. The discrete current values at intervals of 2 ms are linearly interpolated and define the input I(t). This approach is intended to mimic the effect of spike input into the neuron. The procedure is somewhat arbitrary but easily implementable and leads to a spike train with a broad interval distribution and realistic firing rates depending on σ, as shown in Figure 3.
Table 1: Parameters of the Hodgkin-Huxley Equations (Membrane Capacity, C = 1 µF/cm²)

x    v_x        g_x
Na   115 mV     120 mS/cm²
K    −12 mV     36 mS/cm²
L    10.6 mV    0.3 mS/cm²

x    α_x(v)                                   β_x(v)
n    (0.1 − 0.01v)/[exp(1 − 0.1v) − 1]        0.125 exp(−v/80)
m    (2.5 − 0.1v)/[exp(2.5 − 0.1v) − 1]       4 exp(−v/18)
h    0.07 exp(−v/20)                          1/[exp(3 − 0.1v) + 1]
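The following sketch (ours, not the authors' code) integrates equations 3.1 and 3.2 with the rate functions of Table 1 and the fluctuating input protocol just described, using plain forward Euler steps; the spike count it reports depends on σ and on the random seed, and the removable singularities of α_n and α_m at v = 10 and v = 25 are not handled specially.

```python
import numpy as np

# Hodgkin-Huxley parameters from Table 1 (voltages in mV, relative to rest).
C = 1.0                                   # membrane capacity (uF/cm^2)
g = {'Na': 120.0, 'K': 36.0, 'L': 0.3}    # peak conductances (mS/cm^2)
E = {'Na': 115.0, 'K': -12.0, 'L': 10.6}  # equilibrium potentials (mV)

alpha = {'n': lambda v: (0.1 - 0.01*v) / (np.exp(1 - 0.1*v) - 1),
         'm': lambda v: (2.5 - 0.1*v) / (np.exp(2.5 - 0.1*v) - 1),
         'h': lambda v: 0.07 * np.exp(-v / 20)}
beta  = {'n': lambda v: 0.125 * np.exp(-v / 80),
         'm': lambda v: 4 * np.exp(-v / 18),
         'h': lambda v: 1 / (np.exp(3 - 0.1*v) + 1)}

def fluctuating_current(T, sigma, dt, seed=0):
    """Gaussian values drawn every 2 ms, linearly interpolated."""
    rng = np.random.default_rng(seed)
    knots = rng.normal(0.0, sigma, int(T / 2) + 2)
    t = np.arange(0.0, T, dt)
    return np.interp(t, 2.0 * np.arange(len(knots)), knots)

def run_hh(T=1000.0, sigma=3.0, dt=0.01):
    I = fluctuating_current(T, sigma, dt)
    v, x = 0.0, {'n': 0.32, 'm': 0.05, 'h': 0.60}  # approximate rest values
    spikes, above = 0, False
    for i in range(int(T / dt)):
        m3h, n4 = x['m']**3 * x['h'], x['n']**4
        dv = (-g['Na']*m3h*(v - E['Na']) - g['K']*n4*(v - E['K'])
              - g['L']*(v - E['L']) + I[i]) / C
        for k in x:                               # gating kinetics, eq. 3.2
            x[k] += dt * (alpha[k](v)*(1 - x[k]) - beta[k](v)*x[k])
        v += dt * dv
        if v > 50.0 and not above:                # count upward crossings
            spikes += 1
        above = v > 50.0
    return spikes

print("spikes in 1 s:", run_hh())
```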
We emphasize that our single-variable model is intended to reproduce the firing times of the spikes generated by the Hodgkin-Huxley equations. This is a harder problem than fitting the gain function for constant input current.

3.2 Approximation by a Threshold Model. We want to reduce the four Hodgkin-Huxley equations (3.1 and 3.2) to the spike response model (see equation 2.2) in such a way that the model generates a spike train identical with or at least similar to the original one of Figure 3a. The reduction will involve four major steps:

1. Calculate the response functions ε_0^(1), ε_0^(2), and ε_0^(3).

2. Derive the kernel η_1.

3. Analyze the corrections to the ε kernels that are caused by the most recent output spike.

4. Determine a threshold criterion for firing.

Before starting, we note that for I(t) = Ī, the Hodgkin-Huxley equations allow a stationary solution with constant values v(t) = v̄, m(t) = m̄, h(t) = h̄, and n(t) = n̄. The constant solution is stable if Ī is smaller than a threshold value of 9.47 µA/cm², which is determined by the parameter values given in Table 1.
3.2.1 Standard Volterra Expansion. The kernels ε_0^(1)(s_1) and ε_0^(2)(s_1, s_2) have been derived as indicated in the appendix and are shown in Figures 1b (solid line) and 4. Figure 5 shows a small section of the spike train of Figure 3a with a single spike. Two intervals, one before (lower left) and a second during and immediately after the spike (lower right), are shown on an enlarged scale. Before the spike occurs, the numerical solution of the Hodgkin-Huxley equations (solid) is well approximated by the
Figure 3: Hodgkin-Huxley model. (a) 1000 ms of a simulation of the Hodgkin-Huxley equations stimulated through a fluctuating input current I(t) with zero mean and standard deviation 3 µA/cm². Because of the fluctuating input, the spikes occur at stochastic intervals with a broad distribution. (b) A histogram of the interspike interval (ISI) sampled over a period of 100 s.
first-order Volterra expansion u(t) = ∫ds ε_0^(1)(s) I(t − s) (long dashed line) and even better by the second-order approximation (dotted line). During a spike, however, even an approximation to third order fails completely (see Figure 6).

The discrepancy during a spike may be surprising at first sight but is not completely unexpected. During action potential generation, the neuronal dynamics is mainly driven by internal processes of channel opening and closing, much more than by external input. Mathematically, the reason is that the series expansion is valid only as long as I(t) is not too large; see the Appendix.
Figure 4: Second-order kernel for the Hodgkin-Huxley equations. The figure exhibits a contour plot for the second-order kernel ε_0^(2)(s, s′). The kernel vanishes identically for negative arguments, as is required by causality.
3.2.2 Expansion Including Action Potentials. In order to account for the remaining differences, we have to add explicitly the shape of the spike and the afterpotential by way of the kernel η_1(s). The function η_1(s) has been determined by the procedure explained in section 2 and is shown in Figure 1a. If we add η_1(t − t^f) but continue to work with the standard kernels ε_0^(1)(s), ε_0^(2)(s, s′), …, we find large errors in an interval up to roughly 20 ms after a spike. This is exemplified in the lower right of Figure 5, where the long-dashed line shows the approximation u(t) = η_1(t − t^f) + ∫ds ε_0^(1)(s) I(t − s). From a biological point of view, this is easy to understand. The kernel ε_0^(1)(s) describes a postsynaptic potential in response to a standard input spike. Due to refractoriness, the responsiveness of the membrane is lower after a spike, and for this reason the postsynaptic potential looks different if an input pulse arrives shortly after an output spike. We then have to work with the improved kernels ε_1^(1)(σ; s), ε_1^(2)(σ; s, s′), … introduced in equation 2.2. The dotted line in the lower-right graph of Figure 5 gives the approximation u(t) := η_1(t − t^f) + ∫ds ε_1^(1)(t − t^f; s) I(t − s), and we see that the fit is nearly perfect.
3.2.3 Threshold Criterion. The trigger time t^f of an action potential is given by a simple threshold crossing process. Although the expansion in terms of the ε kernels will never give the perfect form of the spike,
Figure 5: Systematic approximation of the Hodgkin-Huxley model by the spike response model (see equation 2.2). The upper graph shows a small section of the Hodgkin-Huxley spike train of Figure 3a. The lower left diagram shows the solution of the Hodgkin-Huxley equations (solid) and the first (dashed) and second (dotted) order approximation with the kernels ε_0^(1) and ε_0^(2) before the spike occurs, on an enlarged scale. The lower right diagram illustrates the behavior of the membrane immediately after an action potential. The solution of the Hodgkin-Huxley equations (solid line) is approximated by u(t) = η_1(t − t^f) + ∫ds ε_0^(1)(s) I(t − s), dashed line. Using the improved kernel ε_1^(1) instead of ε_0^(1) results in a nearly perfect fit (dotted line).
the truncated series expansion does exhibit rather significant peaks at positions where the Hodgkin-Huxley model produces spikes (see Figure 6). We therefore decided to apply a simple voltage threshold criterion instead of using some other derived quantity such as the derivative v̇ or an effective input current (Koch, Bernander, & Douglas, 1995). Whenever the membrane potential in terms of the expansion (see equation 2.3) reaches the threshold ϑ from below, we define a firing time t^f and add a contribution η_1(t − t^f). The threshold parameter ϑ is considered as a fit parameter and has to be optimized as described below.
Figure 6: Approximation of the Hodgkin-Huxley model by a standard Volterra expansion (see equation 2.1) during a spike. Here we demonstrate the importance of the kernel η_1 as it appears in equation 2.2. First- (long dashed) and second-order approximation (dashed) using the kernels ε_0^(1) and ε_0^(2) produces a significant peak in the membrane potential where the Hodgkin-Huxley equations generate an action potential (solid line), but even the third-order approximation (dotted) fails to reproduce the spike with a peak of nearly 100 mV (far out of scale). The remaining difference is taken care of by the kernel η_1.
4 Simulation Results
We have compared the full Hodgkin-Huxley model and the spike response model (see equation 2.2) in a simulation of spike activity over 100,000 ms (i.e., 100 s). In so doing, we have accepted a spike of the spike response model to be coincident with the corresponding spike of the Hodgkin-Huxley model if it arrives within a temporal precision of ±2 ms.

4.1 Coincidence Measure. As a measure of the rate of coincidence of spikes generated by the Hodgkin-Huxley equations and another model such as equation 2.3, we define an index Γ that is the number of coincidences minus the number of coincidences by chance, relative to the total number of spikes produced by the Hodgkin-Huxley and the other model. More precisely, Γ is the fraction

Γ = (1/N) (N_coinc − ⟨N_coinc⟩) / [ (N_HH + N_SRM)/2 ].   (4.1)
Here N_coinc is the number of coincident spikes of Hodgkin-Huxley and the other model as counted in a single simulation run, N_HH is the number of action potentials generated by the Hodgkin-Huxley equations, and N_SRM is the
number of spikes produced by the other model, say, equation 2.2. The normalization factor N restricts Γ to unity in the case of a perfect coincidence of the other model's spike train with the Hodgkin-Huxley one (N_coinc = N_SRM = N_HH). Finally, ⟨N_coinc⟩ is the average number of coincidences obtained if the spikes of our model were generated not systematically but randomly by a homogeneous Poisson process. Hence Γ ≈ 0 if equation 2.2 produced spikes randomly. The definition 4.1 is a modification of an idea of Joeken and Schwegler (1995).

In order to calculate ⟨N_coinc⟩, we perform a gedanken experiment. We are given the numbers N_HH and N_SRM and divide the total simulation time into K bins of 4 ms length each. Due to refractoriness, each bin contains at most one Hodgkin-Huxley spike; we denote the number of these bins by N_HH. So (K − N_HH) bins are empty. We now distribute N_SRM randomly generated spikes among the K bins, each bin receiving at most one spike. A coincidence occurs each time a bin contains a Hodgkin-Huxley spike and a random spike. The probability of encountering N_coinc coincidences is thus hypergeometrically distributed, that is,

p(N_coinc) = C(N_HH, N_coinc) C(K − N_HH, N_SRM − N_coinc) / C(K, N_SRM),

where C(n, k) denotes the binomial coefficient, with mean ⟨N_coinc⟩ = N_SRM N_HH / K. To see why, we use a simple analogy. Imagine an urn with N_HH black balls and (K − N_HH) white balls, and perform a random sampling of N_SRM balls without replacement. The number of black balls drawn from the urn corresponds to the number of coincidences in the original problem. This setup is governed by the hypergeometric distribution (Prohorov & Rozanov, 1969; Feller, 1970). In passing, we note that dropping the refractory side condition leads to a binomial distribution and, hence, to the same result for ⟨N_coinc⟩.

Using the correct normalization, N = 1 − N_HH/K, we thus obtain a suitable measure: Γ has expectation zero for a purely random spike train, yields unity in case of a perfect coincidence of the spike response model's spike train with the Hodgkin-Huxley one, and is bounded by −1 from below. A negative Γ hints at a negative correlation. Furthermore, Γ is linear in the number of coincidences and monotonically decreasing with the number of erroneous spikes. That is, Γ decreases with increasing N_SRM while N_coinc is kept fixed, or with decreasing N_coinc while N_SRM is kept fixed.

While simulating the spike response model, we have found that due to the long-lasting afterpotential, a single misplaced spike causes subsequent spikes to occur with high probability at false times too. Since this is not a numerical artifact but a well-known and even experimentally observed biological effect (Mainen & Sejnowski, 1995), we have adopted the following procedure in order to eliminate this type of problem from the statistics. Every time the spike response model fails to reproduce a spike of the Hodgkin-Huxley spike train, we note an error and add the kernel η at the position where the original Hodgkin-Huxley spike occurred. Analogously, we count an error and omit the afterpotential if the spike response model produces a spike where no spike should occur. In case of a coincidence between the Hodgkin-Huxley spike and a spike of the spike response model, we take no action; we place the η kernel at the position suggested by the threshold criterion. Without correction, Γ would decrease by at most a few percentage points, and the standard deviation would increase by a factor of two.
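A direct implementation of this coincidence measure is straightforward. The following sketch is our illustration; spike trains are passed as sorted arrays of firing times in ms, and the window-based matching (each Hodgkin-Huxley spike may be claimed at most once within ±2 ms) is our reading of the convention above.

```python
import numpy as np

def gamma_measure(t_hh, t_srm, T, precision=2.0):
    """Coincidence index Gamma of equation 4.1 for two spike trains.
    T is the total simulation time (ms); bins are 2 * precision = 4 ms."""
    K = T / (2.0 * precision)
    n_hh, n_srm = len(t_hh), len(t_srm)
    used = np.zeros(n_hh, dtype=bool)
    n_coinc = 0
    for t in t_srm:
        j = np.searchsorted(t_hh, t)
        for k in (j - 1, j):                      # nearest HH spikes
            if 0 <= k < n_hh and not used[k] and abs(t_hh[k] - t) <= precision:
                used[k] = True
                n_coinc += 1
                break
    expected = n_srm * n_hh / K                   # <N_coinc>, Poisson chance
    norm = 1.0 - n_hh / K                         # normalization factor N
    return (n_coinc - expected) / (0.5 * (n_hh + n_srm) * norm)

# Toy check: identical trains give Gamma = 1; unrelated trains give ~0.
rng = np.random.default_rng(2)
t_hh = np.sort(rng.uniform(0, 100_000, 3000))
print(gamma_measure(t_hh, t_hh, 100_000.0))
print(gamma_measure(t_hh, np.sort(rng.uniform(0, 100_000, 3000)), 100_000.0))
```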
4.2 Low Firing Rate. The results of the simulations for low mean firing rates are summarized in Table 2. Due to the large interspike intervals, effects from spikes in the past are rather weak. Hence we can ignore the dependence of the ε kernels on the former firing times and work with ε_0^(1)(s) instead of ε_1^(1)(σ; s). We do, however, take into account the kernel η_1(s). Higher-order kernels ε_0^(2), ε_0^(3) give a significant improvement over an approximation with ε_0^(1) only. This reflects the importance of the nonlinear interaction of input pulses. In a third-order approximation, the spike response model reproduces 85 percent of the Hodgkin-Huxley spikes correctly (see Table 2).

4.3 High Firing Rate. At higher firing rates, the influence of the most recent output spike is more important than the nonlinear interaction between input pulses. We therefore use the kernel ε_1^(1) but neglect higher-order terms with ε_1^(2), ε_1^(3), … in equation 2.2. Figure 7 gives the results of the simulations with different mean firing rates. As for Table 2, the threshold ϑ and the time δ between the formal threshold crossing time t^f and the maximum of the action potential have been optimized in a separate run over 10 s. For this optimization run we have used an input current with σ = 3 µA/cm², corresponding to a mean firing rate of about 33 Hz; the maximum mean firing rate is in the range of 70 Hz. For firing rates above 30 Hz, the single-variable model reproduces about 90 percent of the Hodgkin-Huxley spikes correctly. This is quite remarkable since we have used only the first-order approximation,
u(t) = η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s),
and neglected all higher-order terms. If we neglected the influence of the last spike, we would end up with a coincidence rate of only 73 percent, even in a second-order approximation using kernels ε_0^(1) and ε_0^(2).

Closer examination of the simulation results for various mean firing rates reveals a systematic deviation of the mean firing rate of the spike response model from the mean firing rate of the Hodgkin-Huxley equations. More precisely, the spike response model generates additional spikes where no spike should occur, and it misses fewer spikes if the standard deviation of the input current is increased, which is quite natural (Mainen & Sejnowski, 1995).

We have performed the same test with two versions of traditional integrate-and-fire models:
Table 2: Simulation Results of the Spike Response Model Using the Volterra Kernels ε_0^(1), …, ε_0^(3)

Order    ϑ (mV)   δ (ms)   N_SRM     N_coinc   N_wrong   N_missed   Γ
First    4.00     2.85     91 ± 8    71 ± 7    20 ± 5    16 ± 3     0.788 ± 0.030
Second   4.70     2.75     87 ± 11   74 ± 8    13 ± 3    13 ± 2     0.845 ± 0.016
Third    5.80     2.50     88 ± 10   76 ± 8    13 ± 3    11 ± 3     0.858 ± 0.019
Note: The table gives the number of spikes produced by the spike response model (N_SRM), the number of coincidences (N_coinc), the number of spikes produced by the spike response model where no spike should occur (N_wrong), and the number of missing spikes (N_missed). The numbers give the average and standard deviation after 10 runs with different realizations of the input current and a duration of 10 s per run. The numbers should be compared with the number of spikes in the Hodgkin-Huxley model, N_HH = 87 ± 8. The coincidence rate Γ has been defined by equation 4.1. For random gambling we would obtain Γ ≈ 0. The parameters ϑ (threshold) and δ (time from threshold crossing to maximum of the action potential) have been determined by optimizing the results in a separate run over 10 s.
a simple integrate-and-fire model with reset potential u_0 and time constant τ, and an integrate-and-fire model with time-dependent threshold derived from the η_1 kernel found by the methods described in section 2.4: ϑ(σ) = −η_1(σ) for σ > 5 ms and ϑ(σ) = ∞ for σ < 5 ms. The results are summarized in Figure 8. We have optimized the parameters τ, ϑ_0, and u_0 with a fluctuating input current as already explained and found optimal performance for a time constant of τ = 1 ms. This surprisingly short time constant should be compared to the time constants of the ε_1^(1) kernel. Indeed, if the time since the last spike is shorter than about 10 ms (see Figure 1b), the ε_1^(1) kernel approaches zero within a few milliseconds. The optimized reset value was u_0 = −3 mV, and the threshold was ϑ_0 = 3.05 mV for the model with and ϑ_0 = 2.35 mV for the model without time-dependent threshold.

4.4 Response to Current Step Functions. Constant input currents have been a paradigm for neuron models ever since Hodgkin and Huxley (1952), although a constant current is very far from reality and one may wonder whether it offers a sensible criterion for the behavior of neurons producing spikes and, thus, responding to fluctuating input currents. Nevertheless, we want to study the response of our model to current steps and compare it with that of the original Hodgkin-Huxley model.

As a consequence of the oscillatory behavior of the first-order kernel ε_1^(1) (see Figure 1b), our model exhibits a phenomenon known as inhibitory rebound.
Figure 7: Simulation results for different mean firing rates using the first-order improved kernel ε_1^(1) only. The black bars indicate the number of spikes (left axis) produced by the Hodgkin-Huxley model; the neighboring bars correspond to the number of spikes of the spike response model. The gray shading reflects the coincident spikes. All numbers are averages of 10 runs over 10 s with different realizations of the input current. The mean firing rate varies between 23 and 40 Hz. The error bars are the corresponding standard deviations. The line in the upper part of the diagram indicates the coincidence rate Γ (right axis) as defined by equation 4.1. The parameters ϑ = 4.7 mV and δ = 2.15 ms have been optimized in a separate run with an input current with standard deviation σ = 3 µA cm⁻².
Suppose we have a constant inhibitory (i.e., negative) input current for time t < 0, which is turned off suddenly at t = 0. We can easily calculate the membrane potential from equation 1.1 if we assume that there are no spikes in the past. The resulting voltage trace exhibits a significant oscillation (see Figure 9). If the current step is large enough, the positive overshooting triggers a single spike after the release of the inhibition. Since equation 1.1 is linear in the input, the amplitude of the oscillatory response of the membrane potential in Figure 9 is proportional to the height of the current step but independent of the absolute current values before or after the step. This is in contrast to the Hodgkin-Huxley equations, where amplitude and frequency of the oscillations depend on both step height and absolute current values (Mauro, Conti, Dodge, & Schor, 1970).

The limitations of a voltage threshold for spike prediction can be investigated
Figure 8: Comparison of various threshold models. The diagram shows the coincidence rate Γ defined in section 4.1 for a simple integrate-and-fire model with time constant τ = 1 ms and reset potential u₀ = −3 mV (i&f), an integrate-and-fire model with time-dependent threshold (i&f*), and the spike response model (SRM). The error bars indicate the standard deviation of 10 simulations over 10 s each with a fluctuating input current with σ = 3 μA cm⁻².
by systematically studying the response of the model to current steps (see Figure 10). For t < 0 we apply a constant current with amplitude I₁. At t = 0 we switch to a stronger input I₂ > I₁. Depending on the values of the final input current I₂ and the step height ΔI = I₂ − I₁, we observe three different modes of behavior in the original Hodgkin-Huxley model, as is well known (Cronin, 1987). For small steps and low final input currents, the membrane potential exhibits a transient oscillation but no spikes (inactive phase I). Larger steps can induce a single spike immediately after the step (single-spike phase S). Increasing the final input current further leads to repetitive firing (R). Note that repetitive firing is possible for I > 6 μA cm⁻², but firing must be triggered by a sufficiently large current step ΔI. Only for currents larger than 9.47 μA cm⁻² is there autonomous firing independent of the current step. We conclude from the step response behavior that there are two different threshold paradigms: a single-spike threshold (dashed line in Figure 10) and a threshold for repetitive firing (solid line). We want to compare these results with the behavior of our threshold model. The same set of step response simulations has been performed with the spike response model. Comparing the threshold lines in the (I₂, ΔI) diagram for the Hodgkin-Huxley model and the spike response model, we can state a qualitative correspondence. The spike response model exhibits the same three modes of response as the Hodgkin-Huxley equations. Let us have a closer look at the two thresholds: the single-spike threshold and the repetitive-firing threshold. The slope of the single-spike threshold (dashed
Figure 9: Response of the spike response model to a current step. An inhibitory input current that is suddenly turned off produces a characteristic oscillation of the membrane potential. The overshooting membrane potential can trigger an action potential immediately after the inhibition is turned off. The amplitude of the oscillation is proportional to the height of the current step. Here, the current is switched from −1 μA cm⁻² to 0 at time t = 0.
line in Figure 10) is far from perfect if we compare the lower and the upper parts of the figure. The slope is determined by the mean ∫ds ε₁^(1)(s) of the linear response kernel and is therefore fixed as long as we stick to the membrane voltage as the relevant variable used in applying the threshold criterion. We now turn to the threshold of repetitive firing. The position of the vertical branch of the repetitive-firing threshold (solid line in Figure 10) is shifted to lower current values as compared to the Hodgkin-Huxley model. Consequently, the gain function of the spike response model is shifted to lower current values too (see Figure 11). The repetitive-firing threshold is directly related to the (free) threshold parameter ϑ and can be moved to larger current values by increasing ϑ. Using ϑ = 9 mV instead of ϑ = 4.7 mV results in a reasonable fit of the Hodgkin-Huxley gain function (see Figure 11). However, a shift of ϑ would also shift the single-spike threshold of Figure 10 to larger current values, and the triggering of single spikes at low currents would be made practically impossible. The value for the threshold parameter ϑ = 4.7 mV found by optimization with a fluctuating input as described in section 4.3 is therefore a compromise. It follows from Figure 10 that there is not a strict voltage threshold for spiking in the Hodgkin-Huxley model. Nevertheless, our model qualitatively exhibits all relevant phases, and, with the fluctuating input scenario, there is even a fine quantitative agreement.
Figure 10: Comparison of the response to current steps of the Hodgkin-Huxley equations (upper graph) and the spike response model (lower graph). At time t = 0 the input current is switched from I₁ to I₂, and the behavior of the models is studied in dependence on the final current I₂ and the step height ΔI = I₂ − I₁. The figures in the center indicate in the manner of a phase diagram three different regimes: (I) the neuron remains inactive; (S) a single spike is triggered; (R) repetitive firing is induced. We have chosen four representative pairs of values of I₂ and ΔI (marked by dots in the phase diagram) and plotted the corresponding time course of the membrane potential on the left and the right of the main diagram so that the responses of the Hodgkin-Huxley model and the spike response model can be compared. The parameters for the spike response model are the same as those detailed in the caption to Figure 7.
Figure 11: Comparison of the gain function (output firing rate for constant input current I as a function of I) of the spike response model and the Hodgkin-Huxley model. The threshold parameter ϑ = 4.7 mV found by the optimization procedure described in section 4.3 gives a gain function of the spike response model (dashed line) that is shifted toward lower current values as compared to the gain function of the Hodgkin-Huxley model (solid line). Increasing the threshold parameter to ϑ = 9 mV results in a reasonable fit (dotted line).
5 Discussion

In contrast to the complicated dynamics of the Hodgkin-Huxley equations, the spike response model in equation 2.2 has a straightforward and intuitively appealing interpretation. Furthermore, the transparent structure of the model has a great number of advantages. The dynamical behavior of a single neuron can be discussed in simple but biological terms of threshold, postsynaptic potential, refractoriness, and afterpotential.

As a first example, we discuss the refractory period. Refractoriness of the Hodgkin-Huxley model leads to two distinct effects. On the one hand, we have shunting effects, a reduced responsiveness of the membrane to input spikes during and immediately after the spike, that is related to absolute refractoriness and is expressed in our model by the first argument of ε₁^(1). In addition, the kernel η₁ induces an afterpotential corresponding to a relative refractory period. In the Hodgkin-Huxley model, the afterpotential corresponds to a hyperpolarization that lasts for about 15 ms, followed by a rather small depolarization. A different form of the afterpotential with a significant depolarizing phase would lead to intrinsic bursting (Gerstner & van Hemmen, 1992). So this aspect of neuronal behavior is easily taken care of by the spike response model.

As a second example, let us focus on the temperature dependence of the various kernels. Temperature can be included in the Hodgkin-Huxley equations
by multiplying the right-hand side of equation 3.2 by a temperature-dependent factor ζ (Cronin, 1987),

ζ = exp[ln(3.0) (T − 6.3)/10.0],   (5.1)
with the temperature T in degrees Celsius. Equation 3.1 remains unmodified. In Figure 12 we have plotted the resulting ε₁^(1) and η₁ kernels for different temperatures. Although the temperature correction affects only the equations for m, h, and n, the overall effect can be approximated by a time rescaling, since the form of both kernels is not modified but stretched in time with decreasing temperature.

Apart from the transparent structure, the single-variable model is open to a mathematical analysis that would be inconceivable for the full Hodgkin-Huxley equations. Most important, the collective behavior of a large network of neurons can now be predicted if the form of the kernels η and ε^(1) is known (Gerstner & van Hemmen, 1993; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996). In particular, the existence and stability of collective oscillations can be studied by a simple graphical construction using the kernels η and ε^(1). It can be shown that a collective oscillation is stable if the sum of the postsynaptic potentials is increasing at the moment of firing; otherwise it is unstable (Gerstner et al., 1996). Furthermore, it can be shown that in a fully connected and homogeneous network of spike response neurons with a potential described by equation 1.1, the state of asynchronous firing is almost always unstable (Gerstner, 1995). In simulations of a network of Hodgkin-Huxley neurons, too, a spontaneous breakup of the asynchronous firing state and a small oscillatory component in the mean firing activity have been observed (Wang, personal communication, 1995).

In summary, we have used the Hodgkin-Huxley equations as a well-studied reference model of spike dynamics and shown that they can indeed be reduced to a threshold model. Our methods are more general and can also be applied to more elaborate models that involve many ion currents and a complicated spatial structure. Furthermore, an analytic evaluation of some of the kernels (see the appendix for mathematical details) is possible. We have also presented a numerical algorithm by which the response kernels can, in contrast to the Wiener method, be determined quickly and easily. In fact, the spike response method we have used in our analysis can be seen as a systematic approach to the reduction of a complex system to a threshold model.
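Returning to the temperature correction, the following Python sketch evaluates the factor ζ of equation 5.1 and approximates a kernel at temperature T by stretching a reference kernel tabulated at 6.3 degrees Celsius, in the spirit of Figure 12. The function names and the interpolation-based rescaling are our own illustrative assumptions, not part of the original numerical procedure.

```python
import numpy as np

def temperature_factor(T, T_ref=6.3, q10=3.0):
    """Temperature-dependent rate factor zeta of equation 5.1 (T in deg C)."""
    return np.exp(np.log(q10) * (T - T_ref) / 10.0)

def rescale_kernel(t, kernel_ref, T):
    """Approximate a kernel at temperature T by rescaling time in the
    reference kernel tabulated on the grid t at T_ref = 6.3 deg C.
    Higher temperature -> faster gating -> kernel compressed in time."""
    zeta = temperature_factor(T)
    return np.interp(zeta * t, t, kernel_ref, left=0.0, right=0.0)
```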
Appendix

In this appendix we treat Volterra's theory (Volterra, 1959) in the context of inhomogeneous differential equations, indicate under what conditions and how the kernels of the Volterra series can be computed analytically, and isolate the mathematical essentials of our own approach.
Figure 12: Temperature dependence of the kernels corresponding to the Hodgkin-Huxley equations. Both the η₁ kernel (a) and the ε₁^(1) kernel (b) show, apart from a stretching in time, no significant change in form if the temperature is decreased from 15.0 (dotted line) and 10.0 (dashed line) to 6.3 (solid line) degrees Celsius.
A.1 Volterra Series. We start by studying the ordinary differential equation

ẋ(t) = f(x(t)) + j(t),   (A.1)

where j is a given function of time and ẋ denotes differentiation of x(t) with respect to t. We require that for j = 0, x = 0 should be an attractive fixed point with a sufficiently large basin of attraction so that the system returns
to equilibrium if the perturbation j(t) is turned off. Furthermore, we require f and j to be continuous and bounded and f to fulfill a Lipschitz condition. Finally, j ∈ C²(ℝ) should have compact support so that x ∈ C²(ℝ). Under these premises there exists a unique solution x, with x(−∞) = 0, for each input j ∈ A ⊂ C²(ℝ). Hence, the time course of the solution x is a function¹ of the time course of the perturbation j; formally,
F : A ⊂ C²(ℝ) → C²(ℝ), j ↦ F[j] = x with ẋ(t) − f(x(t)) = j(t), ∀ t ∈ ℝ.   (A.2)

Let us suppose that the function F is analytic in a neighborhood of the constant solution x = 0. For a precise definition of the notion of analyticity and conditions under which it holds, we refer to Thomas (1996). In the case of analyticity there is a Taylor series (Dieudonné, 1968; Dunford & Schwartz, 1958; Kolmogorov & Fomin, 1975; Hille & Phillips, 1957) for F,
F[j] = 0 + F′[0][j] + (1/2!) F″[0][j, j] + ⋯ .   (A.3)
This series is convergent with respect to the ‖·‖₂-norm; that is, F[j](t) = x(t) for almost all t ∈ ℝ. Each derivative F^(n)[0] in equation A.3 is a continuous n-linear mapping from (C²(ℝ))ⁿ to C²(ℝ). Hence, F^(n)[0][·](t) : (C²(ℝ))ⁿ → ℝ is a continuous n-linear functional for almost all (fixed) t ∈ ℝ. We know that every such functional has a Volterra-like integral representation (Volterra, 1959; Palm & Poggio, 1977; Palm, 1978),

F^(n)[0][j, …, j](t) = ∫ dt₁ ⋯ dtₙ ε_t^(n)(t₁, …, tₙ) j(t − t₁) ⋯ j(t − tₙ).   (A.4)
In the most general case the kernels ε_t^(n) are distributions. We will see, however, that the kernels describing equation A.2 are (almost everywhere) continuous functions on ℝⁿ. Combining equations A.3 and A.4 yields the Volterra series of F,
x(t) = F[j](t) = ∫ dt₁ ε^(1)(t₁) j(t − t₁) + (1/2!) ∫∫ dt₁ dt₂ ε^(2)(t₁, t₂) j(t − t₁) j(t − t₂) + ⋯ .   (A.5)

¹The bracket convention is such that (·) denotes the dependence on a real or complex scalar, and [·] denotes the dependence on a function.
Because we expand around the constant solution x = 0, the ε kernels depend not on the time t but on the difference (t − t′), t′ being the time at which we have stated the initial condition. In this subsection, we calculate the solution for the initial condition x(t′ = −∞) = 0, and the ε kernels do not depend on t at all. We have therefore dropped the index t from ε_t^(n) as it occurs in equation A.4. As is discussed in section A.3, dropping the index is possible only if no spike occurs in the past. We can easily generalize this formalism to systems of differential equations,

ẋ(t) = f(x(t)) + j(t),   (A.6)

with x and f now vector valued,
and obtain

x_k(t) = ∫ dt₁ ε_k^(1)(t₁) j(t − t₁) + (1/2!) ∫∫ dt₁ dt₂ ε_k^(2)(t₁, t₂) j(t − t₁) j(t − t₂) + ⋯ .   (A.7)
As opposed to the lower index t in equation A.4 and the lower index p in equation 2.3, the subscript k in equation A.7 denotes the kth vector component of the system throughout the rest of this appendix.

A.2 Calculation of the Kernels. We want to prove that the kernels ε_k^(1), ε_k^(2), … can be calculated explicitly for fairly arbitrary systems of differential equations. In particular, it is possible to obtain analytic expressions for the kernels in terms of the eigenvalues of the linearization of the N-dimensional system of differential equations we start with. Suppose f in equation A.6 is analytic in the neighborhood of x = 0 so that we can expand f in a Taylor series around x = 0. In doing so, we use the Einstein summation convention. We substitute the Taylor series into equation A.6,
realize that the derivatives are evaluated at x = 0, and switch to Fourier transforms,
(A.9)
The Fourier transform of the Volterra series (see equation A.7) is
We substitute this into equation A.9 and obtain a polynomial (in a convolution algebra) in terms of ĵ(ω),
Here we have defined the abbreviation

c_kl(ω) = iω δ_kl − (∂f_k/∂x_l)(0),   (A.12)

together with analogous abbreviations for the higher-order Taylor coefficients of f at x = 0.   (A.13)
Equation A.11 holds for every function j. The coefficients of ĵ(ω) therefore have to vanish identically, and we are left with a set of linear equations for the kernels, which can be solved consecutively,
(A.14)
Note that we have to invert only one matrix, c_kl(ω), in order to solve equation A.14 for all kernels. The inverse of c_kl(ω) is

(c⁻¹)_kl(ω) = C_kl(ω) / det c(ω),   (A.15)
where C_mn(ω) is the adjunct matrix of c_kl(ω) defined in equation A.12. Solving equation A.14, we obtain for ε_k^(n)(ω₁, …, ωₙ) an expression with a denominator that is a product of the characteristic polynomial p(ω) := det c_kl(ω), evaluated at different frequencies ω. For instance, solving equation A.14 for ε_k^(2)(ω₁, ω₂) yields a denominator p(ω₁) p(ω₂) p(ω₁ + ω₂), the denominator in ε_k^(3)(ω₁, ω₂, ω₃) is p(ω₁) p(ω₂) p(ω₃) p(ω₁ + ω₂) p(ω₁ + ω₃) p(ω₂ + ω₃) p(ω₁ + ω₂ + ω₃), and so forth. If we know the roots of p(ω), that is, the eigenvalues of the system of differential equations, we can factorize the denominator and easily determine the residues of ε_k^(n)(ω₁, …, ωₙ) and thus derive an analytic expression for the inverse Fourier transform of ε_k^(n)(ω₁, …, ωₙ).
The coefficients that appear in this expression are complex numbers given by sums and differences of the N eigenvalues λ_k of the N-dimensional system of differential equations.
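As a minimal numerical rendering of this eigenvalue argument, the Python sketch below computes the first-order kernel of a linearized system ẋ = Ax + e j(t), for which ε_k^(1)(t) = Θ(t) (e^{At} e)_k, as a sum of exponentials of the eigenvalues of A. The matrix A and input direction e stand in for the linearization of equation A.6 and are assumptions of the example, not quantities taken from the paper.

```python
import numpy as np

def first_order_kernel(A, e, k, t):
    """First-order Volterra kernel of the linear system dx/dt = A x + e j(t):
    eps_k^(1)(t) = Theta(t) * (exp(A t) e)_k, expanded over the eigenvalues
    lambda_m of A (assumed diagonalizable).  t is a 1-d array of times."""
    lam, V = np.linalg.eig(A)             # A = V diag(lam) V^{-1}
    c = V[k, :] * np.linalg.solve(V, e)   # residues of the k-th component
    t = np.asarray(t, dtype=float)
    modes = np.exp(np.outer(t, lam)) @ c  # sum_m c_m * exp(lambda_m * t)
    return np.where(t >= 0.0, np.real(modes), 0.0)
```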
The computational difficulties are thus reduced to the calculation of the N roots of a polynomial of degree N, where N is the system dimension; see equation A.6.

A.3 Limitations. In most cases, the function F is not an entire function; in other words, the series expansion around the constant solution x = 0 does not converge for all perturbations j (Thomas, 1996). For the Hodgkin-Huxley equations, we have found numerically that the truncated series expansion gives a fine approximation to the true solution only as long as the input current stays well below the spiking threshold. This is not too surprising, since the solution of the Hodgkin-Huxley equations exhibits an apparent discontinuity at the spiking threshold. Consider a family of input functions given by
j_s : t ↦ j₀ Θ(t + s) Θ(s − t).   (A.18)
These are square pulses with height j₀ and duration 2s. There is a threshold j_ϑ(s) and a small constant δ with δ/j_ϑ(s) ≪ 1 so that for j₀ < j_ϑ(s) − δ
no spike occurs, whereas for j₀ > j_ϑ(s) + δ at least one spike is triggered within a given compact interval (see Figure 2). Thus, the two-norm of the derivative, ‖F′[j_s]‖₂ ≈ ‖F[j_s + δ] − F[j_s − δ]‖₂ / (2δ), takes large values as j₀ approaches the value j_ϑ(s), and we do not expect the series expansion to converge beyond that point.

In the body of this article, we introduced the response function η in order to address this problem. The key advantage of η₁ and the "improved" kernels ε₁^(1), ε₁^(2), … is that we no longer expand around the constant solution x = 0 but around a solution of the Hodgkin-Huxley equations that contains a single spike at t = t^f. In the context of equation A.5, we have argued that the ε kernels do not depend on the absolute time. However, since the new zero-order approximation u(t) = η₁(t − t^f) contains a spike, homogeneity of time is destroyed, and the improved ε kernels depend on (t − t^f), where t^f is the latest firing time. Unfortunately, there is no analytic solution of the Hodgkin-Huxley equations with spikes. We therefore have to, and did, resort to numerical methods in order to determine the kernels ε₁^(n) in the neighborhood of a spike.

In order to construct approximations of solutions with more than one spike, we have to concatenate these single-spike approximations. We can then exploit the fact that the truncated series expansion exhibits rather prominent peaks in the membrane potential at positions where the Hodgkin-Huxley equations would produce a spike. The task is to decide from the truncated series expansion which of the peaks in the approximated membrane potential belong to a spike and which do not. In the main body of the article, we investigated how well this can be done by using a threshold criterion.
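The following Python sketch mimics this concatenation-plus-threshold procedure under several simplifying assumptions of our own: the kernels η₁ and ε₁^(1) are available as callables (with η₁(s) → 0 for large s and ε₁^(1) vectorized over its second argument), only the first-order term of the expansion is kept, times are in milliseconds, and the 2 ms re-trigger guard is an arbitrary choice. It is a sketch of the criterion described above, not the authors' actual code.

```python
import numpy as np

def predict_spikes(j, dt, eta1, eps1, theta, delta):
    """Threshold criterion on the truncated single-spike expansion:
    u(t) = eta1(t - tf) + sum_s eps1(t - tf, s) * j(t - s) * dt, with tf
    the latest firing time; a spike maximum is placed delta after each
    threshold crossing of u."""
    n = len(j)
    s = np.arange(n) * dt          # lags 0, dt, 2*dt, ... (ms)
    tf = -1.0e9                    # "no spike in the past"
    spike_times = []
    for i in range(n):
        t = i * dt
        conv = dt * np.dot(eps1(t - tf, s[:i + 1]), j[i::-1])
        u = eta1(t - tf) + conv
        if u >= theta and t - tf > 2.0:    # crude 2 ms re-trigger guard
            spike_times.append(t + delta)  # maximum follows crossing by delta
            tf = t
    return spike_times
```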
Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant numbers He 1729/2-2 and 8-1.

References

Abbott, L. F., & Kepler, T. B. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical Mechanics of Neural Networks. Berlin: Springer-Verlag.
Abbott, L. F., & Vreeswijk, C. van. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E 48:1483-1490.
Av-Ron, E., Parnas, H., & Segel, L. A. (1991). A minimal biophysical model for an excitable and oscillatory neuron. Biol. Cybern. 65:487-500.
Cole, K. S., Guttman, R., & Bezanilla, F. (1970). Nerve membrane excitation without threshold. Proc. Natl. Acad. Sci. 65(4):884-891.
Cronin, J. (1987). Mathematical aspects of Hodgkin-Huxley theory. Cambridge: Cambridge University Press.
Dieudonné, J. (1968). Foundations of modern analysis. New York: Academic Press.
Dunford, N., & Schwartz, J. T. (1958). Linear operators, Part I: General theory. New York: Wiley.
Ekeberg, O., Wallen, P., Lansner, A., Traven, H., Brodin, L., & Grillner, S. (1991). A computer based model for realistic simulations of neural networks. Biol. Cybern. 65:81-90.
Feller, W. (1970). An introduction to probability theory and its applications (3rd ed., Vol. 1). New York: Wiley.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J. 1:445-466.
Gerstner, W. (1991). Associative memory in a network of "biological" neurons. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 84-90). San Mateo, CA: Morgan Kaufmann.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E 51:738-758.
Gerstner, W., & Hemmen, J. L. van. (1992). Associative memory in a network of "spiking" neurons. Network 3:139-164.
Gerstner, W., & Hemmen, J. L. van. (1993). Coherence and incoherence in a globally coupled ensemble of pulse emitting units. Phys. Rev. Lett. 71(3):312-315.
Gerstner, W., Hemmen, J. L. van, & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comp. 8:1689-1712.
Hille, E., & Phillips, R. S. (1957). Functional analysis and semi-groups. Providence, RI: American Mathematical Society.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of ion currents and its applications to conduction and excitation in nerve membranes. J. Physiol. (London) 117:500-544.
Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire networks. Proc. Natl. Acad. Sci. USA 92:6655.
Joeken, S., & Schwegler, H. (1995). Predicting spike train responses of neuron models. In M. Verleysen (Ed.), Proc. 3rd European Symposium on Artificial Neural Networks (pp. 93-98). Brussels: D facto.
Kepler, T. B., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biol. Cybern. 66:381-387.
Kernell, D., & Sjoholm, H. (1973). Repetitive impulse firing: Comparison between neuron models based on voltage clamp equations and spinal motoneurons. Acta Physiol. Scand. 87:40-56.
Koch, C., Bernander, O., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci. 2:63-82.
Kolmogorov, A. N., & Fomin, S. V. (1975). Reelle Funktionen und Funktionalanalysis. Berlin: VEB Deutscher Verlag der Wissenschaften.
Korenberg, M. J. (1989). A robust orthogonal algorithm for system identification and time-series analysis. Biol. Cybern. 60:267-276.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation electrique des nerfs traitee comme une polarisation. J. Physiol. Pathol. Gen. 9:620-635.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science 268:1503-1506.
Mauro, A., Conti, F., Dodge, F., & Schor, R. (1970). Subthreshold behavior and phenomenological impedance of the giant squid axon. J. Gen. Physiol. 55:497-523.
Palm, G. (1978). On representation and approximation of nonlinear systems. Biol. Cybern. 31:119-124.
Palm, G., & Poggio, T. (1977). The Volterra representation and the Wiener expansion: Validity and pitfalls. SIAM J. Appl. Math. 33(2):195-216.
Palm, G., & Poggio, T. (1978). Stochastic identification methods for nonlinear systems: An extension of the Wiener theory. SIAM J. Appl. Math. 34(3):524-534.
Prohorov, Yu. V., & Rozanov, Yu. A. (1969). Probability. Berlin: Springer-Verlag.
Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Federation Proc. 44:2944-2946.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neuronal excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling. Cambridge, MA: MIT Press.
Thomas, E. G. F. (1996). Q-analytic solution operators for non-linear differential equations (Preprint). Department of Mathematics, University of Groningen.
Traub, R. D., Wong, R. K. S., Miles, R., & Michelson, H. (1991). A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol. 66:635-650.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett. 72:1281-1283.
Usher, M., Schuster, H. G., & Niebur, E. (1993). Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Computation 5:570-586.
Volterra, V. (1959). Theory of functionals and of integral and integro-differential equations. New York: Dover.
Wiener, N. (1958). Nonlinear problems in random theory. Cambridge, MA: MIT Press.
Wilson, M. A., Bhalla, U. S., Uhley, J. D., & Bower, J. M. (1989). Genesis: A system for simulating neural networks. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 485-492). San Mateo, CA: Morgan Kaufmann.
Yamada, W. M., Koch, C., & Adams, P. R. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling: From Synapses to Networks. Cambridge, MA: MIT Press.

Received April 17, 1996; accepted October 28, 1996.
Communicated by Bruce Knight
Noise Adaptation in Integrate-and-Fire Neurons

Michael E. Rudd
Lawrence G. Brown
Department of Psychology, Johns Hopkins University, Baltimore, Maryland 21218, U.S.A.
The statistical spiking response of an ensemble of identically prepared stochastic integrate-and-fire neurons to a rectangular input current plus gaussian white noise is analyzed. It is shown that, on average, integrate-and-fire neurons adapt to the root-mean-square noise level of their input. This phenomenon is referred to as noise adaptation. Noise adaptation is characterized by a decrease in the average neural firing rate and an accompanying decrease in the average value of the generator potential, both of which can be attributed to noise-induced resets of the generator potential mediated by the integrate-and-fire mechanism. A quantitative theory of noise adaptation in stochastic integrate-and-fire neurons is developed. It is shown that integrate-and-fire neurons, on average, produce transient spiking activity whenever there is an increase in the level of their input noise. This transient noise response is either reduced or eliminated over time, depending on the parameters of the model neuron. Analytical methods are used to prove that nonleaky integrate-and-fire neurons totally adapt to any constant input noise level, in the sense that their asymptotic spiking rates are independent of the magnitude of their input noise. For leaky integrate-and-fire neurons, the long-run noise adaptation is not total, but the response to noise is partially eliminated. Expressions for the probability density function of the generator potential and the first two moments of the potential distribution are derived for the particular case of a nonleaky neuron driven by gaussian white noise of mean zero and constant variance. The functional significance of noise adaptation for the performance of networks comprising integrate-and-fire neurons is discussed.

1 Introduction

In this article, we analyze the statistical spiking behavior of stochastic integrate-and-fire neurons, that is, integrate-and-fire neurons that receive probabilistic input. We consider the rather general case in which the random fluctuations in the neural input are modeled as gaussian white noise. We show that neurons whose range of allowable generator potential states is unbounded in the negative domain exhibit a dynamic and probabilistic

Neural Computation 9, 1047–1069 (1997)
© 1997 Massachusetts Institute of Technology
adaptation to the root-mean-square noise level of their input. This suggests a technique by which noise suppression can be carried out in artificial neural networks consisting of such neurons and raises the question of whether a similar noise adaptation mechanism exists in actual biological neural networks.

The stochastic integrate-and-fire model of neural spike generation was introduced about thirty years ago (Gerstein, 1962; Gerstein & Mandelbrot, 1964) and has since been the subject of numerous theoretical papers (e.g., Calvin & Stevens, 1965; Capocelli & Ricciardi, 1971; Geisler & Goldberg, 1966; Gluss, 1967; Johannesma, 1968; Stein, 1965, 1967) and reviews (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989). Because the spike generation model is not a new idea, many users of the model probably assume that its properties are well, if not completely, understood. This is far from the case. Analytical results have generally proved difficult to obtain, and the main results in the literature consist of expressions for the density function and moments of the spike interarrival times of the neuron (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989). In fact, not even a closed-form solution for the interarrival time density is known for the biologically important case of an integrate-and-fire model that includes a membrane decay (Tuckwell, 1988, 1989).

Some behavioral properties of the stochastic integrate-and-fire neuron, such as the firing rate, are most naturally defined in terms of the statistics of an ensemble of identically prepared neurons. For example, in single-cell recording studies of individual neurons, the firing rate dynamics are often illustrated by a response histogram, which is a tabulation of many probabilistic spiking events as a function of the time after stimulus onset. The response histogram represents a statistical sample of spike counts taken from an ensemble of identically prepared neurons. In this article, we present some new analytic results that add considerable insight into the theoretical ensemble spiking statistics of integrate-and-fire neurons. Specifically, we consider the case of an integrate-and-fire neuron whose generator potential is unbounded in the negative domain, and derive an expression for the time-dependent average firing rate of the neuron, assuming an input consisting of a rectangular current plus gaussian white noise. We demonstrate that the neuron has the surprising property of adapting, on average, to the level of the noise in its input. We then analyze the evolution of the probabilistic state of the generator potential to discover the mechanism behind this noise adaptation.

2 The Stochastic Integrate-and-Fire Spike Generation Model

Let U(t) stand for the generator potential of the integrate-and-fire neuron. According to the standard mathematical formalization of integrate-and-fire neurons (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989), we can write a stochastic differential equation that describes the time evolution
of the generator potential. In each small time step, the potential U(t) is increased by an amount that depends on the mean input rate µ(t) and the duration of the step dt. The potential is also perturbed by gaussian white noise with variance per unit time σ²(t). Finally, a percentage α dt of the generator potential decays away as a result of a membrane leak. Thus, the generator potential obeys the stochastic differential equation

dU(t) = (−αU(t) + µ(t)) dt + σ(t) dW(t),
(2.1)
where W(t) is a standard Wiener process with mean zero and unit variance per unit time. As is well known, equation 2.1 describes the evolution of a stochastic diffusion process, known in the mathematical literature as the Ornstein-Uhlenbeck process (OUP) (Uhlenbeck & Ornstein, 1930; Tuckwell, 1988, 1989). The OUP is a type of transformed Brownian motion process. Whenever the generator potential of the integrate-and-fire neuron exceeds the value of a neural firing threshold θ, the neuron emits a single neural pulse, and the generator potential is reset to zero. The neural output signal y(t) thus evolves according to the following formula (Chapeau-Blondeau, Godivier, & Chambet, 1996): If U(t) = θ then y(t) = δ(t0 − t), U(t) = 0; else y(t) = 0.
(2.2)
Equations 2.1 and 2.2 represent the standard formulation of the integrateand-fire model. Note, however, that the effect of the resets on the probabilistic evolution of the generator potential is not explicitly taken into account in the standard formulation of the problem (see equation 2.1). Technically, equation 2.1 describes only the evolution of the generator potential between successive spikes. To develop a probabilistic model of the generator potential that takes the resets into account and is therefore not limited to time intervals in which no spikes occur, we modify equation 2.1 as follows: dV(t) = (−αV(t) + µ(t))dt + σ (t)dW(t) − θ dN(t),
(2.3)
where dN(t) is the number of resets in the interval dt; we now denote the potential by a new symbol, V(t), in order to distinguish clearly the behavior of the generator potential with resets (see equation 2.3) from the behavior of the OUP U(t) (see equation 2.1). To complete our formalization of the stochastic integrate-and-fire neuron, we may then write If V(t) = θ then y(t) = δ(t0 − t), V(t) = 0; else y(t) = 0.
(2.4)
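A direct way to explore equations 2.3 and 2.4 numerically is Euler-Maruyama integration of the generator potential with resets. The Python sketch below is a minimal rendering under our own choices of time step, horizon, and noise seed; the σ in the usage line is an arbitrary illustrative value, since the paper specifies the noise levels only graphically.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_if(mu, sigma, theta=50.0, alpha=0.0, V0=0.0, T=2.0, dt=1e-4):
    """Euler-Maruyama integration of equation 2.3: (leaky) integrate-and-fire
    generator potential with threshold resets.  Returns the spike times."""
    n = int(T / dt)
    V = V0
    spikes = []
    for i in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))         # Wiener increment
        V += (-alpha * V + mu) * dt + sigma * dW  # drift plus noise
        if V >= theta:                            # equation 2.4: emit a spike
            spikes.append(i * dt)
            V = 0.0                               # reset term, -theta dN(t)
    return np.array(spikes)

# e.g., the nonleaky case of Figure 1a: the long-run rate approaches mu/theta
rate = len(simulate_if(mu=150.0, sigma=25.0)) / 2.0
```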
Note that explicit consideration of the resets on the generator potential results in the appearance of an inhibitory term in the stochastic equation 2.3, for the time evolution of the potential that is not present in the standard equation for the generator potential (see equation 2.1). In fact, the only difference between equations 2.1 and 2.3 is the presence of an extra inhibitory
term in equation 2.3 that arises from the operation of the reset mechanism of the integrate-and-fire neuron. The inhibition resulting from the neural resets tends to reduce the potential compared to the hypothetical potential that would be produced in the absence of resets. Thus, the resets tend to hyperpolarize the neuron. We will show that this reset-induced hyperpolarization has important implications for the dynamics of the mean spiking rate of the stochastic integrate-and-fire neuron. Informal reasoning suggests that as the potential is progressively decreased by a series of resets while the threshold remains fixed, the neuron will fire less; thus, the reset mechanism will act to decrease the firing rate progressively until the separate influences that act to increase (input current) and decrease (resets) the generator potential exactly counterbalance one another. Once the effects of these opposing influences equilibrate, the firing rate will stabilize. This reasoning is directly checked in the next section by computing the average spiking rate of the neuron as a function of time.

3 Firing Rate Dynamics

In this section, we investigate the spiking response of the neuron to an input of constant mean rate µ and diffusion coefficient σ² that is turned on at t = 0. The neuron is initially at potential V(0). At time zero it is suddenly perturbed by a rectangular input current, to which is added gaussian white noise. For purposes of exposition, we will momentarily assume that the neuron has no membrane leak. In this special case, we can derive an expression for the time-dependent neural firing rate by using the well-known density function for the first passage time of a Brownian motion initially localized at the point (V(0), 0) to cross an arbitrary voltage level V_a, which is one of the few analytic expressions that we have in our arsenal for understanding the behavior of the neuron. The density function is
f(t) = (V_a − V(0)) / (σ √(2π t³)) · exp[−(V_a − V(0) − µt)² / (2σ²t)]   (3.1)
(Karlin & Taylor, 1975). From this density we immediately obtain the first firing time density:

f₁(t) = (θ − V(0)) / (σ √(2π t³)) · exp[−(θ − V(0) − µt)² / (2σ²t)].   (3.2)
More generally, it can be shown (Rudd, 1996) that the density for the time of the Nth neural spike is

f_N(t) = (Nθ − V(0)) / (σ √(2π t³)) · exp[−(Nθ − V(0) − µt)² / (2σ²t)].   (3.3)
It is easily seen that the average spiking rate is given by the infinite sum

dN(t)/dt = Σ_{N=1}^∞ f_N(t).   (3.4)
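Numerically, the series in equation 3.4 can be truncated once the omitted densities are negligible on the time window of interest. A minimal Python sketch, with the truncation point N_max chosen by us for illustration:

```python
import numpy as np

def f_N(t, N, mu, sigma, theta=50.0, V0=0.0):
    """Density of the N-th spike time, equation 3.3 (nonleaky neuron, t > 0)."""
    a = N * theta - V0
    return a / (sigma * np.sqrt(2.0 * np.pi * t**3)) * \
        np.exp(-(a - mu * t)**2 / (2.0 * sigma**2 * t))

def firing_rate(t, mu, sigma, theta=50.0, V0=0.0, N_max=200):
    """Average firing rate dN/dt of equation 3.4, truncated at N_max terms."""
    return sum(f_N(t, N, mu, sigma, theta, V0) for N in range(1, N_max + 1))
```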
In Figure 1a, we plot the infinite series (3.4) for the parameter set V(0) = 0, µ = 150, θ = 50, and three different values of σ. Note that the firing rate has both a transient noise-dependent component and a sustained noise-independent component. The initial, transient, firing rate is larger for larger values of the input noise. However, in the long run (as t → ∞), as indicated on the plot, dN(t)/dt → µ/θ, a value independent of σ. The asymptotic (sustained) firing rate can be derived by using a mathematical theorem (Parzen, 1962, p. 180), which relies on the fact that the number of spikes N(t) generated in the interval [0, t] is a renewal counting process. We first note that the average value of the random time ξ for the potential initialized at zero to cross the neural threshold is ξ̄ = θ/µ (Rudd, 1996) and then apply the theorem:

lim_{t→∞} N(t)/t = 1/ξ̄.   (3.5)
Multiplying both sides of equation 3.5 by t and taking the time derivative yields dN(t)/dt → ξ̄⁻¹ as t → ∞. In the specific case of the nonleaky integrate-and-fire model, dN(t)/dt → µ/θ as t → ∞. In what follows, we will refer to equation 3.5 as the renewal counting theorem.

When µ = 0, ξ̄ is infinite, and the renewal counting theorem therefore does not apply. However, as µ → 0, dN(t)/dt → 0 as t → ∞, which suggests that the firing rate will tend to zero in the long run when the neuron is driven by gaussian white noise of constant variance and mean zero. The same conclusion is reached by noting that for µ = 0 and as t → ∞, every term in our infinite series expression (3.4) for the neural firing rate goes to zero. We further verify the conclusion by plotting in Figure 1b the spiking rate of a nonleaky integrate-and-fire neuron with µ = 0 driven by gaussian white noise of three different magnitudes.

Our results to this point indicate that when an integrate-and-fire neuron is suddenly perturbed by a noisy input, the neuron produces a transient spiking response that is positively related to the noise fluctuation level; in the long run, the response to noise completely dies away, and the mean firing rate depends only on the deterministic component of the input. In the next section we will show that the tendency of the neuron's response to noise to decay over time represents a form of stochastic neural adaptation. The noise response is eliminated by the reset-induced hyperpolarization of the generator potential. We refer to this phenomenon as noise adaptation. The word adaptation connotes both the idea of dynamic change and the idea that
Figure 1: Average firing rate of the integrate-and-fire neuron as a function of time, plotted for three different values of the generator potential noise level σ. (a) Average firing rate of a nonleaky neuron with mean input rate µ = 150, calculated on the basis of equation 3.4. (b) Average firing rate of a nonleaky neuron with mean input rate µ = 0, calculated on the basis of equation 3.4. (c) Simulated firing rate of a leaky integrate-and-fire neuron (α = 3.33) driven at a constant mean input rate (µ = 150). Additional model parameters in (a), (b), and (c): θ = 50; V(0) = 0. Time in arbitrary units.
the change may serve an adaptive function. To the extent that the function of an individual neuron within a network is to perform a calculation based on a weighted sum of its deterministic input currents, the noise response is spurious, and noise adaptation serves the purpose of reducing the number of noise-induced spikes (i.e., reducing the number of false positives).

Before we present a mathematical proof that the dynamic firing rate reduction represents an adaptation to noise, we will briefly return to the problem of investigating the average firing rate dynamics of integrate-and-fire neurons. So far, we have derived an expression only for the average firing rate of an integrate-and-fire neuron that has no membrane leak. Ideally, we would like to find a general expression for the average firing rate that would apply to leaky integrate-and-fire neurons as well as nonleaky ones. Despite considerable effort, however, we have not been able to obtain such an expression. We therefore rely on computer simulation to analyze the properties of the leaky neuron in order to check the generality of our conclusions.

In Figure 1c, we plot the firing rate of a simulated leaky integrate-and-fire neuron (α = 3.33) for a constant positive mean input (µ = 150) and three different levels of input noise (indicated on the plot). As in the case of the nonleaky neuron, the leaky neuron produces an initial transient response to noise, the magnitude of which is positively related to the input noise level. The transient response is followed by a decline in the firing rate while the input remains constant. Since the neural response decreases dynamically while the input remains constant, it appears that some form of neural adaptation is taking place. In fact, this state of affairs might be taken as a definition of adaptation.

The renewal counting theorem, equation 3.5, can be used, together with any of several exact or approximate expressions for the mean firing time of the leaky-integrator neuron (e.g., Roy & Smith, 1969; Thomas, 1975; Ricciardi & Sacerdote, 1979; Wan & Tuckwell, 1982; Tuckwell, 1989), to derive an expression for the long-run value of the average firing rate of the neuron. Wan and Tuckwell (1982) obtained several approximate expressions for the mean firing time and noted that noise always tends to decrease the average spike interarrival time of the leaky integrator neuron (regardless of whether the mean input rate is positive or zero). From this fact and equation 3.5, it follows that the sustained firing rate of the neuron with a membrane leak increases with the level of the input noise. The simulated firing rates plotted in Figure 1c are consistent with this conclusion. Thus, we conclude that the response to noise is not completely eliminated over time in the leaky neuron. However, our firing rate simulations indicate that a greater amount of noise-response reduction (transient response minus sustained response) is observed at higher input noise levels.

The overall pattern of results indicates that the dynamic reduction in the average neural firing rate acts to reduce the level of the transient noise response in the leaky neuron, with greater compensation occurring at higher
noise levels. This pattern is consistent with the interpretation that leaky integrate-and-fire neurons adapt to reduce their response to noise. The dynamic reduction in the noise response appears to be a general feature of integrate-and-fire neurons, with or without a membrane leak, although the noise dependence of the average firing rate is asymptotically eliminated only in the case of the nonleaky neuron.

4 Generator Potential Dynamics

In order to understand better the mechanism behind the dynamical changes in the neural firing rate, we now return to our analysis of the time-dependent behavior of the generator potential. Setting α = µ(t) = 0 and σ(t) = σ, a constant, in equation 2.3, taking the ensemble averages of the quantities on both sides of the equation, and time-integrating yields
(4.1)
where V(0) is the value of the generator potential at time zero, and N(t) is the average number of resets in the interval [0, t]. Since µ(t) = 0, the neuron is driven by gaussian white noise of mean zero and neural spikes are generated solely on the basis of noise fluctuations. According to equation 4.1, the reset mechanism acts in this special case to decrease progressively the mean value of the generator potential, and since the mean input rate is zero, there is no counteracting tendency for the average potential to be increased by the input. With each additional reset, the average value of the generator potential is further decreased. We anticipate that the reduction in the mean level of the potential will be associated with a concomitant reduction in the neural firing rate. The generator potential will continue to be decreased until the cell eventually stops firing altogether. This anticipated behavior is consistent with our previous result based on the renewal counting theorem, which indicated that a nonleaky neuron driven by noise alone turns itself off in the long run. Without the insight gained from a stochastic analysis of the neuron’s behavior, an empirical observation of the neuron’s behavior might lead to the mistaken conclusion that the firing rate is subject to negative feedback. Further insight can be gained by averaging both sides of equation 2.3 (with α = µ(t) = 0 and σ (t) = σ , a constant) and dividing by dt to obtain: dN(t) dV(t) = −θ . dt dt
(4.2)
Since, as proved in the previous section, the average firing rate goes to zero as t → ∞, the time derivative of the average value of the generator potential also must go to zero in the same limit.
Noise Adaptation
1055
This approach is easily generalized to the case of an integrate-and-fire neuron with arbitrary parameters. In the most general case, dN(t) dV(t) = −αV(t) + µ(t) − θ . dt dt
(4.3)
In the last section it was shown that the firing rate tends to a constant as t → ∞. In that limit, dV(t)/dt → 0. By making this substitution in equation 4.3, we obtain the asymptotic firing rate µ − αV(∞) dN(∞) = , dt θ
(4.4)
where we have assumed µ(t) to be constant. For α = 0 (nonleaky neuron), dN(t)/dt → µ/θ as t → ∞, which is consistent with our earlier result based on the renewal counting theorem. For arbitrary α, equation 4.4 can be rewritten in the form θ V(∞) = α
Ã
! dN(∞) µ− . dt
(4.5)
The asymptotic level of the mean generator potential can be obtained from equation 4.5 by using the renewal counting theorem, equation 3.5, and any of several expressions for the mean firing time (Roy & Smith, 1969; Thomas, 1975; Ricciardi & Sacerdote, 1979; Wan & Tuckwell, 1982; Tuckwell, 1989) to compute the asymptotic average firing rate of the leaky neuron. For a leaky neuron driven by gaussian white noise of mean zero, the asymptotic value of V(t) will be negative and of larger absolute value at higher noise levels, since the asymptotic firing rate is larger for larger values of σ . Again assuming that α = 0 and µ(t) constant, we next substitute equation 3.4 into equation 4.3 to obtain ∞ X dV(t) =µ−θ fN (t) dt N=1
(4.6)
and V(t) = V(0) + µt − θ
Z tX ∞ 0 N=1
fN (s) ds.
(4.7)
We then rearrange equation 4.7 to obtain "Z V(0) − V(t) = θ
∞ tX
0 N=1
fN (s) ds −
³µ´ θ
# t .
(4.8)
The time integral on the right-hand side of equation 4.8 is the average number of spikes emitted up to time t. The quantity in parentheses is the average firing rate of an integrate-and-fire neuron with no input noise. Therefore, the quantity in brackets is the difference between the number of spikes emitted in the interval [0, t] with and without input noise. Equation 4.8 indicates that the amount by which the neural potential at time t is decreased, or hyperpolarized, relative to the resting potential is equal to the product of the threshold and the number of "extra" spikes generated when gaussian white noise is added to the deterministic input current. We thus conclude that the generator potential is hyperpolarized, on average, by an amount that depends only on the level of the noise, not on the average input level. This result supports our claim that the adaptation observed in the average spiking rate of the stochastic integrate-and-fire neuron is an adaptation to noise.

Further knowledge of the dynamic behavior of V(t) can be gained by reasoning qualitatively from equation 4.6 while simultaneously recalling the dynamic behavior of the infinite sum (i.e., the average firing rate) illustrated in Figure 1a. At time zero, the average firing rate of the neuron is zero, so the rate of change in V(t) must initially be positive, and V(t) itself must initially increase. In order for the time derivative of the mean generator potential to change signs, its value must pass through zero. This occurs when the average firing rate is µ/θ. The average firing rate first increases with time, then decreases asymptotically to the value µ/θ. It attains the value µ/θ at only one finite time, which occurs during the rising portion of the firing rate function. Let us call this time t₀. It follows that V(t) first increases from the value V(0) during the interval [0, t₀), then decreases during the interval (t₀, ∞).

We verified these conclusions by plotting equation 4.7 as a function of time in Figure 2. Values of V(t) were computed numerically with V(0) = 0. Figure 2a illustrates the initial behavior of V(t) examined over a brief period. The plot indicates that V(t) first increases and thereafter monotonically decreases, as anticipated by the preceding analysis. We also verified by plotting (not shown) the average firing rate of a nonleaky neuron with the same parameters that the turnaround occurs during the initial rising phase of the neural firing rate, when the firing rate first attains the level µ/θ. Figure 2b illustrates the behavior of V(t) on a larger time scale. The initial rise in V(t) and subsequent turnaround are not obvious at this scale, but the tendency for V(t) to decrease monotonically at large adaptation times is clear.

When µ = 0 (nonleaky neuron driven by gaussian white noise alone), V(t) is a monotonically decreasing function of time. In that case, the infinite sum in equation 4.6 goes to zero as t → ∞, so the time derivative of the average potential must also asymptotically go to zero. The integral in equation 4.7 diverges for large t (since it is the sum of an infinite number of density functions and the time integral of each goes to one for t large). Thus, when µ = 0, V(t) → −∞ as t → ∞. After a long period of adaptation to gaussian white noise of mean zero,
Figure 2: Time dependence of the mean generator potential of a nonleaky integrate-and-fire neuron, computed on the basis of equation 4.7. (a) The initial behavior of V(t), illustrated on a short time scale. V(t) first rises, then monotonically decreases with time. (b) When viewed over a longer adaptation time, including the brief initial period illustrated in (a), a deceleration of the final monotonic decrease in V(t) is apparent. Model parameters: θ = 1; α = 0; µ = 2 × 10−4 ; V(0) = 0. Time in arbitrary units.
Figure 3: Average level of the generator potential of a nonleaky integrate-and-fire neuron driven by gaussian white noise with mean zero and constant variance, as a function of the standard deviation of the time-integrated noise in the Brownian motion process that drives the neuron. Additional model parameters: θ = 50; α = 0; µ = 0; V(0) = 0.
the generator potential will thus, on average, be large and negative. An informal argument suggested by a reviewer allows us to use this fact to find the approximate functional dependence of the average potential on the variables σ and t. At large negative values along the potential continuum, the thresholded Brownian motion will appear identical to a regular Brownian motion with a downward-reflecting barrier imposed at V(0). The average value of such a "reflected Brownian motion" is proportional to the standard deviation σ√t of an unrestricted Brownian motion with no absorbing or reflecting barriers. It follows that, for the integrate-and-fire neuron driven by noise alone, V(t) must be approximately proportional to σ√t for sufficiently large values of σ√t. We verified this informal reasoning by evaluating equation 4.7 numerically and plotting the function versus σ√t in Figure 3. As anticipated, the average value of the generator potential is a linearly decreasing function of the time-dependent standard deviation of the original Brownian motion process for t sufficiently large. This result further supports our claim that the integrate-and-fire neuron undergoes noise adaptation and quantifies the dependence of this noise adaptation on σ and t in the particular case in which α = µ(t) = 0.

The approximate equivalence between the probabilistic path structure of the generator potential of an integrate-and-fire neuron driven by noise alone and that of a reflected Brownian motion generalizes to the case of the leaky neuron, provided that the noise level is sufficiently large. In this case, the density function of the generator potential is approximately equivalent
to that of a reflected OUP at large t. In support of this claim, we note that the equivalence holds for large, negative values of V(t) and that, according to equation 4.5 (with µ = 0), large, negative values will be obtained at large adaptation times, provided that the asymptotic firing rate is sufficiently large. Recall that the asymptotic firing rate is an increasing function of the noise level; therefore, the equivalence must hold for sufficiently large values of σ. It follows that V(t) will be approximately equivalent to the average level of an OUP reflected at the origin when V(t) is large and negative. The average value of the reflected OUP is proportional to the standard deviation of the regular, unrestricted OUP (Tuckwell, 1988, 1989):

√(σ²(1 − e^(−2αt))/(2α)).

We have not been able to derive an expression for the probability density function of the generator potential in the most general case in which none of the parameters of the model is fixed; however, in the appendix, we derive expressions for the density function and the first two moments of the generator potential in the special case in which α = µ(t) = 0 and σ(t) = σ, a constant. There, the first moment is obtained by a method different from the one that led to equation 4.7.

5 Summary

We have investigated the effect of the reset mechanism of the stochastic integrate-and-fire neuron on the dynamics of both the average neural spiking rate and the statistics of the generator potential. The response of the neuron to a constant input with additive gaussian white noise was investigated in detail, under the assumption that the generator potential is unbounded in the negative range. By a combination of analytic and simulation methods, we showed that both leaky and nonleaky stochastic integrate-and-fire neurons produce a transient spiking response, the magnitude of which is positively related to the root-mean-square noise level of the input. In the specific case of the nonleaky neuron, it was proved that the noise response is dynamically and completely suppressed by a depression of the average level of the generator potential (i.e., hyperpolarization of the neuron) resulting from the neural reset mechanism. In a neuron with a membrane leak, the noise response is not completely suppressed by the reset mechanism, but it is nevertheless dynamically reduced by it. The reduction in the firing rate (measured by the difference between the peak transient and asymptotic spiking rates) is an increasing function of the input noise. We conclude that all integrate-and-fire neurons, leaky or not, for which the generator potential is unbounded in the negative domain adapt to the noise in their input; in the long run the adaptation is complete only in the case of the nonleaky neuron.
We derived a differential equation (see equation 4.3) that relates the rate of change in the mean level of the generator potential to the average neural firing rate. In the case of the nonleaky neuron, the average firing rate can be computed from an infinite series expression, and equation 4.3 can then be used to write an expression (see equation 4.6) for the time derivative of the mean value of the generator potential in terms of this series. We used this fact to study qualitatively the dynamic behavior of the mean generator potential. When the nonleaky neuron is driven by an input having a positive mean rate, the average potential first increases, then decreases. When the neuron is driven by gaussian white noise of mean zero, the average potential decreases over time monotonically, asymptotically tending to minus infinity in the limit of large t. For sufficiently large t, the average generator potential level in a nonleaky neuron driven by gaussian white noise of mean zero is a linearly decreasing function of the standard deviation of the time-integrated noise input. We presented an argument suggesting that this should also be the case in the model that includes a membrane leak.

In the specific case of the nonleaky neuron, we showed that the average amount of decrease in the generator potential (hyperpolarization) over the interval [0, t] is directly related to the average number of extra spikes generated during that interval over and above those that would be generated by a deterministic input current of the same intensity. In other words, the generator potential is inhibited on average by an amount that depends only on the number of neural spikes that can be attributed to noise, which increases with increasing noise magnitude. It follows that the dynamic reduction in the neural firing rate similarly depends only on the magnitude of the noise fluctuations in the input. This analysis supports our claim that the dynamic change in the average neural firing rate of an integrate-and-fire neuron in response to a stochastic input with constant parameters represents noise adaptation.

6 General Conclusions

The spike generation mechanism of stochastic integrate-and-fire neurons introduces an important nonlinearity into the operation of any circuit constructed of these units. In general, the consequences of this nonlinearity for the behavior of neural networks are not well understood. We have shown that the threshold nonlinearity produces a probabilistic neural adaptation that counteracts the tendency for an integrate-and-fire neuron to spike in response to noise fluctuations in its generator potential. This property of integrate-and-fire neurons might be useful in both artificial and biological neural networks, where it could function to reduce the number of false-positive responses of individual neurons within the network. We note in particular that noise adaptation could serve to reduce, or even totally eliminate, the effect of any constant internal (thermal) noise level within a network.
6 General Conclusions

The spike generation mechanism of stochastic integrate-and-fire neurons introduces an important nonlinearity into the operation of any circuit constructed of these units. In general, the consequences of this nonlinearity for the behavior of neural networks are not well understood. We have shown that the threshold nonlinearity produces a probabilistic neural adaptation that counteracts the tendency for an integrate-and-fire neuron to spike in response to noise fluctuations in its generator potential. This property of integrate-and-fire neurons might be useful in both artificial and biological neural networks, where it could function to reduce the number of false-positive responses of individual neurons within the network. We note in particular that noise adaptation could serve to reduce, or even totally eliminate, the effect of any constant internal (thermal) noise level within a network.
We envision that noise adaptation could be used to dynamically adjust the sensitivity of individual neurons within a network as either the internal noise or the input noise level changes on a time scale that is slow compared to the time scale over which modulations in the spiking rates are interpreted as meaningful signals. By increasing the average distance between the generator potential and the neural threshold as a function of the level of voltage noise, noise adaptation would tend dynamically to control the individual gains of the network neurons to compensate for both internal and external noise (Rudd & Brown, 1994, 1995, 1996, in press). Thus, the fact that integrate-and-fire neurons undergo noise adaptation also implies that they exhibit a stochastic gain control, which in a network context could function to keep the spiking rates of individual neurons operating within an optimal range, despite gradually changing neural noise levels. If the spike generation model were made somewhat more realistic by the additional assumption of an absolute refractory period, noise adaptation would also help to protect the neurons from response saturation due to noise-induced spiking activity.

We became interested in noise adaptation in integrate-and-fire neurons while carrying out simulation studies in which we examined the effect of the spike generation mechanism of retinal ganglion cells on the behavior of a retinal circuit that receives probabilistic light input (photons) (Rudd & Brown, 1993, 1994, 1995, 1996, in press). Because the number of photons falling on a small area of the retina within a small time interval obeys Poisson statistics, the mean and variance of the photon count are equal. In the context of our integrate-and-fire ganglion cell model, the noise adaptation mechanism described in this article results in a noise adaptation that depends on the level of the random fluctuations in the number of photons falling within the ganglion cell receptive field, and thus also on the mean light level. Noise adaptation in our retinal ganglion cell model therefore yields a type of light adaptation that is inherently stochastic and is thus distinct from the deterministic retinal light adaptation models that have been previously proposed. Our ganglion cell noise adaptation model accounts for a large body of light adaptation data, both physiological and psychophysical, that cannot be accounted for by deterministic light adaptation models. Furthermore, we have used the model to make predictions about the dependence of the apparent brightness of light flashes on the mean light level (Rudd & Brown, 1996) that have since been experimentally verified (Brown & Rudd, submitted).

The ubiquity of spike generation in the central nervous system suggests that similar noise adaptation may occur in other neural circuits. Because of the large number of neural subsystems to which our results might potentially be relevant, some discussion of the biological plausibility of the model is in order. Clearly, the intracellular potential of actual neurons cannot take on arbitrarily large negative values, as we have assumed here. However, it is probably not critical to assume that the generator potential can take on arbitrarily large, negative values in order to achieve significant amounts of
noise adaptation in integrate-and-fire neurons. The biological plausibility of the noise adaptation hypothesis appears to depend critically on whether reset-induced hyperpolarization in real neurons can build up to a sufficient degree to account for the levels of adaptation observed in vivo. In the context of our light adaptation work, we simulated the response of a neuron in which the maximum hyperpolarization state was limited by a hard saturation. In one such simulation, we imposed a lower bound of −15θ on the range of the generator potential without noticeably affecting the spiking response of the neuron when it was driven by gaussian white noise of mean zero. However, we have not experimented with variants of the hard saturation model in which the range of allowable hyperpolarization states is shallower, nor have we parametrically investigated the effects of varying the lower bound on the degree of noise adaptation.

In the theoretical neurobiology literature, an alternative version of the integrate-and-fire model, in which the generator potential is restricted to the range 0 ≤ V(t) ≤ θ by a reflecting barrier at zero, is often assumed. This boundary condition is in some sense the opposite of the assumption made in this article that there is no lower bound on the potential. Whether the lower bound on the potential is modeled as a reflecting barrier or a hard saturation, a biologically realistic integrate-and-fire model would allow for some range of negative potential states. Although restricting the negative potential states to a small range would not seem to allow much room for noise adaptation, it is difficult to estimate the amount of adaptation that should be expected from such a model without performing a complete probabilistic analysis, because the effect of hyperpolarization on the firing rate depends not just on the average potential level but also on the distribution of potential states. Perhaps even a small, negative potential range would allow for enough adaptation to account for a significant amount of the adaptation observed in actual neurons.

The noise adaptation properties of our integrate-and-fire model are relevant to a large body of contemporary simulation work in which networks are constructed of stochastic integrate-and-fire neurons. In many such networks, the individual neurons within the network receive a large number of combined excitatory and inhibitory inputs. By the central limit theorem, the inputs to the spike generation mechanisms of the network neurons, integrated over a sufficiently long time period, will be well approximated as gaussian random variables, and the model developed here will then apply to the individual neurons within the network. Unless lower bounds on the neural activation levels in such a network model are explicitly imposed by the modeler, the network neurons will undergo noise adaptation of the type that we have described. Caution should therefore be exercised in attributing any observed dynamical changes in the average firing rates of individual neurons within a network solely to global network effects. At the same time, we expect the coupling of neurons within such a network to encourage the noise adaptation to spread and be modified, thus producing global noise adaptation effects that cannot be easily anticipated on the basis of the analysis presented here.
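The hard saturation experiment described above is straightforward to reproduce in outline. The following sketch (our reconstruction, not the authors' code) clips the generator potential of a nonleaky model neuron at −15θ and compares early and late population spike rates; the transient rate should still exceed the adapted rate, showing that noise adaptation survives a bounded hyperpolarization range:

```python
import numpy as np

# Illustrative sketch (ours): nonleaky integrate-and-fire neuron driven by
# zero-mean gaussian white noise, with the generator potential clipped at
# the hard saturation -15*theta mentioned in the text.
theta, sigma, dt, n_paths, n_steps = 1.0, 1.0, 1e-3, 20_000, 20_000
rng = np.random.default_rng(1)
v = np.zeros(n_paths)
rate = np.empty(n_steps)
for i in range(n_steps):
    v += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    fired = v >= theta
    rate[i] = fired.mean() / dt          # population estimate of the spike rate
    v[fired] = 0.0                       # reset
    np.maximum(v, -15 * theta, out=v)    # hard lower bound on hyperpolarization
print("early rate:", rate[:1000].mean(), "late rate:", rate[-1000:].mean())
```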
Appendix

A.1 Density Function of the Generator Potential. Our goal is to obtain the probability density function of V(t) given the boundary condition V(0). Let p(V(t)) be the density function. The integral of this density with respect to the voltage-level parameter v is the cumulative distribution function of V(t), written P{V(t) ≤ v}. We will solve for P{V(t) ≤ v}, then take the derivative with respect to v to find p(V(t)). Here we assume that α = µ(t) = 0 and σ(t) = σ, a constant. When α = 0, there is no membrane leak, and the spike generation mechanism is driven by a regular Brownian motion process. Recall that equation 2.1 describes the evolution of the generator potential between spikes. With α = 0, this equation describes the time evolution of a Brownian motion process.

The probabilistic path of the generator potential thus consists of a sequence of Brownian motion paths strung together, one after another, with discontinuous breaks between the Brownian motion paths occurring at the reset times. Each time that the neuron resets, a new Brownian motion path is initiated at voltage level zero, and that particular Brownian motion path is terminated as soon as it exceeds the neural threshold. The first Brownian motion path in the sequence is initialized at the level V(0), which, in general, may be different from zero. Although every generator potential path consists of a sequence of Brownian motion paths, the overall generator potential path will not be a Brownian motion path unless no resets of the potential occur along the path. In general, the path structure of the generator potential includes the effects of resets, whereas that of Brownian motion does not. This difference is reflected in the two different stochastic differential equations that describe the time evolution of Brownian motion (see equation 2.1) and that of the generator potential (see equation 2.3). Because the two processes are subject to different probability laws, we need to distinguish them clearly in what follows. We will use B(t) to denote a regular Brownian motion path (with no resets and no termination on reaching threshold) and reserve V(t) to signify the generator potential path, which is reset to zero whenever it exceeds θ.

Let B(t) be a Brownian motion path that is initialized (by either a reset or the initial boundary conditions) at an arbitrary level v0 at any time s within the interval [0, t]. The probability that B(t) is below level v at time t is then given by the normal integral

$$\phi(v - v_0, t - s) \equiv P\{B(t) \le v \mid B(s) = v_0\} = \frac{1}{\sigma \sqrt{2\pi (t - s)}} \int_{-\infty}^{v} e^{-(x - v_0)^2 / (2\sigma^2 (t - s))}\, dx \qquad (A.1)$$
(Karlin & Taylor, 1975). We will use this result to compute the distribution function P{V(t) ≤ v} by noting that, at any arbitrary time t, V(t) will be a point on a Brownian motion path that was initialized at the time of the last reset (or at time zero if no reset has occurred). Therefore, P{V(t) ≤ v} will be a sum of probabilities of the form in equation A.1, each weighted by the conditional probability that V(t) is on a path initialized at (v0, s).

The generator potential paths for which V(t) ≤ v can be partitioned into two subgroups. The first group consists of those paths that exceed the neural threshold θ at least once in the interval [0, t]; the second group consists of those that do not. We may therefore write P{V(t) ≤ v} as the sum of two joint probabilities:

$$P\{V(t) \le v\} = P\{V(t) \le v,\ \text{no resets}\} + P\{V(t) \le v,\ \text{at least one reset}\}. \qquad (A.2)$$
Define the maximum height during the interval [s, t] of a Brownian motion path initialized at (v0, s) as

$$M(s, t) \equiv \max_{s \le u \le t}\, \{B(u) \mid B(s) = v_0\}.$$
The first term on the right side of equation A.2 can then be rewritten in terms of the Brownian motion process according to the law of total probability:

$$P\{V(t) \le v,\ \text{no resets}\} = P\{M(0, t) < \theta,\ B(t) \le v \mid B(0) = V(0)\}$$
$$= P\{B(t) \le v \mid B(0) = V(0)\} - P\{M(0, t) \ge \theta,\ B(t) \le v \mid B(0) = V(0)\}. \qquad (A.3)$$

By applying the well-known reflection principle (Karlin & Taylor, 1975, p. 345), it can be shown that

$$P\{M(0, t) \ge \theta,\ B(t) \le v \mid B(0) = V(0)\} = P\{B(t) \ge 2\theta - v \mid B(0) = V(0)\} = 1 - \phi(2\theta - v - V(0), t), \qquad (A.4)$$
the last step in equation A.4 being an application of equation A.1. Substituting equations A.1 (with v0 = V(0) and s = 0) and A.4 directly into equation A.3 then yields

$$P\{V(t) \le v,\ \text{no resets}\} = \phi(v - V(0), t) + \phi(2\theta - v - V(0), t) - 1. \qquad (A.5)$$

We next derive an expression for the second probability on the right-hand side of equation A.2. Note that for every generator potential path that contributes to this probability, there must be one and only one time in the interval [0, t] at which the potential was last reset. We can therefore assign a probability density to each possible last reset time. Let s ∈ [0, t] be the
time of the last reset. This is equivalent to saying that a reset occurs at s and no further resets occur in the interval [s, t]. We can therefore write the joint probability for the event {last reset at s, V(t) ≤ v} in the form P{V(t) ≤ v, no resets ∈ [s, t] | reset at s}. The probability that we seek will then be an integral over s of terms of this form, each weighted by the average number of resets dN(s) that occur within a small neighborhood of s. But

$$dN(s) = \frac{dN(s)}{ds}\, ds,$$
and, for the particular case of the nonleaky integrate-and-fire neuron considered here, dN(s)/ds can be written in terms of the infinite series in equation 3.4. We can therefore write the second probability on the right-hand side of equation A.2 in the form

$$P\{V(t) \le v,\ \text{at least one reset}\} = \int_0^t \sum_{N=1}^{\infty} f_N(s)\, P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid \text{reset at } s\}\, ds$$
$$= \int_0^t \sum_{N=1}^{\infty} f_N(s)\, P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid V(s) = 0\}\, ds. \qquad (A.6)$$
The last step in equation A.6 follows from the Markov property of Brownian motion: the probability of any given path over the interval [s, t] depends only on the initial boundary condition V(s) = 0 and not on how the generator potential got there. Note that paths that "accidentally" satisfy the condition of the final conditional probability (i.e., when no reset actually occurs at s) will not be given any weight by the infinite series weighting term, because all of the weight corresponds to paths that reset at s; so paths that accidentally cross the zero line at s will effectively be given a zero weighting. Furthermore, although there is no limit to the number of resets that can occur in the interval [0, t], all resets other than last resets are automatically discounted by the nature of the joint probability within the integrand of equation A.6. Since every path that leads to V(t) will have one and only one last reset time, each path will be counted once and only once, ensuring that the probability in equation A.6 will be properly normalized.

Continuing with our derivation of the generator potential density function, we next note that the following two expressions are equivalent:

$$P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid V(s) = 0\} = P\{V(t - s) \le v,\ \text{no resets} \in [0, t - s] \mid V(0) = 0\}, \qquad (A.7)$$
and that the probability on the right-hand side of equation A.7 is obtained simply by substituting 0 for V(0) and t − s for t in equation A.5, to get
$$P\{V(t - s) \le v,\ \text{no resets} \in [0, t - s] \mid V(0) = 0\} = \phi(v, t - s) + \phi(2\theta - v, t - s) - 1. \qquad (A.8)$$
Combining equations A.6 through A.8 then yields

$$P\{V(t) \le v,\ \text{at least one reset}\} = \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \bigl(\phi(v, t - s) + \phi(2\theta - v, t - s) - 1\bigr)\, ds. \qquad (A.9)$$
Finally, we substitute equations A.5 and A.9 into equation A.2 to obtain

$$P\{V(t) \le v\} = \phi(v - V(0), t) + \phi(2\theta - v - V(0), t) - 1 + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \bigl(\phi(v, t - s) + \phi(2\theta - v, t - s) - 1\bigr)\, ds. \qquad (A.10)$$
The density function of the generator potential is the derivative with respect to v of equation A.10:

$$p[V(t)] = \frac{1}{\sigma\sqrt{2\pi t}} \left( e^{-(v - V(0))^2/(2\sigma^2 t)} - e^{-(2\theta - v - V(0))^2/(2\sigma^2 t)} \right) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \frac{1}{\sigma\sqrt{2\pi (t - s)}} \left( e^{-v^2/(2\sigma^2 (t - s))} - e^{-(2\theta - v)^2/(2\sigma^2 (t - s))} \right) ds. \qquad (A.11)$$

Because the density function does not have a simple analytic form, its dependence on the parameters of the model can be understood most easily by numerically evaluating the function and plotting it for different parameter values. Since our overarching goal is to understand the mechanism of noise adaptation in the integrate-and-fire neuron, we will restrict our further analysis of equation A.11 to the task of finding its dependence on the root-mean-square noise magnitude σ and the adaptation time t. Note that wherever these two variables appear within the expression, they always appear together in the product σ²t, which is the time-integrated variance of the Brownian motion process that drives the neuron. It is therefore unnecessary to examine the dependence of the function on the two variables separately. In Figure 4, we plot equation A.11 for three different levels of the noise standard deviation σ√t (with V(0) = 0 and θ = 50). The plots indicate that the spread in the generator potential distribution increases with the standard deviation of the Brownian motion process that drives the neuron. Because the range of V(t) is bounded on the upper end by θ, large variability in V(t) is associated with a high probability of obtaining large, negative voltage values.
Figure 4: Examples of the probability density function (pdf) of the generator potential of a nonleaky integrate-and-fire neuron driven by gaussian white noise with mean zero and constant variance. The three pdfs correspond to three different standard deviations of the time-integrated noise in the Brownian motion process that drives the neuron (indicated on plot). Additional model parameters: θ = 50; α = 0; µ = 0; V(0) = 0.
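The qualitative behavior shown in Figure 4 is easy to reproduce by Monte Carlo simulation. The sketch below is ours; the noise levels are only representative values, since the levels used in Figure 4 are indicated on the plot rather than in the text. The empirical histograms can be compared against the analytic density in equation A.11:

```python
import numpy as np

# Monte Carlo sketch (ours): empirical distribution of V(t) for a nonleaky
# neuron (alpha = 0) driven by zero-mean gaussian white noise, for several
# hypothetical values of sigma*sqrt(t). Spread should grow with sigma*sqrt(t),
# with mass piling up at increasingly negative voltages below theta.
theta, t, dt, n_paths = 50.0, 1.0, 1e-3, 100_000
rng = np.random.default_rng(2)
for sigma_sqrt_t in (20.0, 50.0, 100.0):   # hypothetical noise levels
    sigma = sigma_sqrt_t / np.sqrt(t)
    v = np.zeros(n_paths)
    for _ in range(int(t / dt)):
        v += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        v[v >= theta] = 0.0                # reset at threshold
    hist, edges = np.histogram(v, bins=200, density=True)  # cf. Figure 4
    print(sigma_sqrt_t, float(v.mean()), float(v.std()))
```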
A.2 The First and Second Moments of the Generator Potential Distribution. The first and second moments of the generator potential density (see equation A.11) can be calculated in a straightforward manner using integral calculus. The first moment is

$$\overline{V(t)}_{\mu=0} = V(0) - \theta \left[ \mathrm{Erfc}\!\left( \frac{\theta - V(0)}{\sigma\sqrt{2t}} \right) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \mathrm{Erfc}\!\left( \frac{\theta}{\sigma\sqrt{2(t - s)}} \right) ds \right], \qquad (A.12)$$

where Erfc denotes the complementary error function, Erfc = 1 − Erf, and the error function Erf is defined in the usual way:

$$\mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-z^2}\, dz$$

(Gradshteyn & Ryzhik, 1980, p. 930). Recall that we already have an expression for the mean of V(t) (see equation 4.7), which was derived on the basis of the stochastic differential equation approach. Setting µ = 0 in that expression should produce an expression that is equivalent to equation A.12. The two expressions are equal if the following general relation holds:

$$\int_0^t R(s)\, \mathrm{Erf}\!\left( \frac{k}{\sqrt{t - s}} \right) ds = \mathrm{Erfc}\!\left( \frac{k}{\sqrt{t}} \right), \qquad (A.13)$$

where we have defined

$$R(s) = \sum_{N=1}^{\infty} \frac{Nk}{\sqrt{\pi s^3}}\, e^{-N^2 k^2 / s}.$$

We have checked equation A.13 numerically using Gauss-Legendre 100-point quadrature over several orders of magnitude of t and k and have found excellent agreement; however, we have not been able to prove the equality.
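A numerical check of the kind described above can be written compactly; the following sketch (ours, not the authors' code) evaluates both sides of equation A.13 with 100-point Gauss-Legendre quadrature, truncating the series for R(s) at a finite number of terms (harmless, because of the exp(−N²k²/s) factor):

```python
import numpy as np
from scipy.special import erf, erfc

def R(s, k, n_terms=200):
    # Truncated series for R(s); converges rapidly in N for s in (0, t).
    n = np.arange(1, n_terms + 1)[:, None]
    return np.sum(n * k / np.sqrt(np.pi * s**3) * np.exp(-n**2 * k**2 / s), axis=0)

def check_a13(t, k, n_nodes=100):
    x, w = np.polynomial.legendre.leggauss(n_nodes)  # nodes/weights on [-1, 1]
    s = 0.5 * t * (x + 1.0)                          # map nodes to (0, t)
    lhs = 0.5 * t * np.sum(w * R(s, k) * erf(k / np.sqrt(t - s)))
    rhs = erfc(k / np.sqrt(t))
    return lhs, rhs

for t, k in [(1.0, 0.5), (10.0, 1.0), (100.0, 2.0)]:
    print((t, k), check_a13(t, k))   # the two values should agree closely
```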
To write the second moment, we first define

$$\psi(t, v_0) \equiv 2\theta v_0 - 2\theta^2 + \sigma\sqrt{\frac{2t}{\pi}}\, e^{-(\theta - v_0)^2/(2\sigma^2 t)}\, (\theta + v_0) + \mathrm{Erf}\!\left( \frac{\theta - v_0}{\sigma\sqrt{2t}} \right) \left( v_0^2 + \sigma^2 t + 2\theta^2 - 2\theta v_0 \right), \qquad (A.14)$$

then write

$$\overline{V^2(t)}_{\mu=0} = \psi(t, V(0)) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \psi(t - s, 0)\, ds. \qquad (A.15)$$
References

Brown, L. G., & Rudd, M. E. Evidence for a noise-gain control mechanism in human vision. Manuscript submitted for publication.
Calvin, W. H., & Stevens, C. F. (1965). A Markov process for neuron behavior in the interspike interval. In Proceedings of the 18th Annual Conference on Engineering in Medicine and Biology (Vol. 7, p. 118).
Capocelli, R. M., & Ricciardi, L. M. (1971). Diffusion approximation and first passage time problem for a model neuron. Kybernetik 8:214–233.
Chapeau-Blondeau, F., Godivier, X., & Chambet, N. (1996). Stochastic resonance in a neuron model that transmits spike trains. Physical Review E 53:1273–1275.
Geisler, C. D., & Goldberg, J. M. (1966). A stochastic model of the repetitive activity of neurons. Biophysical Journal 6:53–69.
Gerstein, G. L. (1962). Mathematical models for the all-or-none activity of some neurons. Institute of Radio Engineers Transactions on Information Theory 8:137–143.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal 4:41–68.
Gluss, B. (1967). A model for neuron firing with exponential decay of potential resulting in diffusion equations for probability density. Bulletin of Mathematical Biophysics 29:233–243.
Gradshteyn, I. S., & Ryzhik, I. M. (1980). Table of integrals, series, and products. New York: Academic Press.
Holden, A. V. (1976). Models of the stochastic activity of neurons. Berlin: Springer-Verlag.
Johannesma, P. I. M. (1968). Diffusion models for the stochastic activity of neurons. In E. R. Caianiello (Ed.), Neural networks. Berlin: Springer-Verlag.
Karlin, S., & Taylor, H. M. (1975). A first course in stochastic processes. New York: Academic Press.
Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process: A model for neuronal activity. Biological Cybernetics 35:1–9.
Roy, B. K., & Smith, D. R. (1969). Analysis of the exponential decay model of the neuron showing frequency threshold effects. Bulletin of Mathematical Biophysics 31:341–357.
Rudd, M. E. (1996). A neural timing model of visual threshold. Journal of Mathematical Psychology 40:1–29.
Rudd, M. E., & Brown, L. G. (1993). A stochastic neural model of light adaptation. Investigative Ophthalmology and Visual Science 34:1036.
Rudd, M. E., & Brown, L. G. (1994). Light adaptation without gain control: A stochastic neural model. Investigative Ophthalmology and Visual Science 35:1837.
Rudd, M. E., & Brown, L. G. (1995). Temporal sensitivity and threshold vs. intensity behavior of a photon noise-modulated gain control model. Investigative Ophthalmology and Visual Science 35:S16.
Rudd, M. E., & Brown, L. G. (1996). Stochastic retinal mechanisms of light adaptation and gain control. Spatial Vision 10:125–148.
Rudd, M. E., & Brown, L. G. (In press). A model of Weber and noise gain control in the retina of the toad Bufo marinus. Vision Research.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal 5:173–194.
Stein, R. B. (1967). Some models of neuronal variability. Biophysical Journal 7:37–68.
Thomas, M. U. (1975). Some mean first passage time approximations for the Ornstein-Uhlenbeck process. Journal of Applied Probability 12:600–604.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: SIAM.
Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of Brownian motion. Physical Review 36:823–841.
Wan, F. Y. M., & Tuckwell, H. C. (1982). Neuronal firing and input variability. Journal of Theoretical Neurobiology 1:197–218.
Received January 16, 1996; accepted October 21, 1996.
Communicated by Mikhail Tsodyks
Paradigmatic Working Memory (Attractor) Cell in IT Cortex

Daniel J. Amit
Stefano Fusi
Racah Institute of Physics, Hebrew University, Jerusalem, and INFN, Istituto di Fisica, Università di Roma, Italy

Volodya Yakovlev
Department of Neurobiology, Hebrew University, Jerusalem, Israel
We discuss paradigmatic properties of the activity of single cells comprising an attractor—a developed stable delay activity distribution. To demonstrate these properties and a methodology for measuring their values, we present a detailed account of the spike activity recorded from a single cell in the inferotemporal cortex of a monkey performing a delayed match-to-sample (DMS) task of visual images. In particular, we discuss and exemplify (1) the relation between spontaneous activity and activity immediately preceding the first stimulus in each trial during a series of DMS trials, (2) the effect on the visual response (i.e., activity during stimulation) of stimulus degradation (moving in the space of IT afferents), (3) the behavior of the delay activity (i.e., activity following visual stimulation) under stimulus degradation (attractor dynamics and the basin of attraction), and (4) the propagation of information between trials—the vehicle for the formation of (contextual) correlations by learning a fixed stimulus sequence (Miyashita, 1988). In the process of the discussion and demonstration, we expose effective tools for the identification and characterization of attractor dynamics.1

1 A color version of this article is found on the Web at: http://www.fiz.huji.ac.il/staff/acc/faculty/damita

1 Introduction

1.1 Delay Activity in Experiment. A number of cortical areas, such as the inferotemporal cortex (IT) and prefrontal cortex, have been suggested as part of the working memory system (Fuster, 1995; Miyashita & Chang, 1988; Nakamura & Kubota, 1995; Wilson, Scalaidhe, & Goldman-Rakic, 1993). The phenomenon is detected in primates trained to perform a delayed match-to-sample (DMS) task or a delayed eye-movement (DEM) task, using, in each case, a relatively large set of stimuli. It is observed in single-unit extracellular recordings of spikes following, rather than during, the presentation of a
sensory stimulus. In these tasks, the behaving monkey must remember the identity or location of an initial eliciting stimulus in order to decide on its behavioral response following a second stimulus. This latter test stimulus is, with equal likelihood, identical to or different from the first stimulus in the DMS task and is simply a "go" signal in the DEM task. In these areas of the cortex, neurons are observed, in rather compact regions of the same area (called modules or columns), to have reproducible elevated spike rates during the delay period, after the first, eliciting stimulus has been removed.2 Such elevated rate distributions have been observed to persist for long times (compared to neural time constants)—as long as 30 seconds. The rates observed are in the range of about 10–20 spikes per second (s⁻¹), against a background of spontaneous activity of a few s⁻¹. The subset of neurons that sustain elevated rates, in the absence of any stimulus, is selective of the preceding, first, or sample stimulus. Thus, the distribution of delay activity could act as a neural representation or memory of the identity of the eliciting stimulus, transmitted for later processing in the absence of the stimulus.

2 A similar phenomenon has been observed in the motor cortex by Georgopoulos (private communication).

Moreover, in one experiment in which training was carried out with eliciting stimuli presented in a fixed order, it was observed that though the stimuli were originally generated to be uncorrelated, the resulting learned delay activity distributions (DADs), corresponding to the uncorrelated stimuli, were themselves correlated (Miyashita & Chang, 1988). That is, there was an elevated probability that the same neuron would respond with an elevated delay activity for two stimuli that were close in the sequence. In fact, the magnitude of the correlations expressed the temporal separation of the stimuli in the training sequence: the closer two stimuli were in the training sequence, the higher the correlation of the corresponding two DADs. (For a detailed discussion, see Amit, 1994, 1995; and Amit, Brunel, & Tsodyks, 1994.)

1.2 The Attractor Picture. These experimental findings have been interpreted as an expression of local attractor dynamics in the cortical module: a comprehensive picture has been suggested (Amit, 1994, 1995; Amit et al., 1994) that connects the DAD to the recall of a memory into a working status and views the DAD as a collective phenomenon sustained by a synaptic matrix formed in the process of learning. The collective aspect is expressed in the mechanism by which a stimulus that had been learned previously has left a synaptic engram of potentiated excitatory synapses connecting the cells driven by the stimulus. When this subset of cells is reactivated, they cooperate to maintain elevated firing rates in the selective set of neurons, via the same set of potentiated synapses, after the stimulus is removed. In this way, these cells can provide each other, that is, each one of the group, with an afferent signal that clearly differentiates the members of the group from
other cells in the same cortical module. Theoretical and simulation studies have demonstrated that this signal may be clear enough to overcome the noise generated by the spontaneous activity of all other cells, provided that not too many stimuli have been encoded into the synaptic system of the module. A large number of "correct" neurons must collaborate to sustain the pattern. A rough estimate is that about 1000–2000 cells would collaborate in a given DAD (Brunel, 1994), out of some 100,000 cells in a column of 1 mm². There would be considerable overlap between DADs.

The collective nature of the DAD is related to its attractor property. Since so many cells collaborate, the DAD is robust to stimulus "errors." That is, even if a few of the cells belonging to the self-maintaining group are absent from the group initially driven by a particular test stimulus, or if some of them are driven at the "wrong" firing rates, once the stimulus is removed, the synaptic structure will reconstruct the "nearest" distribution of elevated activities of those it had learned to sustain. The reconstruction (the attraction) will succeed provided the deviation from the learned template is not too large. All stimuli leading to the same DAD drive activities in the basin of attraction of that attractor. Each of the stimuli presented repeatedly during training creates an attractor with its own basin of attraction (see Amit & Brunel, 1995). In addition, the same module can have all neurons at spontaneous activity levels, if the stimulus has driven too few of the neurons belonging to any of the stimuli previously learned.

1.3 Attractors and Learning Correlations. According to the attractor picture, attractors are established through the learning process. In experiments in which monkeys performed a DMS task using as many as 100 stochastically generated visual stimuli, selective delay activities were formed for all images repeatedly used in the training phase. This finding supports a dynamic interpretation for the delay activity and confirms the presence of a learning process, since it is rather unlikely that synaptic structures related to these stimuli had been generated prior to training. In fact, no associated delay activities were found for new images—not used during training—but introduced subsequently in the testing stage (Miyashita & Chang, 1988). Moreover, the correlations between DADs for sequentially appearing stimuli in the fixed training order, corresponding to uncorrelated images, are even more directly related to learning, inasmuch as these correlations are dependent on the particular fixed order in which the stimuli appeared in the particular training protocol.

Learning these correlations presents a puzzle that is naturally resolved in the attractor picture: How does the information contained in one image (used as the first, eliciting, stimulus in one trial) propagate, in the absence of the image, until the next image in the sequence is presented, several seconds later, as the first stimulus in the next trial? Attractors provide a natural vehicle for the propagation of this information (Griniasty, Tsodyks, & Amit, 1993; Amit et al., 1994; Amit, 1995), as follows: During training, there are initially no delay activity attractors. If
the stimuli in the training set are uncorrelated, as DADs are first formed they are necessarily uncorrelated because there is no way to communicate information between successive stimuli. Once the uncorrelated DADs have been formed, however, these DADs themselves may be used to propagate information between trials. Recall that in half of the trials, the second stimulus is identical to the first (so as to maintain an unbiased probability of same and different trials). This second stimulus may also leave behind a delay activity distribution, during the intertrial interval, identical to the DAD excited by the first stimulus, during the interstimulus interval of the trial. If this delay activity is not disturbed by the motor response of the animal (and other unmonitored visual or nonvisual events in the intertrial interval), then it may persist through the intertrial interval and will be present in the network when the next image in the sequence is presented as the first stimulus in the successive trial. This delay activity will be present in the sense that there will be a time window in which the cells of the delay activity are still at elevated rates, while the cells corresponding to the new stimulus begin to be driven. Information is then available for learning the correlation of the two (Brunel, 1996).
1.4 Experimental Questions. In this study, we outline an experimental program. We show recordings, and results of their analysis, from a paradigmatic cell—an IT neuron with a clear delay activity, which we were successful in holding for a long, stable recording session, sufficient for a detailed series of tests. Given the very detailed recordings of selective delay activities reported by Miyashita and Chang (1988); Miyashita (1988); Sakai and Miyashita (1991); Wilson et al. (1993); and Nakamura and Kubota (1995), on the one hand, and the detailed picture of attractor networks (Amit, 1994, 1995; Amit et al., 1994) on the other, one naturally asks what would be expected of a cell that participates in an attractor net. We ask the question in this article, and describe in detail the expected answer by demonstrating the results from this particularly rich paradigmatic neuron. Our goal is not to present data that give a definitive answer to this question of whether IT delay activity reflects an attractor network serving stimulus image memory. For that we would have to record and analyze many cells from more than one monkey. We wish instead to raise the issues quantitatively and point out, by considering in detail data from a single example, what experiments need to be carried out and what behavior one must expect from neurons participating in an attractor neural net. The central issue with which we deal is neuronal behavior following presentation of a degraded stimulus. This is because the attractor picture naturally implies basins of attraction for every attractor. That is, it predicts that although the input to IT will be different for stimuli of different degrees of degradation, the attractor, that is, the delay activity distribution within IT, will always be the same, as long as the degraded stimulus is not too far from
the original. Thus, the attractor theory predicts very different characteristics for the response of IT neurons during stimulation (reflecting the response driven by the input to IT) and the response following stimulation (reflecting the attractor state).

Testing the attractor picture empirically raises a set of issues:

1. IT observability. Are the stimuli used in these experiments identifiable at the level of the relevant cortical region, namely, IT? Do these stimuli induce neuronal responses in the recorded area that are clearly higher than spontaneous, as well as higher than the delay activities? Are these responses different for the different stimuli?

2. Significance of prestimulus activity. Interstimulus interval (ISI) activity is identified in relation to activity immediately preceding the first stimulus of a trial. Is the latter merely spontaneous activity, or could it be delay activity following the second stimulus in the preceding trial? The question is particularly pertinent when trials are organized in series.

3. Moving in the space of IT-observable stimuli. What is the effect of stimulus degradation, for example, using degraded or noisy stimuli? What is the effect on visual responses (during the stimulus) and on attractor dynamics, as expressed in a given cell?

4. Basin of attraction. Does motion in IT-observable space bring about an abrupt change in delay activity?

5. Information transport between consecutive trials. Can delay activity act as the agent for the generation of correlations between sequentially appearing stimuli, as in Griniasty et al. (1993) and Amit et al. (1994)?

2 Methods

2.1 Preparation, Stimuli, and Task. A rhesus monkey (Macaca mulatta) was trained to perform a DMS task on 30 stochastically generated images, 15 fractals and 15 Fourier descriptors; much as in Miyashita (1988) and Miyashita, Higuchi, Sakai, and Masui (1991), the two types were randomly intermixed. Three examples of the stimuli are shown in Figure 1: from left to right, two fractals and a Fourier descriptor. The left two images, the fractals, are the ones we used in recording most of the data reported here.

The behavioral paradigm was as follows. Following the appearance of a flickering fixation point on the monitor screen in front of the monkey, the monkey was to lower a lever and keep it in the down position for the duration of the trial (see Figure 2). After the lever was lowered and following
Figure 1: Three of the 30 color images used in the DMS experiments. (Left) Fractal stimulus, which induced the clearest delay activity for the neuron of Figure 3 (number 5 in the figure) and is therefore called the best stimulus for this neuron. (Center) Fractal stimulus that induced delay activity indistinguishable from the average prestimulus activity, as demonstrated in Figure 3 (number 17 in the figure), and is therefore called the worst stimulus for this neuron. (Right) A Fourier descriptor type stimulus, which induced response 14 in Figure 3.
a 1000 ms delay, the first (eliciting, or sample) stimulus was presented on the screen for 500 ms. The ISI was 5 seconds. Following this delay, the second (test) stimulus was shown, also for 500 ms. This second test stimulus was the same as the first sample stimulus in half of the trials and different in the other half (chosen randomly from the other 14 stimuli of its type, fractal or Fourier descriptor). After the second stimulus was turned off, the monkey had to shift the bar left or right, depending on whether the second stimulus was the same as the first or different, respectively, and then, when the fixation point stopped flickering, to release the bar. Correct responses were rewarded by a squirt of fruit juice (see Figure 2). The monkey had been trained to a performance level of not less than 80 percent correct responses when the reported experiment took place.

2.2 Recording. We recorded spike activity extracellularly from single neurons of the inferotemporal cortex. The experiment is still in progress, so we cannot yet give the histological verification of the recorded cell. However, the coordinates of the position of the guide tube were A14 L15 during vertical penetration into the ventral cortex. The experimental procedures and care of the monkey conformed to guidelines established by the National Institutes of Health for the care and use of laboratory animals.

After a cell was isolated by a six-channel spike sorter (Multi-Spike Detector; ALPHA OMEGA Engineering, Nazareth, Israel), the experiment proceeded in two stages. During the first stage, all 30 stimuli were presented one to three times. Average PST (peri-stimulus) response histograms were produced online, as demonstrated in Figure 3. From these, the best and worst stimuli were determined by eye.
Figure 2: Schematic sequence of events in each phase of the DMS task.
The best stimulus was that which evoked the clearest excess of the average rate in the ISI over the average rate in the period immediately preceding the first stimulus (the prestimulus rate): stimulus 5 in Figure 3, corresponding to Figure 1, left. The worst stimulus was selected as one that produced an overall weak response and an indistinct difference between the prestimulus and delay-period firing rates: stimulus 17 in Figure 3, corresponding to Figure 1, center. At this stage, the prestimulus firing rate was interpreted naively as the average activity in a 1.0 s interval prior to the presentation of the trial's first stimulus.

For the second stage of the experiment, a new image set was created from the chosen best and worst stimuli. The red-green-blue (RGB) pixel map of each image was degraded by superimposing uniform noise at one of four levels. That is, we generated from each of these two images a set of five images: the pure, original one (referred to as degradation 0) and four degraded images, each at an increasing level of degradation (levels 1–4). The DMS task was then resumed, using as the first (sample) stimulus a randomly selected image from the new set of 10 (original and degraded) images. The second, test stimulus was randomly selected as an undegraded version of either the best or the worst stimulus. In Figure 4, we reproduce the best stimulus and degraded versions of this image at each of the four degradation levels used in this second stage of the experiment.

2.3 Mean Rate Correlation Coefficient. In order to test various hypotheses concerning the structure and origin of persistent activities, we introduce a coefficient measuring the correlation of the trial-by-trial fluctuations of the mean firing rates—or spike counts—in two different intervals. The mean rate correlation (MRC) is defined as follows: xn and yn (n = 1, . . . , Nst, the total number of selected trials) are, respectively, the spike counts in two
Figure 3: Spike rate histograms (full scale = 50 s⁻¹) for a small number of trials of the entire set of 30 color stimuli, used online for the selection of the best and worst stimuli. Each window is divided into six temporal intervals; the neuron's activity is binned and averaged over the different trials with the same stimulus. From left: prestimulus interval (66 ms per bin for the 1000 ms just prior to the first stimulus of the trial); first, or sample, stimulus period (33 ms per bin for 500 ms); delay (ISI) interval (125 ms per bin for 5 s); second, or test, stimulus period (showing the response for the half of the trials in which the second stimulus was the same as the first; 33 ms per bin for 500 ms); post–second stimulus interval (showing the activity when the second stimulus was the same as the first; 66 ms per bin for 1000 ms). Finally, we show the response to the second stimulus of other trials (in which the first stimulus was not that of this window but the second stimulus was, so that the trial was a nonmatch trial; 33 ms per bin for 500 ms). The horizontal line over each interval is the average rate in the interval. On the right of every window are the stimulus number; the number of trials with this image as first stimulus (f); and the number of trials with the same stimulus as second stimulus in the cases of same (s) and different (d) trials. Best is stimulus 5, the one evoking the highest excess of the average rate in the ISI over the average prestimulus rate; worst is stimulus 17, the one having an overall weak response. Vertical ticks are 10 spikes/s.
Figure 4: Image degradation (in color) as gradual motion in stimulus space for testing the attractor dynamics of delay activity. From left to right: pure best image (0 noise), then four levels of noise on RGB map.
intervals in the nth trial. Then

$$\mathrm{MRC} = \frac{\langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle}{\sqrt{(\langle x^2 \rangle - \langle x \rangle^2)(\langle y^2 \rangle - \langle y \rangle^2)}}, \qquad (2.1)$$

where ⟨· · ·⟩ denotes the estimate of the expectation of a random variable performed over all the selected trials, that is,

$$\langle x \rangle = \frac{1}{N_{st}} \sum_{i=1}^{N_{st}} x_i.$$
This coefficient measures the extent to which the deviations from the mean of the spike counts in the two intervals are correlated over the selected set of trials. With this tool, we monitor whether high rates during visual stimulus presentation imply high rates in the delay period and low imply low; whether high (low) counts at the beginning of a delay period imply high (low) counts in later intervals in the delay; and whether high (low) counts following a given second stimulus are related to high (low) counts in the period immediately preceding the first stimulus of the succeeding trial, its prestimulus interval. The value of the MRC is in the interval [−1, 1]. The standard of comparison for the magnitude of this coefficient will be established by considering subclasses of trials in which the average rates in the two intervals are both high or both low. For instance, the MRC of the subset of responses to the worst stimulus is calculated by performing the expectation of equation 2.1 over the trials in which the worst stimulus was presented as a sample. In this case, if the fluctuations about the mean spike count are random and hence uncorrelated, the resulting MRC values set the scale for lack of correlation. As we shall see, large values of MRC can capture two different phenomena: (1) trial-by-trial correlation of the fluctuation of the individual counts about their mean and (2) the existence of subsets of trials with very different mean counts in each, while within each subset, fluctuations about the mean may even be uncorrelated.
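For concreteness, a minimal implementation of equation 2.1 follows (our sketch), together with a synthetic illustration of the second phenomenon: two subsets of trials with different mean counts, uncorrelated within each subset, still produce a large pooled MRC (the counts below are hypothetical):

```python
import numpy as np

def mrc(x, y):
    """Mean rate correlation (equation 2.1): Pearson correlation of the
    trial-by-trial spike counts x and y in two intervals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

rng = np.random.default_rng(3)
# Two subpopulations of trials with different mean counts, independent draws
# within each subset (36 "elevated" trials, 38 "low" trials).
x = np.concatenate([rng.poisson(15, 36), rng.poisson(6, 38)])
y = np.concatenate([rng.poisson(14, 36), rng.poisson(5, 38)])
print(mrc(x, y))                                  # large, driven by the subset means
print(mrc(x[:36], y[:36]), mrc(x[36:], y[36:]))   # near zero within each subset
```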
3 Experimental Testing and the Special Cell

The issues raised in section 1.4 will now be discussed in the same order in terms of the responses of a single IT cell. The intention is not to claim that the data presented here give a definite answer to these questions. That would require the recording and analysis of many cells, traditionally in more than one monkey. The objective, rather, is to raise the questions, the problems, and the potential answers by considering in detail the record of a particularly rich cell that exhibits behavior in relation to which we may pose most of the problems and demonstrate the form answers may take. Following continued recording from numerous neurons in this and another monkey, we will be able to report whether this area of IT may in fact serve this memory function for DMS tasks. Alternatively, we may report that too few IT neurons have the expected significant DAD activity to be able to serve this function under the conditions used in our experiment, and it would be best to look for such DADs elsewhere. Finally, we may find that delay activity is present in IT but does not generally have the properties we expect (and which are reported here for this single cell), so that our understanding of cortical dynamics should best be revised.

3.1 Identification of the Stimulus. Delay activity distributions are likely to be sustained by the synaptic structure, learned or preexisting. The fact that the delay activities are selective to the preceding stimulus implies that the synaptic structure can sustain a variety of distributions in the absence of a stimulus. Which of the stable delay activities is actually propagating depends on which was the last effective stimulus. But different stimuli presented on the monitor screen may not actually lead to different neuronal responses in deep structures (higher cortical areas) such as the IT cortex, since the differences may well be filtered out at afferent levels. It is the representation of the stimulus as it affects the cells of IT that is relevant for which of the DADs is excited or learned. Thus, for example, it may be that differences of scale or color in the external image may result in very similar distributions of afferents when arriving at the IT cortex. In that case, they cannot elicit different DADs. In other words, stimuli and stimulus differences must be IT observable. (For some recent work on observability of stimulus variation in IT, see Kovacs, Vogels, & Orban, 1995.)

What can give rise to different DADs are significantly different distributions of synaptic inputs that arrive from previous cortical areas to IT. These distributions, in turn, can be observed by recording the neuronal activity in the attractor network during presentation of the stimulus. In fact, the IT cortex is particularly propitious in this respect because rates observed during the presence of the stimulus in visually responsive cells that participate in a DAD are much higher than those observed for the same cells during interstimulus intervals of the trial. (See, e.g., Miyashita & Chang (1988), and the responses in the underlined intervals in the histograms for the best
stimulus in Figure 5.) This allows a very clear identification of the stimulus, for comparison with the attractor. In other cortical areas, such as the prefrontal cortex, this may not be as clear (see, e.g., Wilson et al., 1993; and Williams & Goldman-Rakic, 1995).

The range of responses during the stimulus and during the ISI, as in Figures 3 and 5, exhibits IT observability of the stimuli as well as of their differences. Thus, the stimuli used for this experiment, the 30 fractals and Fourier descriptors, are IT observable in that the responses to some of them are much higher than the background spontaneous rates, as well as higher than the rates during the delay period (ISI). This difference from background will be measured quantitatively below. In addition, it is clear that our neuron also differentiates between stimuli (e.g., between the best and the worst stimuli) during and following the stimulation.
3.2 Delay Activity and Spontaneous Activity. It is quite common to consider the ongoing activity before the first stimulus (the prestimulus activity) in a given cell as spontaneous activity. Delay activity is considered significant if its rate is significantly higher than the prestimulus rate. In this way, one observes in Figure 5, in the first three windows on the left of the best stimulus row, that the average rate (the horizontal line) during the delay (the last, wide interval) is higher than in the prestimulus interval of the current trial (see the numbers in the table below the histograms). The two rates become equal in the two windows on the right (see section 4). In the leftmost window of the best row of histograms, corresponding to the undegraded best stimulus, the ratio of ISI to prestimulus rates is 13/8, which may not be considered very convincing as a signal-to-noise ratio.

But in a context in which delay activities exist, prestimulus rates are not necessarily spontaneous. For example, if the second stimulus in a trial evokes a DAD when presented as a first stimulus, and if the neuron being observed has an elevated rate in this DAD, then one would expect that following the second stimulus, this neuron would have an elevated rate. In fact, this is observed in Figure 6, where we plot histograms of spike rates that follow presentation and removal of the best (top) and the worst (bottom) stimuli as second stimulus. These histograms are averages over all trials at all levels of first stimulus degradation, since the second stimulus is always undegraded. What one observes is that following the best second stimulus, the level of the post–second-stimulus persistent rate is as high as the delay activity for the first three degradation levels (0–2) in Figure 5. This is true even when the first stimulus, due to its degradation, does not provoke significant delay activity. It is consistent with the fact that this persistent activity is provoked by the second stimulus, which is never degraded.

Another expression of the same fact is presented in Figure 7. These are the trial-by-trial distributions of the post–second stimulus average rate for best
Figure 5: Testing attractor dynamics: spike rate histograms (rate scale 10 s⁻¹, bins 50 ms) for the best (top row) and worst (second row) first stimulus, each shown undegraded and at four levels of degradation. Each histogram is divided into three intervals. From left: prestimulus interval of the current trial, first stimulus, delay interval. The horizontal line over each interval is the average rate in the interval. The average rate for each interval, with its standard error (in s⁻¹), is reported in the table below the histograms. The delay interval (ISI) has been divided into five parts of 1 s each; the mean rate over the entire ISI is in the last column on the right. For both the best and the worst first stimulus, one observes a variation of the rate during stimulation (the visual response) as a function of the level of degradation. For the first three windows in the "best" row, there is significant, invariant delay activity. At the last two levels of degradation, the delay activity drops and becomes indistinguishable from the delay rate for the "worst" stimuli. The rate in the first second of the ISI, following a strong visual response, is significantly higher than the mean rate of the delay interval. This is due to the latency of the response to the external stimulus. (See section 3.5.)
Figure 6: Testing information transmission across trials: spike rate histograms for the best and worst second stimulus and the corresponding table of mean rates (as in Figure 5). The histogram windows are divided into four intervals: second stimulus (underlined), post–second stimulus, central part of the intertrial interval (not shown), and the prestimulus interval of the successive trial. Note that the prestimulus activity of the trial following the best second stimulus is significantly higher than that following the worst second stimulus. It carries the information about the second stimulus of the previous trial (see the text and Figure 7).
and worst stimuli (plotted above and below the axis, respectively, though both are, of course, positive valued).3 The distributions differ significantly according to the T-test (P < 10⁻⁵). If this elevated rate were to continue beyond the monkey's behavioral response, it would arrive at the following trial as an elevated rate and appear
3 In calculating the average poststimulus rate, we take an interval starting 250 ms after the second stimulus is turned off, to exclude latency of response to the stimulus. (See section 3.5.)
Figure 7: Trial-by-trial rate distributions of the post–second stimulus rate (left, in s⁻¹) and the pre–first stimulus rate of the next trial (right) for the best (continuous line) and worst (dashed line, plotted downward for clarity) second stimulus. The two distributions differ significantly in the second case as well, carrying the information about the second stimulus of the preceding trial. The poststimulus interval is on average 710 ms long, starting 250 ms after the stimulus, to avoid latency effects.
as an elevated prestimulus activity before the next trial. There are previously reported indications that delay activity does in fact survive motor reaction events (Nakamura & Kubota, 1995), though here there is a novelty in that the postreaction activity is not motivated by the behavioral paradigm. As a test of this possibility, we separated the average prestimulus rates over the entire set of trials (N = 123 trials) into two sets, according to the preceding second stimulus (see Figure 6). It turns out that the average prestimulus rate in the group of trials following trials in which the second stimulus was the best stimulus (N = 62) is 8.6 s⁻¹, while for those following trials in which the second stimulus was the worst stimulus (N = 61), the average prestimulus rate is 5.7 s⁻¹. (This difference is significant according to the T-test, P < 10⁻⁵.) Yet another test of this effect will be discussed in section 3.6.

A simple tool suggests itself for improved online monitoring of the relevant prestimulus activity: to reduce (though not fully eliminate) the effect of a persistent post–second stimulus high rate, one should consider the average prestimulus activity over all trials, including all stimuli. This will help, provided the number of DADs in which the cell under observation participates is relatively low. That would be the typical case.4 (See also Miyashita & Chang, 1988.)

4 We are indebted to Nicolas Brunel for this observation.
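The split-and-test procedure above is easy to reproduce; the following sketch uses synthetic trial rates (the per-trial data are not published), with the group means set to the reported values and an arbitrary spread:

```python
import numpy as np
from scipy import stats

# Sketch of the comparison reported above, on synthetic (hypothetical) data:
# per-trial prestimulus rates split by the identity of the preceding second
# stimulus, compared with a two-sample t-test.
rng = np.random.default_rng(4)
pre_after_best = rng.normal(8.6, 3.0, 62)     # hypothetical rates, s^-1
pre_after_worst = rng.normal(5.7, 3.0, 61)
t_stat, p_val = stats.ttest_ind(pre_after_best, pre_after_worst)
print(f"t = {t_stat:.2f}, p = {p_val:.2g}")
```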
3.3 Attractor Dynamics. Attractor dynamics (associative memory) as a description of persistent DADs implies that each DAD is excited by a whole class of stimuli. Each of these stimuli raises the rates of a sufficient number of cells belonging to the DAD so that, via the learned collateral synaptic matrix, they can excite the entire subset of neurons characterizing the attractor and maintain them at elevated rates when the stimulus is removed. The class of stimuli leading to the same persistent distribution is the basin of attraction of that particular attractor. As one moves in the space of stimuli, at some point the boundary of the basin of attraction of the attractor is reached. Moving even slightly beyond this boundary, the network will relax either into another learned attractor or into the grand attractor of spontaneous activity. (See Amit & Brunel, 1997; and Amit, 1995.) To test the validity of the attractor picture, one has to be able to do the following:

1. Move in the space of stimuli by steps that are not too big, so that there is a choice between staying inside the basin and moving out. Recall that moving in this space does not mean only changing the stimuli, but changing them in a way that is IT observable.

2. Find IT-observable changes in stimuli that leave the delay activity unchanged.

3. Arrive at IT-observable changes in stimuli that will go over the edge (i.e., will not evoke the same DAD).

A convincing demonstration of these phenomena would require recording a large number of cells, especially since most cells with high rates in the same DAD (corresponding to the same stimulus) are predicted to lose their rates together. The prototypical behavior of such cells expresses itself in our single cell. First, Figure 5 demonstrates the required motion in stimulus space at the level of the IT cortex. In the five windows for the best stimulus, as well as in those for the worst, going from left to right, corresponding to increasing the degradation level as presented in Figure 4, there is a clear variation in the visual response: the neuron's spike rate during stimulation. This implies that the choice of the degradation mode is an IT-observable motion in stimulus space, as required in point 1 above. Second, in the top three windows in the best stimulus row, the delay activity is essentially invariant, as would have been expected if the stimulus were moving within the basin of attraction of that DAD. This corresponds to point 2 above. Third, as the level of degradation increases to reach the right-most two windows for the best stimulus, the delay activity disappears (see Figure 5). The average rate in the delay period becomes that of the reduced prestimulus activity discussed above, in accordance with the expectation in point 3 (see the table in Figure 5). This rate is also the same as that during the delay in all five windows corresponding to the worst stimulus.

The discussion of point 1 calls for a comment concerning the fact that as the best image is initially degraded, the visual response increases. Only at the fourth and fifth levels of degradation (3, 4) does the visual response decrease. This may seem contradictory, in that one might naively expect
Figure 8: Average delay activity rate (in s−1 , y-axis) versus average stimulus response (in s−1 , x-axis) for five levels of degradation of best stimulus. The degradation level is indicated above each data point (0 for pure image). Error bars are standard errors. Note that the stimulus response difference between 1 and 0, where delay activity is essentially constant, is larger than between 0 and 3, where delay activity collapses. Recall that 0 implies pure stimulus.
that the visual response should be maximal for the pure image and decrease monotonically with degradation. But on second thought, it should be clear that there is no reason that a particular cell should be maximally tuned to the image rather than to a degraded version of it. In fact, looking at other cells, we find no strong systematic trend for the visual response as a function of the degradation level (nor is there any argument why there should be one; Yakovlev, Fusi, Amit, Hochstein, & Zohary, 1995), although on average the visual response decreases with degradation. What is essential is that there be a set of neurons that have a strong visual response and a correspondingly significant elevated delay rate, such that when the images change moderately, the delay activity remains invariant, and when they change enough, the delay activity changes abruptly. Our paradigmatic cell would have belonged to such a class, had such a class existed.

This cell gives two more strong signals of the attractor property. Looking at the visual responses to the degraded best stimulus in the decreasing order of rates suggested above (2–1–0–3–4), one observes in the table in Figure 5 and in Figure 8 that the rate difference from 1 to 0 is 14.5 s−1 and there is no noticeable change in delay activity. Yet going from 0 to 3, the visual response changes by 13.1 s−1, and the delay activity collapses. In other words, the decrease in the delay activity seems precipitous as a function of the visual response, as would befit a watershed at the edge of the basin of attraction. The second signal is related to the fact that in the images used for testing the attractor property (see Figure 4), as the degradation noise is introduced, a circle appears around the figure (for technical reasons). This might have been perceived by the monkey as a different stimulus. That is, as the figure gets increasingly blurred, it is the circle that is most visible. Yet the delay
Table 1: Correlations of Delay Activity in Subintervals of ISI

First Stimulus Selection   Degradation Levels   Time Intervals   Number of Trials   MRC
B+W                        0-1-2                ISI1-ISI5        74                  0.399
B                          0-1-2-3-4            ISI1-ISI5        61                  0.382
B                          0-1-2                ISI1-ISI5        36                  0.080
B                          3-4                  ISI1-ISI5        25                 −0.123
W                          0-1-2                ISI1-ISI5        38                  0.089

Note: The intervals are the first 1 second following the removal of the stimulus and the last 1 second of the delay interval.
activity for the first three degradation levels (0–2) of the best stimulus remains unchanged with the appearance of the circle. The delay activity disappears when the degradation level crosses the critical level, despite the continued presence of the circle. Moreover, the circle produces no effect for the worst stimulus, despite the fact that it is visible there as well. This may be interpreted as due to the fact that the degradation circle appears only during testing and is not seen often enough to have been learned.

3.4 Delay Interval MRC. To test the attractor interpretation more closely, we calculated the MRC (see section 2.3) between the average rates in the 1 second immediately following the first stimulus and the 1 second immediately preceding the second stimulus. The results are summarized in Table 1. In a subset of trials with best and worst first stimuli and the first three degradation levels (0–2), we find MRC = 0.399. Note that in this subset of 74 trials, there are 36 trials with elevated delay activity and 38 with low delay activity. In other words, the high MRC may be attributed to the presence of two underlying subpopulations with different means. Similarly, in the subset of trials with best first stimuli and all degradation levels (second row in the table), there are 61 trials, of which 36 have elevated delay activity and 25 do not, and MRC = 0.382.

The 61 trials with best first stimulus are then divided into two subsets: one for the first three degradation levels (Nst = 36) and one for the last two (Nst = 25). In the first subset there is elevated delay activity; there is none in the second. For these two subsets we find MRC = 0.080 and −0.123, respectively. We conclude that the average delay rates in the first subset of trials remain high throughout the ISI and low in the second set. The deviations from the mean in each subset, at the beginning and the end of the ISI, are uncorrelated. This constitutes an example of the first option described at the end of section 2.3 and provides scales for high and low correlations.
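To make the statistical point concrete, the following minimal simulation (ours, not from the article; it assumes the MRC is the Pearson correlation of trial-by-trial mean rates, in line with its use here, and the rate values are invented) shows how pooling two subpopulations with different mean rates produces a high MRC even when the within-subset fluctuations are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trial-by-trial mean rates (in s^-1) in two 1-second intervals of the ISI.
# Two subpopulations: trials with elevated delay activity and trials without.
n_hi, n_lo = 36, 38
hi_start = rng.normal(15.0, 3.0, n_hi)   # start of ISI, elevated trials
hi_end   = rng.normal(15.0, 3.0, n_hi)   # end of ISI, independent fluctuations
lo_start = rng.normal(6.0, 3.0, n_lo)    # start of ISI, low trials
lo_end   = rng.normal(6.0, 3.0, n_lo)

def mrc(a, b):
    """Mean rate correlation: Pearson correlation of per-trial mean rates."""
    return np.corrcoef(a, b)[0, 1]

# Within each unimodal subset the correlation is near zero ...
print("MRC, elevated subset:", mrc(hi_start, hi_end))
print("MRC, low subset:     ", mrc(lo_start, lo_end))

# ... but pooling the two subsets yields a high MRC, driven purely by the
# difference of the two means (compare rows 1-2 vs. rows 3-5 of Table 1).
pooled_start = np.concatenate([hi_start, lo_start])
pooled_end   = np.concatenate([hi_end, lo_end])
print("MRC, pooled trials:  ", mrc(pooled_start, pooled_end))
```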
Table 2: Single Neurons versus Attractor Dynamics: Correlations Between Visual Response and Several 1-Second Intervals of the ISI

First Stimulus Selection   Degradation Levels   Time Intervals   Number of Trials   MRC
B                          0-1-2                ST1-ISI1         36                  0.407
B+W                        0-1-2-3-4            ST1-ISI1         123                 0.784
B                          0-1-2                ST1-ISI4         36                  0.085
B                          0-1-2                ST1-ISI5         36                 −0.098

Note: First two rows have high MRCs due to latency of the visual response. No correlation in later intervals.
3.5 Single Neurons versus Attractor Dynamics. An alternative scenario to the attractor picture may attribute the enhanced delay activity to some change in the internal state of single cells. Such a scenario would imply that the change in the internal state of the cell is triggered by the level of the visual response. This would be required to make the persistent delay activity stimulus selective. It would further imply a correlation between the rate during the ISI and the rate during the visual response. Table 2 summarizes the MRCs between the visual response to the first stimulus and various 1-second intervals of the ISI. We find that the rate during the visual response to the first stimulus is correlated with the rate in the first 1-second interval of the ISI (MRC = 0.407), even in a subset of trials with stimuli all leading to elevated delay activity, that is, the 36 trials corresponding to the best stimulus at degradation levels 0–2. This high MRC in a trial set with a narrow distribution of rates indicates that the trial-by-trial fluctuations of the rate in the first 1 second of the ISI are correlated with the fluctuations in the rate during the visual response. In fact, if the effect of splitting the means is added, by computing the MRC of the visual response rates with the rates in the first 1 second of the ISI for all trials (Nst = 123), we find MRC = 0.784. We associate these high correlations with the latency of the visual response. In fact, it appears from Figure 5 that the visual response penetrates about 250 ms into the ISI.

Next we take the same set of 36 trials (best with degradation 0–2), which gave MRC = 0.407 between the rate in the visual response and the rate in the first 1 second of the ISI. The MRCs between the rate of the visual response and the rates in the fourth and fifth 1-second intervals of the ISI are 0.085 and −0.098, respectively (third and fourth rows in the table). We conclude that beyond the first second of the ISI, the fluctuations of the visual response are not correlated with the fluctuations of the delay activity. This makes the single-cell scenario rather implausible.
3.6 Information Transmission Across Trials. The discussion in section 3.2 of the meaning of prestimulus activity provides a partial confirmation of the scenario of information transport between first stimuli in consecutive trials, essential for the formation of the Miyashita correlations (Miyashita, 1988). That scenario requires that information about the coding of one stimulus be able to traverse the time interval between consecutive trials. The scenario proposed (Griniasty et al., 1993; Amit et al., 1994; Amit, 1995) is that DADs, when they first become stable, are uncorrelated. Once stable, they would be provoked as much by a (same) second stimulus in a trial as by the first one. Since the second stimulus in half of the trials is the same as the first (to prevent bias in the DMS paradigm), in half of the training trials the activity in the intertrial interval will be the same as the delay activity elicited by the first stimulus of the preceding trial. If training is done with a fixed sequence of first stimuli, then upon half of the presentations of a given first stimulus, the activity distribution in the IT module during the intertrial interval would be the same as the delay activity corresponding to the immediately preceding first stimulus. The joint presence of the delay activity of the previous stimulus and the activity stimulated by the current first stimulus is hypothesized to be the information source for the learning, into the synaptic structure, of the correlations between the delay activities corresponding to consecutive images.

We calculated the MRCs between the average rate in the 1 second immediately following the second stimulus and the average rate in an interval of 1 second immediately preceding the successive first stimulus (see Table 3). The idea is that since (for technical reasons) we do not have recordings during the entire interval separating two trials, if the activity between trials were undisturbed by the monkey's response, so as to arrive intact at the next first stimulus, then the MRC of the activities in the two extreme subintervals between trials should be similar to that between the two extreme subintervals of the undisturbed ISI. In fact, MRC = 0.320 for the intertrial interval computed over all the available trials (Nst = 123). The entire set of trials is composed of two well-balanced subsets with two different intertrial mean rates. This is so because the second stimulus is never degraded: for Nst = 62 of the trials the preceding second stimulus is the pure best image, leading to high intertrial persistent activity, and for Nst = 61 of the trials, the preceding second stimulus is the pure worst image, leading to low intertrial activity. This value is compared with the MRCs of extreme ISI intervals in trial subsets of similar statistics, with balanced subsets of low and high averages. Those correspond to the first and second rows in Table 1. In fact, the numbers are close. If the 123 trials are separated into two subsets, one with preceding best and one with preceding worst, we find MRC = 0.038 and MRC = 0.120, respectively (rows 2 and 3 in Table 3). These are naturally compared with the MRCs of subintervals of trials with best and worst first stimulus, with degradation levels (0–2) (third and fifth rows in Table 1), which are, respectively, 0.080 (Nst = 36) and 0.089 (Nst = 38).
Table 3: Information Transmission Across Trials: MRCs of 1 Second Post–Second Stimulus and 1 Second Pre–Next First Stimulus

Second Stimulus of Previous Trial   Time Intervals   Number of Trials   MRC
B+W                                 POST and PRE     123                0.320
B                                   POST and PRE     62                 0.038
W                                   POST and PRE     61                 0.120

Note: For a few trials, the time interval after the second stimulus is less than 1 second; for those trials, the mean rate is calculated over the available time interval. First row: the high MRC is due to the presence of two different means. Last two rows: absence of correlation in unimodal subsets of trials.
4 Discussion

The cell we have discussed presents a rich phenomenology, naturally interpreted in the learning-attractor paradigm. To establish the various characteristics of attractor dynamics and their internal structure, many more well-recorded cells are required. Our underlying intention has been to exploit this cell merely as an example of the kind of features that a cell participating in a presumed attractor scenario may have. Looking back at the wealth of data drawn from this cell, we feel we can say more. This collection of data can hardly be accounted for in other paradigms suggested to date. Most prominent among these data is the propagation of the persistent elevated rates following the test stimulus, after this matching stimulus is turned off, and even after the behavioral reaction is completed. The amount of data collected for this cell and the tools we have sharpened to analyze them leave no doubt that this propagation can indeed take place. In fact, we have found that this persistent activity is as robust when it crosses the intertrial interval (including the behavioral response) as when it crosses the ISI delay interval between sample and match. What is surprising is that following the second stimulus and the reaction, there is no apparent need for maintaining the active memory. One might have expected that the persistent activity distribution would die down at this point. It does not. This has been observed also by Nakamura and Kubota (1995), but the novelty here is that the delay activity continues to persist even when there seems to be no functional role for it. What seriously modifies the delay activity is a new visual stimulus, either the second stimulus following the ISI or the first in another trial, much as in Miller, Li, and Desimone (1993).

The fact that a DAD can get across the intertrial interval is an essential pillar in the construction of the dynamic learning scenario for the generation of the Miyashita (1988) correlations in the internal representations (working
memory) of stimuli often experienced in contiguity. Our single-cell result suggests that such correlations may find their origin in the arrival of selective persistent activity related to one stimulus at the presentation of the consecutive stimulus in the next trial.

The collective attractor picture would have been fuller had we had a case in which there is no visual response and yet elevated delay activity. In Sakai and Miyashita (1991) such cases are shown. Yet the mean rate correlation analysis on this cell shows that neural activity during the delay period is correlated with the visual response only in the interval immediately following the stimulus. This is strong evidence that what determines the actual spike rate later in the ISI lies beyond the single cell. It is most likely the result of the interaction of this cell with other cells that participate in the same DAD, via potentiated synapses.

Finally, there is an additional, behavioral correlate to the activity of this cell. When the performance level (percentage of correct responses) of the monkey is considered versus the level of stimulus degradation, averaged across experiments, it is found that the performance is similar for the first three levels of degradation. Then, for the fourth and fifth levels, performance drops abruptly to chance level, just as our cell indicates that the stimuli are no longer in the basin of attraction of the DAD (Yakovlev et al., 1995). This effect calls for much further study, and it opens new vistas toward the identification of a potential functional role of the delay activity. Sakai and Miyashita (1991) turned to the pair-associate paradigm because the monkeys perform the DMS task well even for stimuli that do not generate delay activity ("new stimuli"). Here we see that in the absence of delay activity, the DMS task was not performed successfully if the sample stimulus was very degraded. Our paradigmatic cell suggests that the mechanism that allows matching to an identical new sample may not work when the monkey has to match to a very degraded image. Yet as long as the DAD (the attractor) is effective, it corrects for the degradation by attracting to the prototype delay activity, and the task can be performed.

Acknowledgments

Without the experimental and intellectual contribution of Shaul Hochstein and Ehud Zohary, this article would not have been possible. Had we had our way, both would have figured among the authors. We have benefited from the contribution of Gil Rabinovici to the experiment. This cell was found during his stay in our lab as a summer project in his course of premedical studies at Stanford University. We are also indebted to J. Maunsell and R. Shapley for help in the early stages of this experiment and to Misha Tsodyks for detailed comments. This work was partly supported by grants from the Israel Ministry of Science and the Arts and the Israel National Institute of Psychobiology and by Human Capital and Mobility grant ERB-CHRX-CT93-0245 of the EEC.
References

Amit, D. J. (1994). Persistent delay activity in cortex: A Galilean phase in neurophysiology? Network 5:429.
Amit, D. J. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences 18:617.
Amit, D. J., & Brunel, N. (1995). Learning internal representations in an attractor neural network with analogue neurons. Network 6:39.
Amit, D. J., & Brunel, N. (1997). Global spontaneous activity and local structured (learned) delay activity in cortex. Cerebral Cortex 7(2):237.
Amit, D. J., Brunel, N., & Tsodyks, M. V. (1994). Correlations of cortical Hebbian reverberations: Experiment vs theory. J. Neurosci. 14:6435.
Brunel, N. (1994). Dynamics of an attractor neural network converting temporal into spatial correlations. Network 5:449.
Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation 8:1677.
Fuster, J. M. (1995). Memory in the cerebral cortex. Cambridge, MA: MIT Press.
Griniasty, M., Tsodyks, M. V., & Amit, D. J. (1993). Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Computation 5:1.
Kovacs, G., Vogels, R., & Orban, G. A. (1995). Selectivity of macaque inferior temporal neurons for partially occluded shapes. J. Neurosci. 15:1984.
Miller, E. K., Li, L., & Desimone, R. (1993). Activity of neurons in anterior inferior temporal cortex during a short-term memory task. J. Neurosci. 13:1460.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335:817.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature 331:68.
Miyashita, Y., Higuchi, S., Sakai, K., & Masui, N. (1991). Generation of fractal patterns for probing visual memory. Neurosci. Res. 12:307.
Nakamura, K., & Kubota, K. (1995). Mnemonic firing of neurons in the monkey temporal pole during a visual recognition memory task. J. Neurophysiol. 74:162.
Sakai, K., & Miyashita, Y. (1991). Neural organisation for the long-term memory of paired associates. Nature 354:152.
Williams, G. V., & Goldman-Rakic, P. S. (1995). Modulation of memory fields by dopamine D1 receptors in prefrontal cortex. Nature 376:572.
Wilson, F. A. W., Scalaidhe, S. P. O., & Goldman-Rakic, P. S. (1993). Dissociation of object and spatial processing domains in primate prefrontal cortex. Science 260:1955.
Yakovlev, V., Fusi, S., Amit, D. J., Hochstein, S., & Zohary, E. (1995). An experimental test of attractor network behavior in IT of the performing monkey. Israel J. of Medical Sciences 31:765.
Received June 7, 1996; accepted September 11, 1996.
Communicated by Todd Leen and Christopher Bishop
Noise Injection: Theoretical Prospects

Yves Grandvalet
Stéphane Canu
CNRS UMR 6599 Heudiasyc, Université de Technologie de Compiègne, Compiègne, France

Stéphane Boucheron
CNRS-Université Paris-Sud, 91405 Orsay, France
Noise injection consists of adding noise to the inputs during neural network training. Experimental results suggest that it might improve the generalization ability of the resulting neural network. A justification of this improvement remains elusive: describing analytically the average perturbed cost function is difficult, and controlling the fluctuations of the random perturbed cost function is hard. Hence, recent papers suggest replacing the random perturbed cost by a (deterministic) Taylor approximation of the average perturbed cost function. This article takes a different stance: when the injected noise is gaussian, noise injection is naturally connected to the action of the heat kernel. This provides indications on the relevance domain of traditional Taylor expansions and shows the dependence of the quality of Taylor approximations on global smoothness properties of the neural networks under consideration. The connection between noise injection and the heat kernel also enables controlling the fluctuations of the random perturbed cost function. Under the global smoothness assumption, tools from gaussian analysis provide bounds on the tail behavior of the perturbed cost. This finally suggests that mixing input perturbation with smoothness-based penalization might be profitable.

1 Introduction

Neural network training consists of minimizing a cost functional C(·) on the set of functions F realizable by multilayer Perceptrons (MLP) with fixed architecture. The cost C is usually the averaged squared error,

C(f) = E_Z ( f(X) − Y )^2 ,   (1.1)
where the random variable Z = (X, Y) describing the data is sampled according to a fixed but unknown law. Because C is not computable, an
empirically computable cost is then minimized in applications using a sample z^ℓ = {z_i}_{i=1}^{ℓ}, with z_i = (x_i, y_i) ∈ R^d × R, gathered by drawing independent identically distributed data according to the law of Z. An estimate f̂_emp of the regression function f*(x) = arg min_{f∈L_2} C(f) is given by the minimization of the empirical cost C_emp:

C_emp(f) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2 .   (1.2)
The cost C_emp(·) is a random functional with expectation C(·). In order to justify the minimization of C_emp, the convergence of the empirical cost toward its expectation should be uniform with respect to f ∈ F (Vapnik, 1982; Haussler, 1992). When F is too large, this may not hold against some sampling laws. Hence practitioners have to trade the expressive power of F against the ability to control the fluctuations of C_emp(·). This suggests analyzing modified estimators that possibly restrict the effective search space.

One of these modified training methods consists of applying perturbations to the inputs during training. Experimental results in Sietsma and Dow (1991) show that noise injection (NI) can dramatically improve the generalization ability of MLP. This is especially attractive because the modified minimization problem can be solved thanks to the initial training algorithm. During NI, the original training sample z^ℓ is distorted by adding some noise η to the inputs x_i while leaving the target values y_i unchanged. During the kth epoch of the backpropagation algorithm, a new distortion η_k is applied to z^ℓ. The distorted sample is then used to compute the error and to derive the weight updates. A stochastic algorithm is thus obtained that eventually minimizes C^m_NIemp, defined as:

C^m_NIemp(f) = (1/m) \sum_{k=1}^{m} (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_{k,i}) − y_i )^2 ,   (1.3)
where the number of replications m is set under user control but finite. The average value of the perturbed cost is:

C_NI(f) = E_η [ (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_i) − y_i )^2 ].   (1.4)
In this article, the noise η is assumed to be a centered gaussian vector with independent coordinates: E[η] = 0 and E[η η^T] = σ^2 I. The success of NI is intuitively explained by asserting that minimizing C_NI (see equation 1.4) ensures that similar inputs lead to similar outputs. It raises two questions: When should we prefer to minimize C_NI rather than C_emp? How does C^m_NIemp converge toward C_NI?
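To fix ideas, here is a minimal sketch of the NI procedure itself (our illustration, not the authors' code; the architecture, learning rate, and noise level are hypothetical): at each epoch a fresh gaussian distortion is drawn for every input, the targets are untouched, and ordinary backpropagation is run on the distorted sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample: ell points in R^d with scalar targets.
ell, d, hidden = 20, 2, 8
x = rng.normal(size=(ell, d))
y = (x[:, 0] < 0).astype(float)

# One-hidden-layer MLP with sigmoidal units.
W1 = rng.normal(scale=0.5, size=(d, hidden))
W2 = rng.normal(scale=0.5, size=(hidden, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xb):
    h = sigmoid(xb @ W1)
    return h, sigmoid(h @ W2).ravel()

sigma2, lr, epochs = 2.5e-3, 0.5, 200
for epoch in range(epochs):
    # Fresh gaussian distortion of the inputs at each epoch (targets unchanged).
    eta = rng.normal(scale=np.sqrt(sigma2), size=x.shape)
    xb = x + eta
    h, out = forward(xb)
    err = out - y  # drives the squared-error gradient (constant factor absorbed in lr)
    # Backpropagation through the two sigmoid layers.
    g_out = (err * out * (1 - out))[:, None]
    gW2 = h.T @ g_out / ell
    g_h = (g_out @ W2.T) * h * (1 - h)
    gW1 = xb.T @ g_h / ell
    W2 -= lr * gW2
    W1 -= lr * gW1

print("final empirical cost:", np.mean((forward(x)[1] - y) ** 2))
```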
Recently, several authors (Webb, 1994; Bishop, 1995; Leen, 1995; Reed, Marks, & Oh, 1995; An, 1995) have resorted to Taylor expansions to describe the impact of NI and to motivate the minimization of C_NI rather than C_emp. They not only try to provide a formal description of NI but also aim at finding a deterministic alternative to the minimization of C^m_NIemp.

This article takes a different approach: when the injected noise is gaussian, the Taylor expansion approach is connected to the action of the heat kernel, and the dependence of C_NI (see equation 1.4) on the noise variance is shown to obey the heat equation (see section 2.1). This clear connection between partial differential equations and NI provides some indications on the relevance domain of traditional Taylor expansions (see section 2.2). Finally, we analyze the simplified expressions that are assumed to be valid locally around optimal solutions (see section 2.3).

The connection between NI and the action of the heat kernel also enables control of the fluctuations of the random perturbed cost function. Under some natural global smoothness property of the class of MLPs under consideration, tools from gaussian analysis provide exponential bounds on the probability of deviation of the perturbed cost (see section 3.3). This suggests that mixing NI with smoothness-based penalization might be profitable (see section 3.4).

2 Taylor Expansions

2.1 Gaussian Perturbation and Heat Equation. To exhibit the connection between gaussian NI and the heat equation, let us define u : R_+ × R^{ℓ×d} → R by:

u(0, x) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2   (2.1)

u(t, x) = E_η [ (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_i) − y_i )^2 ].   (2.2)
Obviously, we have u(0, x) = C_emp(f) and u(t, x) = C_NI(f) when t is the noise variance σ^2. Each value of the noise variance defines a linear operator T_t that maps u(0, ·) onto u(t, ·). Moreover, since the sum of two independent gaussians with variances s and t is gaussian with variance s + t, the family (T_t)_{t≥0} defines a semigroup, the heat semigroup (cf. Ethier & Kurtz, 1986, for an introduction to semigroup operators). The function u obeys the heat equation (cf. Karatzas & Shreve, 1988, chap. 4, secs. 3, 4):

∂u/∂t = (1/2) Δ_xx u   (2.3)
where Δ_xx is the Laplacian with respect to x, and where the initial conditions are defined by equation 2.1. For the sake of self-containment, a derivation of equation 2.3 is given in the appendix when initial conditions are square integrable. Let us denote by C_NI(f, t) the perturbed cost when the noise is gaussian of variance t; equation 2.3 yields:

C_NI(f, t) = C_emp(f) + (1/2) \int_0^t Δ_xx C_NI(f, s) ds.   (2.4)
Therefore, C_NI can be investigated in the purely analytical framework of partial differential equations (under the gaussian assumption). The possibility of forgetting about the original probabilistic setting when dealing with neural networks follows from the Tikhonov uniqueness theorem (Karatzas & Shreve, 1988, chap. 4, sec. 4.3).

Observation 2.1. If F is a class of functions definable by some feedforward architecture using sigmoidal, radial basis functions, or piecewise polynomials as activation functions, and if the injected noise follows a gaussian law, then the perturbed cost C_NI is the unique function of the variance t that obeys the heat equation 2.3 with initial conditions defined in equation 2.1.

Any deterministic faithful simulation of NI should use some numerical analysis software to integrate the heat equation and then run some backpropagation software on the result. We do not recommend such a methodology for efficiency reasons, but we insist that stochastic representations of solutions of partial differential equations (PDEs) have proved useful in analysis (Karatzas & Shreve, 1988). Methods reported in the literature (such as finite differences and finite elements; cf. Press, Teukolsky, Vetterling, & Flannery, 1992) assume that the function defining the initial conditions has bounded support. This assumption, which makes sense in physics, is not valid in the neural network setting. Hence Monte-Carlo methods appear to be the ideal technique to solve the PDE problem raised by NI.

2.2 Taylor Expansion Validity Domain. Let C_Taylor(f) be the first-order Taylor expansion of C_NI(f) as a function of t = σ^2. For various kinds of noise and function classes F, it has been shown in Matsuoka (1992), Webb (1994), Grandvalet and Canu (1995), Bishop (1995), and Reed et al. (1995) that:

C_Taylor(f) = C_emp(f) + (σ^2/2) Δ_xx C_emp(f).   (2.5)
In the context of gaussian NI, this means that the Laplacian is the infinitesimal generator of the heat semigroup. To emphasize the distinction
between equations 2.5 and 2.3, one should stress that the heat equation is a correct description of the impact of gaussian NI not only in the small variance limit, but for any value of the variance. The Taylor approximation validity domain is restricted to those functions f such that

lim_{σ^2 → 0} [ C_NI(f) − C_Taylor(f) ] / σ^2 = 0.   (2.6)
Observation 2.2. A sufficient condition for the Taylor approximation to be valid is that C_emp belongs to the domain of the generator of the heat semigroup. The empirical cost C_emp has to be a licit initial condition for the heat equation, which is always true in the neural network context (cf. conditions in Karatzas & Shreve, 1988, theorems 3.3, 4.2, chap. 4).

The preceding statement is purely analytical and does not say much about the relevance of minimizing C_Taylor while training neural networks. This issue may be analyzed in several directions: Is the minimization of C_Taylor equivalent to the minimization of C_NI? Is the minimization of C_Taylor interesting in its own right? The second issue and related developments are addressed in Bishop (1995) and Leen (1995). The first issue cannot be settled in a positive way for arbitrary variances in general, but the principle of minimizing C_NI (and C_Taylor) should be definitively ruled out if the minima of C_NI (and C_Taylor) did not converge toward the minima of C_emp when t = σ^2 → 0. This cannot be deduced directly from equation 2.6 since it describes only simple convergence. A uniform convergence over F is required. As we will vary t, let us denote by C_NI(f, t) (respectively, C_Taylor(f, t)) the perturbed cost (resp. its Taylor approximation) of f when the noise variance is t; we get:
C_NI(f, t) − C_Taylor(f, t) = (1/2) \int_0^t [ Δ_xx C_NI(f, s) − Δ_xx C_emp(f) ] ds.   (2.7)
If some uniform (over F and s ≤ t_0 < ∞) bound on |Δ_xx C_NI(f, s) − Δ_xx C_emp(f)| is available, then lim_{t→0} max_{f∈F} C_NI(f, t) − C_Taylor(f, t) = 0, and small manipulations using the triangle inequality reveal the convergence of the minima. The same argument shows the convergence of the minima of C_NI toward the minima of C_emp. If upper bounds are imposed on the weights of a sigmoidal neural network, those global bounds are automatically enforced. Imposing bounds on weights is an important requirement to ensure the validity of the Taylor approximation. We intuitively expect the truncation of the Taylor series to be valid in the small variance limit. But if f(x) = g(wx), where g is a parameterized function and w is a free parameter, then
Figure 1: Costs Cemp (top left), CTaylor (top right), and CNI (bottom left) in the space (a, b). These costs are compared on the bottom-right figures for two values of the parameter a: Cemp is solid, CTaylor is dotted, and CNI is dashed.
f(x + η) = g(wx + wη). The noise η injected in f appears to g as a noise wη, that is, as a noise of variance w^2 σ^2. Therefore, if g is nonlinear (i.e., f nonlinear), the Taylor expansion will fail for large w^2. A simple illustration is given in Figure 1. The sample contains 10 (x, y) pairs, where the x_i are regularly placed on [−0.5, 0.5], and y = 1_{x<0}: z^ℓ = {(−0.5, 1), (−0.4, 1), . . . , (−0.1, 1), (0.1, 0), . . . , (0.5, 0)}. The class F is defined by f(x) = exp(ax + b) / (1 + exp(ax + b)), a, b ∈ R, and the quadratic loss is used. The three error surfaces for C_emp, C_Taylor, and C_NI are represented for σ^2 = 2.5 × 10^{−3}. Although the noise variance is quite small, there are some noticeable differences: the cost C_NI is smoother than C_emp (effect of convolution), which is in turn smoother than C_Taylor. The dissimilarities of these surfaces can be shown for any nonzero value of σ^2 by a proper choice of the scale. The bottom-right drawings show the error as a function of b for two values of the scale parameter a. They show how the Taylor expansion becomes more and more inaccurate as a^2 increases. Note that the number of oscillations of C_Taylor is equal to the number of points in the sample.
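The effect can be reproduced numerically; the following sketch (ours, not from the article; the Laplacian of C_emp is taken by central finite differences and C_NI is estimated by Monte Carlo, so all values are illustrative) compares the three costs on the sample above for a moderate and a large scale parameter a:

```python
import numpy as np

rng = np.random.default_rng(0)

# The sample of Figure 1: 10 points on [-0.5, 0.5], targets y = 1_{x<0}.
x = np.array([-0.5, -0.4, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4, 0.5])
y = (x < 0).astype(float)
sigma2 = 2.5e-3

def f(a, b, xs):
    return 1.0 / (1.0 + np.exp(-(a * xs + b)))

def c_emp(a, b, xs=x):
    return np.mean((f(a, b, xs) - y) ** 2)

def c_taylor(a, b, h=1e-4):
    # C_emp + (sigma^2 / 2) * Laplacian_x C_emp (equation 2.5), with the
    # Laplacian estimated by central finite differences, one input at a time.
    lap = sum((c_emp(a, b, x + h * e) - 2 * c_emp(a, b)
               + c_emp(a, b, x - h * e)) / h ** 2
              for e in np.eye(len(x)))
    return c_emp(a, b) + 0.5 * sigma2 * lap

def c_ni(a, b, n_mc=20000):
    # Monte Carlo estimate of the average perturbed cost (equation 1.4).
    eta = rng.normal(scale=np.sqrt(sigma2), size=(n_mc, len(x)))
    return np.mean((f(a, b, x + eta) - y) ** 2)

for a in (-10.0, -100.0):   # moderate vs. large scale parameter
    b = 0.0
    print(f"a={a:6.1f}: C_emp={c_emp(a, b):.4f}  "
          f"C_NI={c_ni(a, b):.4f}  C_Taylor={c_taylor(a, b):.4f}")
```

For the moderate scale the three costs nearly coincide, while for the large scale the Taylor correction becomes wildly inaccurate, in line with the w^2 σ^2 argument above.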
Remark. Although equation 2.5 has been established for various kinds of noise (strong integrability properties are sufficient), C_NI is the solution of the heat equation only for gaussian noise. In nongaussian contexts, NI usually does not define a semigroup of operators indexed by variance.

2.3 Local Analysis and Simplified Expressions. Requiring uniform bounds on |Δ_xx C_NI(f, s) − Δ_xx C_emp(f)| may be too demanding. Fortunately, it is possible to adapt the local approach to NI suggested in Leen (1995) and Bishop (1995) to provide another derivation of the behavior of the minima of C_NI and C_Taylor. The local approach is concerned with the behavior of C_NI, C_Taylor, and possibly other cost functions such as the derivative-based regularizer (Leen, 1995; Bishop, 1995) around minima (global or local) of C_emp. Although Bishop and Leen focused their attention on the case where E(Y|X = x) is realizable by the neural architecture and the empirical measure closely mimics the sampling law (e.g., Leen, 1995, sec. 3.1.1), we believe that their local approach can be extended and applied to the comparison of the critical points of C_emp, C_NI, and C_Taylor.

A critical point of C_emp (and similarly for other costs) is a weight assignment where the gradient of C_emp vanishes. A critical point is nondegenerate iff the Hessian matrix has full rank. In the sequel, we will assume that C_emp has only nondegenerate critical points. This is not a major restriction since the set of target values y_i for which C_emp does have degenerate critical points has measure 0 for the kind of neural architectures mentioned here. Moreover, the number of nondegenerate critical points is finite (Sontag, 1996).

Observation 2.3. If C_emp has no degenerate critical points, then there exists some t_0 > 0 such that for any t, 0 ≤ t ≤ t_0, to any critical point of C_emp there corresponds a critical point of C_Taylor and a critical point of C_NI; moreover, those critical points are within distance Kt of each other for some constant K that depends on the sample under consideration.

Proof. The argument extends Leen's (1995) suggestion. In the course of the argument, we will assume that C_NI and C_Taylor have partial derivatives with respect to t at t = 0; this can be enforced by taking a linear continuation for t < 0. Let us assume that F is parameterized by W weights. Let ∇_w C_NI(f, t) and ∇_w C_Taylor(f, t) denote the gradients of C_NI and C_Taylor with respect to the weight assignment for some value of f and σ^2 = t (note that at t = 0 the two values coincide with ∇_w C_emp(f)). If f^• is some nondegenerate critical point of C_emp, then the matrix of partial derivatives of ∇_w C_NI(f, t) and ∇_w C_Taylor(f, t) with respect to weights and time has full rank; thus, by the implicit function theorem in its surjective form (Hirsch, 1976, p. 214), there exists a neighborhood of (f^•, 0),
and diffeomorphisms φ and ψ defined in a neighborhood of (0, 0) ∈ R^{W+1}, such that φ(0, 0) = (f^•, 0) (resp. ψ(0, 0) = (f^•, 0)) and ∇_w C_NI(φ(u, v)) = u (resp. ∇_w C_Taylor(ψ(u, v)) = u). The gradients of φ and ψ at (0, 0) have L_2 norm less than or equal to the norm of the matrix of partial derivatives of ∇_w C_emp(f^•, 0). For sufficiently small values of t, φ(0, t) and ψ(0, t) define continuous curves of critical points of C_NI(·, t) and C_Taylor(·, t). As the number of critical points of C_emp is finite, we may assume that those curves do not intersect and that the norms of the gradients of φ(0, t) and ψ(0, t) with respect to t are upper bounded. The observation follows.

Remark 1. For sufficiently small t, the local minima of C_emp can be injected into the set of local minima of C_Taylor and C_NI. For C_Taylor, the reverse is true. The definition of C_Taylor obeys the same constraints as the definition of C_emp: it is defined using solely +, ×, constants, and exponentiation; hence, by Sontag's bound, for any t, except on a set of measure 0 of target values y, the number of critical points of C_Taylor is finite, and the argument that was used in the proof works in the other direction. For sufficiently small t, C_Taylor does not introduce new local minima. This argument cannot be adapted to C_NI, whose definition also requires an integration. Nevertheless, experimental results suggest that NI tends to suppress spurious local minima (Grandvalet, 1995).

Remark 2. The validity of the observation relies on the choice of activation functions. If F were constituted by the class of functions parameterized by α ≥ 0 mapping x ↦ sin(αx), one could manufacture a sample such that C_emp and C_Taylor both have countably many nondegenerate minima, which are in one-to-one correspondence, and such that the convergence of the minima of C_NI toward the minima of C_emp as t tends toward 0 is not uniform.
3 Noise Injection and Generalization

The alleged improvement in generalization provided by NI remains to be analytically confirmed and explained. Most attempts to provide an explanation resort to the concepts of penalization and regularization. Usually penalization consists of adding a positive functional Ω(·) to the empirical risk. It is called a regularization if the sets {f : f ∈ F, Ω(f) ≤ α} are compact for the topology on F. Penalization and regularization are standard ways of improving regression and estimation techniques (e.g., Grenander, 1981; Vapnik, 1982; Barron, Birgé, & Massart, 1995). When F is the set of linear functions, NI has been recognized as a regularizer in the Tikhonov sense (Tikhonov & Arsenin, 1977). This is the only reported case, but it is enough to consider F as constituted by univariate degree-2 polynomials to realize that NI cannot generally be regarded as a
penalization procedure (Barron et al., 1995): C_NI − C_emp is not always positive. Thus, the improvement of generalization ability attributed to NI still requires some explanations.

3.1 Noise Injection and Kernel Density Estimation. An appealing interpretation connects NI with kernel estimation techniques (Comon, 1992; Holmström & Koistinen, 1992; Webb, 1994). Minimization of the empirical risk might be a poor or inefficient heuristic because the minima (if there are any) of C_emp could be far away from those of C. Recall that when trying to perform regression with sigmoidal neural networks, we have no guarantees that C_NI has a single global minimum, or even that the infimum of C_NI is realized (Auer, Herbster, & Warmuth, 1996). Hence the safest (and, to our knowledge, only) way to warrant the consistency of the NI technique is to get a global control on the fluctuations of C_NI(·) with respect to C(·), that is, on sup_F |C_NI(f) − C(f)|. The poor performance of minimization of the empirical risk could be due to the slow convergence of the empirical measure p̂_Z = (1/ℓ) \sum_i δ_{x_i, y_i} toward the sampling probability in the pseudometric induced by F (d_F(p̂ − p̂′) = sup_{f∈F} |E_p̂ (f(X) − Y)^2 − E_{p̂′} (f(X) − Y)^2|). But minimizing C_NI is equivalent (up to an irrelevant constant factor) to minimizing

(1/ℓ) \sum_{i=1}^{ℓ} E_{η,η′} ( f(x_i + η_i) − (y_i + η′) )^2 ,   (3.1)
where η′ is a scalar gaussian independent of η. Minimizing C_NI consists of minimizing the empirical risk against a smoothed version of the empirical measure: the Parzen-Rosenblatt estimator of the density (for details on the latter, see Devroye, 1987). The connection with gaussian kernel density estimation and regularization can thus be established in the perspective described in Grenander (1981). Because the gaussian kernel is the fundamental solution of the heat equation, the smoothed density that defines equation 3.1 is obtained by running the heat equation using the empirical measure as an initial condition (this is called the Weierstrass transform in Grenander, 1981). It is then tempting to explain the improvement in generalization provided by NI using upper bounds on the rate of convergence of the Parzen-Rosenblatt estimator. Though conceptually appealing, this approach might be disappointing; it actually suggests solving a density estimation problem as a subproblem of a regression problem. The former is much harder than the latter (Vapnik, 1982; Devroye, 1987).

3.2 Consistency of NI. To assess the consistency of minimizing C_NI, we would like to control C_NI(·) − C(·). A reasonable way of analyzing the fluctuations of C_NI(·) − C(·) consists of splitting it into two summands and
bounding each summand separately:

|C(f) − C_NI(f)| ≤ |C(f) − E_{p_Z} C_NI(f)| + |C_NI(f) − E_{p_Z} C_NI(f)|.   (3.2)
The first summand bounds the bias induced by taking the smoothed version of the squared error. It is not a stochastic quantity; it should be analyzed using tools from approximation theory. This first summand is likely to grow with t. On the contrary, the second term captures the random deviation of C_NI with respect to its expectation. Taking advantage of the fact that C_NI depends smoothly on the sample, it may be analyzed using tools from empirical process theory (cf. Ledoux & Talagrand, 1991, chap. 14). It is expected that this second summand decreases with t.

3.3 Bounding Deviations of Perturbed Cost. As practitioners are likely to minimize C^m_NIemp rather than C_NI, another difficulty has to be faced: controlling the fluctuations of C^m_NIemp with respect to C_NI. For a fixed sample, C^m_NIemp is a sum of independent random functions with expectation C_NI; in principle, it could also be analyzed using empirical process techniques. In the case of sigmoidal neural networks, the boundedness of the summed random variables ensures that for each individual f, C^m_NIemp(f) converges almost surely toward C_NI(f) as m → ∞ and that \sqrt{m} ( C^m_NIemp(f) − C_NI(f) ) converges in distribution toward a gaussian random variable. But using empirical processes would not pay tribute to the fact that the sampling process defined by NI is under user control, and in our case gaussian. Sufficiently smooth functions of gaussian vectors actually obey nice concentration properties, as illustrated in the following theorem:

Theorem 3.1 (Tsirelson, 1976; cf. Ledoux & Talagrand, 1991). Let X be a standard gaussian vector on R^d. Let f be a Lipschitz function on R^d with Lipschitz constant smaller than L; then:

P{ |f(X) − E f(X)| > r } ≤ 2 exp( −r^2 / (2L^2) ).

3.3.1 Pointwise Control of the Fluctuations of C^m_NIemp. If F is constituted by a class of sigmoidal neural networks, the differentiability assumption for the square loss as a function of inputs is automatically fulfilled.
Assumption 1. In the sequel, we will assume that all weights defining functions in F are bounded so that F is uniformly bounded by some constant M_0 and the gradient of f ∈ F is smaller than some constant L. Let M be a constant greater than M_0 + max_i y_i.

Then if (1/(ℓm)) \sum_{k=1}^{m} \sum_{i=1}^{ℓ} ( f(x_i + η_{k,i}) − y_i )^2 is regarded as a function on R^{ℓmd} provided with the Euclidean norm, its Lipschitz constant is upper bounded by 2LM / \sqrt{ℓm}.
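The stated Lipschitz constant follows from a short computation, sketched here for completeness (our derivation; the article asserts the bound without detail):

```latex
% Our sketch of the Lipschitz bound stated above.
% Write $g(\eta) = \frac{1}{\ell m}\sum_{k=1}^{m}\sum_{i=1}^{\ell}
%   \bigl(f(x_i+\eta_{k,i}) - y_i\bigr)^2$.
% The block of $\nabla g$ along the coordinates $\eta_{k,i}\in\mathbf{R}^d$ is
%   $\frac{2}{\ell m}\,\bigl(f(x_i+\eta_{k,i}) - y_i\bigr)\,
%    \nabla f(x_i+\eta_{k,i})$,
% of norm at most $2ML/(\ell m)$, since $|f-y|\le M$ and $\|\nabla f\|\le L$.
% Summing the squared norms of the $\ell m$ blocks gives
\[
  \|\nabla g(\eta)\|
  \;\le\; \sqrt{\,\ell m\,\Bigl(\frac{2ML}{\ell m}\Bigr)^{2}}
  \;=\; \frac{2LM}{\sqrt{\ell m}} .
\]
```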
Applying the preceding theorem to C^m_NIemp(·) implies the following observation:

Observation 3.1. If F satisfies assumption 1, then:

P_η{ |C^m_NIemp(f) − C_NI(f)| > r } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).
Remark. The dependence of the upper bound on the Lipschitz constant of C_NI with respect to x cannot be improved since theorem 3.1 is tight for linear functions, but the combined dependence on M and L has to be assessed for sigmoidal neural networks. For those networks, large inputs tend to generate small gradients; hence the upper bound provided here may not be tight.

3.3.2 Global Control on C^m_NIemp. Pointwise control of C^m_NIemp − C_NI is insufficient since we need to compare the minimization of C^m_NIemp with the minimization of C_NI. We may first notice that inf_{f∈F} C^m_NIemp(f) is a biased estimator of inf_{f∈F} C_NI(f):

E_η inf_{f∈F} C^m_NIemp(f) ≤ inf_{f∈F} C_NI(f).   (3.3)
Second, we may notice that inf_{f∈F} C^m_NIemp(f) is a concave function of the empirical measure defined by η; hence, it is a backward supermartingale¹ and thus converges almost surely toward a random variable as m → ∞. If C^m_NIemp is to converge toward C_NI, inf_{f∈F} C^m_NIemp(f) is due to converge toward inf C_NI. To go beyond this qualitative picture, we need to get a global control on the fluctuations of sup_{f∈F} |C^m_NIemp(f) − C_NI(f)|. Let g denote the function:

η ↦ sup_{f∈F} | (1/(mℓ)) \sum_{i,k} ( f(x_i + η_{k,i}) − y_i )^2 − C_NI(f) | .

If F satisfies assumption 1, then g is also Lipschitz with a coefficient less than 2LM / \sqrt{mℓ}.

Let us denote E_η sup_{f∈F} |C^m_NIemp(f) − C_NI(f)| by E. E may be finite or infinite. E is a function of t, ℓ, m.

¹ Up to some measurability conditions that are enforced for fixed-architecture neural networks.
Assumption 2. E is finite.

Theorem 3.1 also applies to g, and we get the following concentration result:

Observation 3.2. If F is a class of bounded Lipschitz functions satisfying assumptions 1 and 2, then

P_η{ sup_{f∈F} |C^m_NIemp(f) − C_NI(f)| > E + r } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).   (3.4)
This result is only partial in the absence of a good upper bound on E. It is possible to provide explicit upper bounds on E using metric entropy techniques and using the subgaussian behavior of the C^m_NIemp process described by observation 3.1. Using observation 3.2, we may partially control the deviations of approximate minima of C^m_NIemp with respect to the infimum of C_NI:

Observation 3.3. If F satisfies assumptions 1 and 2, and if, for any sample, f^• satisfies C^m_NIemp(f^•) < inf_{f∈F} C_NI(f) + ε, then

E C_NI(f^•) < inf_{f∈F} C_NI(f) + 2E + ε,

and

P_η{ C_NI(f^•) ≥ inf_{f∈F} C_NI(f) + ε + 2(E + r) } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).

If ε and E can be taken arbitrarily close to 0 using proper tuning of m and t, the corollary shows that minimizing C^m_NIemp is a consistent way of approximating the minimum of C_NI.

3.4 Combining Global Smoothness Prior and Input Perturbation. The derivative-based regularizers proposed in Bishop (1995) and Leen (1995) are based on sums of terms evaluated at a finite set of data points. It has been argued that they are not global smoothing priors. At first sight, it may seem that NI escapes this difficulty; the previous analysis, and particularly observation 3.1, as rough as it may be, shows that it may be quite hard to control NI in the absence of any smoothness assumption. Thus it seems cautious to supplement NI or data-dependent derivative-based regularizers with global smoothness constraints.

For sigmoidal neural networks, weight decay can be combined with NI. The sum M of the absolute values of the weights provides an upper bound on the output values. Let L be the product over the different layers of the
sum of the absolute values of the weights in those layers; L is an upper bound on the Lipschitz constant of the network. Penalizing by L + M would restrict the search space so that C_NI is well behaved. Such combinations of penalizers have been reported in the literature (Plaut, Nowlan, & Hinton, 1986; Sietsma & Dow, 1991) to be successful. However, it seems quite difficult to compare analytically the respective advantages of penalization alone and penalization supplemented with NI. This would require establishing lower rates of convergence for one of the techniques. Such rates are not available to our knowledge.

4 Conclusion and Perspective

Under a gaussian distribution assumption, NI can be cast as a stochastic alternative to a regularization of the empirical cost by a partial differential equation. This is more likely to facilitate a rigorous analysis of the NI heuristic than to be a source of efficient algorithms. It allows the assignment of a precise meaning to the concept of validity of Taylor approximations of the perturbed cost function. For sigmoidal neural networks, and more generally for classes of functions that can be defined using exponentials and polynomials, those Taylor approximations are valid in the sense that minimizing C_Taylor and C_NI turns out to be equivalent in the infinitesimal variance limit. The practical relevance of this equivalence remains questionable since practical uses of NI require finite noise variance. The relationship between NI and the heat equation enables establishing a simple bridge between regularization and the transformation of C_emp into C_NI: the contractivity of the heat semigroup ensures that the regularized version of the effective cost is easier to estimate than the original effective cost. Finally, because practical applications are more likely to minimize empirical versions C^m_NIemp of C_NI than C_NI itself, we resort to results from the theory of stochastic calculus to provide bounds on the tail probability of deviations of C^m_NIemp.

Despite the clarification provided by the heat connection, this is far from being the last word. A lot of quantitative work needs to be done if theory is to meet practice. The global bounds provided in section 3.3 need to be complemented by a local analysis focused on the neighborhood of the critical points of C_emp. Ultimately this should provide rates of convergence for specific F and classes of target dependencies E(Y|x). Because practitioners often use stochastic versions of backpropagation, it will be interesting to see whether the approach advocated here can refine the results presented in An (1995). NI is a naive and rather conservative way of trying to enforce translation invariance while training neural networks. Other, and possibly more interesting, forms of invariance under transformation groups deserve to be examined in the NI perspective, as proposed by Leen (1995). Transformation groups can often be provided with a Riemannian manifold structure on which some heat kernels may be defined; thus it is appealing to check
whether the generalizations of the tools presented here (the theory of diffusion on manifolds; cf. Ledoux & Talagrand, 1991, for some results) can facilitate the analysis of NI versions of tangent-prop–like algorithms.

Appendix: From Heat Equation to Noise Injection

This appendix rederives the relation between the heat equation and NI using Fourier analysis techniques. To put the idea in the simplest setting, the problem is treated in one dimension. The heat equation under concern is:

∂u/∂t = (1/2) Δ_xx u,

u(0, x) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2 .

The Fourier transform of the heat equation is:

∂û(t, ξ)/∂t = −(1/2) ξ^2 û(t, ξ).   (A.1)
This is an ordinary differential equation whose solution is:

û(t, ξ) = K(ξ) exp( −ξ^2 t / 2 ),   (A.2)
where the integration constant K(ξ) is the Fourier transform of the initial condition; thus:

û(t, ξ) = û(0, ξ) exp( −ξ^2 t / 2 ).   (A.3)
The inverse Fourier transform maps the product into a convolution; this entails:

u(t, x) = u(0, ·) ∗ F^{−1}( exp( −ξ^2 t / 2 ) ).   (A.4)
Thus, we recover the definition of C_NI:

C_NI(f, t) = C_emp(f) ∗ N(t),   (A.5)
where N(t) is the density of a centered normal distribution with variance t.
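The identity in equation A.4 (and hence the heat-equation description of NI) is easy to verify numerically. The sketch below (ours; it uses SciPy's gaussian filter as a discrete heat kernel on a toy one-dimensional initial cost, with finite differences for both derivatives, so agreement is only up to discretization error) checks that the gaussian-smoothed cost approximately satisfies ∂u/∂t = (1/2) u_xx away from the grid boundary:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy one-dimensional cost playing the role of u(0, x).
xs = np.linspace(-3.0, 3.0, 2001)
dx = xs[1] - xs[0]
u0 = (1.0 / (1.0 + np.exp(-5.0 * xs)) - 0.5) ** 2

def heat(u, t):
    # Gaussian smoothing with variance t, i.e., equation A.4 on a grid.
    return gaussian_filter1d(u, sigma=np.sqrt(t) / dx, mode="nearest")

t, dt = 0.05, 1e-4
u_t = heat(u0, t)

# Left side of the heat equation: du/dt by a forward difference in t.
dudt = (heat(u0, t + dt) - u_t) / dt
# Right side: (1/2) d^2u/dx^2 by central differences in x.
lap = (u_t[2:] - 2.0 * u_t[1:-1] + u_t[:-2]) / dx ** 2

interior = slice(300, -300)  # keep away from boundary effects
err = np.max(np.abs(dudt[1:-1][interior] - 0.5 * lap[interior]))
print("max |du/dt - 0.5 u_xx| on the interior:", err)  # should be small
```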
Acknowledgments

We thank two anonymous referees for helpful comments on a draft of this article, particularly for suggesting that NI or derivative-based regularizers should be combined with global smoothers. Part of this work has been done while Stéphane Boucheron was visiting the Institute for Scientific Interchange in Torino. Boucheron has been partially supported by the German-French PROCOPE program 96052.

References

An, G. (1995). The effects of adding noise during backpropagation training on generalization performance. Neural Computation 8:643–674.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. Advances in Neural Information Processing Systems 8:316–322.
Barron, A., Birgé, L., & Massart, P. (1995). Risk bounds for model selection via penalization (Tech. Report). Université Paris-Sud. http://www.math.upsud.fr/stats/preprints.html.
Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1):108–116.
Comon, P. (1992). Classification supervisée par réseaux multicouches. Traitement du Signal 8(6):387–407.
Devroye, L. (1987). A Course in Density Estimation. Basel: Birkhäuser.
Ethier, S., & Kurtz, T. (1986). Markov Processes. New York: Wiley.
Grandvalet, Y. (1995). Effets de l'injection de bruit sur les perceptrons multicouches. Unpublished doctoral dissertation, Université de Technologie de Compiègne, Compiègne, France.
Grandvalet, Y., & Canu, S. (1995). A comment on noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics 25(4):678–681.
Grenander, U. (1981). Abstract Inference. New York: Wiley.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100:78–150.
Hirsch, M. (1976). Differential Topology. New York: Springer-Verlag.
Holmström, L., & Koistinen, P. (1992). Using additive noise in back-propagation training. IEEE Transactions on Neural Networks 3(1):24–38.
Karatzas, I., & Shreve, S. (1988). Brownian Motion and Stochastic Calculus (2nd ed.). New York: Springer-Verlag.
Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces. Berlin: Springer-Verlag.
Leen, T. K. (1995). Data distributions and regularization. Neural Computation 7:974–981.
Matsuoka, K. (1992). Noise injection into inputs in backpropagation learning. IEEE Transactions on Systems, Man, and Cybernetics 22(3):436–440.
Plaut, D., Nowlan, S., & Hinton, G. (1986). Experiments on learning by back propagation (Tech. Rep. CMU-CS-86-126). Pittsburgh, PA: Carnegie Mellon University, Department of Computer Science.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical Recipes in C. Cambridge: Cambridge University Press.
Reed, R., Marks II, R., & Oh, S. (1995). Similarities of error regularization, sigmoid gain scaling, target smoothing and training with jitter. IEEE Transactions on Neural Networks 6(3):529–538.
Sietsma, J., & Dow, R. (1991). Creating artificial neural networks that generalize. Neural Networks 4(1):67–79.
Sontag, E. D. (1996). Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal neural networks. Advances in Computational Mathematics 5:245–268.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of Ill-Posed Problems. Washington, DC: W. H. Wilson.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.
Webb, A. (1994). Functional approximation by feed-forward networks: A least-squares approach to generalization. IEEE Transactions on Neural Networks 5(3):363–371.
Received February 1, 1996; accepted September 4, 1996.
Communicated by Marwan Jabri
The Faulty Behavior of Feedforward Neural Networks with Hard-Limiting Activation Function Zhiyu Tian Department of Automation, Tsinghua University, Beijing 100084, People’s Republic of China
Ting-Ting Y. Lin Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.
Shiyuan Yang
Shibai Tong
Department of Automation, Tsinghua University, Beijing 100084, People's Republic of China
With the progress in hardware implementation of artificial neural networks, the ability to analyze their faulty behavior has become increasingly important to their diagnosis, repair, reconfiguration, and reliable application. The behavior of feedforward neural networks with hard-limiting activation function under stuck-at faults is studied in this article. It is shown that the stuck-at-M faults have a larger effect on the network's performance than the mixed stuck-at faults, which in turn have a larger effect than that of stuck-at-0 faults. Furthermore, the fault-tolerant ability of the network decreases with the increase of its size for the same percentage of faulty interconnections. The results of our analysis are validated by Monte-Carlo simulations.

1 Introduction

Artificial neural networks (ANN) are generally claimed to be intrinsically reliable and fault tolerant in the neural network literature (Hecht-Nielsen, 1988; Lisboa, 1992). With the advancing research on their hardware implementation, the analysis of ANNs' faulty behavior becomes increasingly important for their diagnosis, repair, and reconfiguration. Faulty characteristics of feedback ANNs were discussed in Nijhuis and Spaanenburg (1989); Belfore and Johnson (1991); Chung and Krile (1992); and Protzel, Palumbo, and Arras (1993). The effect of weight disturbances on the performance degradation of feedforward ANNs has been examined by Stevenson, Winter, and Widrow (1990); Choi and Choi (1992); and Piche (1992). However, none of the feedforward analyses adopted the stuck-at fault
model, which has been a popular model in the analysis of feedback network interconnections. The use of the stuck-at fault model in characterizing feedforward ANNs is the focus here.

In this article, the ANN modeled has L layers, with the first layer being the input layer, the Lth layer being the output layer, and the L − 2 layers in between being the hidden layers. Each node in the output layer and the hidden layers receives the outputs of all the nodes in the preceding layer as its inputs and produces an output as follows:

$$ y_j = \mathrm{Sgn}\left( \sum_{i=1}^{n_{\ell-1}} w_i x_i + \theta_j \right), \qquad j = 1, \ldots, n_\ell, \qquad (1.1) $$
where $n_\ell$ is the number of nodes in the ℓth layer (ℓ = 2, ..., L); $x_i$ (i = 1, ..., $n_{\ell-1}$) is the ith input of the node, which takes only the two values +1 and −1; $w_i$ (i = 1, ..., $n_{\ell-1}$) is the weight of the ith input connection, with values in the range [−M, +M]; and $\theta_j$ is the bias, with values in the same range as $w_i$. When $n_{\ell-1}$ is large, the effect of the bias can often be neglected. Three types of stuck-at faults are considered: stuck-at-M (or stuck-at-(−M)), stuck-at-0, and a mixture of s-a-M (s-a-(−M)) and s-a-0. Specifically, interconnection faults are represented by either M (−M) or 0 replacing the corresponding $w_i x_i$. Our work focuses on the general effect of faults at the output; the inputs and weights are random variables assumed to be independent and identically distributed (iid), with means of 0. These assumptions reflect the fact that such networks are used in a variety of applications, so assuming any specific set of values for the inputs and weights would not be appropriate. The network is evaluated using, as its performance measure, the probability that the output is correct when faults occur: the higher the probability, the more fault tolerant the network.

Section 2 presents the effect of interconnection faults on the performance of an individual neuron. Section 3 shows the effect of erroneous inputs on the performance of an individual neuron. Section 4 combines the effects of both erroneous inputs and faulty interconnections on the performance of an individual neuron. The behavior of faulty feedforward networks with a hard-limiting activation function is presented in section 5, and the performance of the network when s-a-M, s-a-(−M), and s-a-0 faults exist is analyzed in section 6. Sections 7 and 8 conclude the article.

2 The Effect of Interconnection Faults on the Performance of an Individual Neuron

We begin with the fault-free case, where the output of a neuron, neglecting the bias θ, is:

$$ y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0, \\ -1 & \text{if } \sum_{i=1}^{n} w_i x_i < 0, \end{cases} \qquad (2.1) $$
where n is the number of inputs. In the case of $n_f$ faults, let F be the set of faulty interconnections and S be the set of normal ones; the output becomes:

$$ y' = \begin{cases} 1 & \text{if } \sum_{i \in S} w_i x_i + n_f M \ge 0, \\ -1 & \text{if } \sum_{i \in S} w_i x_i + n_f M < 0. \end{cases} \qquad (2.2) $$

The output error Δy, defined as the difference of the respective outputs, is:

$$ \Delta y = y' - y = \begin{cases} 2 & \text{if } \sum_{i \in S} w_i x_i + n_f M \ge 0 \text{ and } \sum_{i=1}^{n} w_i x_i < 0, \\ -2 & \text{if } \sum_{i \in S} w_i x_i + n_f M < 0 \text{ and } \sum_{i=1}^{n} w_i x_i \ge 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.3) $$
Let event A be $\sum_{i=1}^{n} w_i x_i < 0$ and event B be $\sum_{i \in S} w_i x_i + n_f M < 0$. Since $w_i x_i \le M$ for every $i \in F$, we have $\sum_{i=1}^{n} w_i x_i < \sum_{i \in S} w_i x_i + n_f M$, which means that B implies A (i.e., B ⊂ A). We therefore have

$$ \Pr\{AB\} = \Pr\{B\}, \qquad (2.4) $$

$$ \Pr\{\Delta y = 2\} = \Pr\{A\bar{B}\} = \Pr\{A\} - \Pr\{AB\} = \Pr\{A\} - \Pr\{B\}, \qquad (2.5) $$

$$ \Pr\{\Delta y = -2\} = \Pr\{\bar{A}B\} = \Pr\{B\} - \Pr\{AB\} = 0, \qquad (2.6) $$

where Pr{•} is the occurrence probability of the event •. Plugging in event B, we obtain

$$ \Pr\{B\} = \Pr\left\{ \sum_{i \in S} w_i x_i < -n_f M \right\}. \qquad (2.7) $$
We assume that $w_i$ and $x_i$ (i = 1, ..., n) are iid random variables; the mean and the variance of $w_i$ are 0 and $\sigma_w^2$, respectively. (When the means of the weights and inputs are not zero, the static offset does not affect the result of our analysis.) Since $x_i$ is either +1 or −1, $E[x_i^2] = 1$ (i = 1, ..., n). Furthermore, the mean of $w_i x_i$ is 0, and its variance $\sigma_{wx}^2$ is:

$$ \sigma_{wx}^2 = E[(w_i x_i)^2] - \{E[w_i x_i]\}^2 = E[w_i^2]\,E[x_i^2] = \sigma_w^2 + (E[w_i])^2 = \sigma_w^2. \qquad (2.8) $$

To solve for Pr{B}, we use the central limit theorem, which implies that when $n - n_f$ is sufficiently large, $\left(\sum_{i \in S} w_i x_i\right) / \left(\sqrt{n - n_f}\,\sigma_{wx}\right)$ approximates the normal
distribution N(0, 1). Equation 2.7 can now be written as:

$$ \Pr\{B\} = \Pr\left\{ \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w} < -\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w} \right\} \approx \int_{-\infty}^{-\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.9) $$

Equation 2.9 is now plugged into equations 2.5 and 2.6 to get the probability of the output error:

$$ \Pr\{\Delta y \ne 0\} = \Pr\{\Delta y = 2\} = \frac{1}{2} - \int_{-\infty}^{-\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.10) $$
Therefore, the probability that a neuron will produce a correct output, $P_c$, is:

$$ P_c = \Pr\{\Delta y = 0\} = 1 - \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.11) $$
Let $P = n_f / n$; equation 2.11 becomes

$$ P_c = 1 - \int_{0}^{\frac{P \sqrt{n}\, M}{\sqrt{1 - P}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.12) $$

It can be observed that $P_c$ is a monotonically decreasing function of P. We performed Monte Carlo simulations, 10,000 runs for each point, with $w_i$ (i = 1, ..., n) uniformly distributed in the range [−1, 1] and n = 50, 100, and 200, on a Compaq 486 computer, and contrasted the results with those obtained from equation 2.12 in Figure 1. The comparison shows that equation 2.12 is an accurate representation of the s-a-M faulty behavior of a neuron.
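The closed form is easy to check numerically. The sketch below is ours (not the authors' simulator) and uses the stated assumptions: inputs $x_i = \pm 1$ with equal probability, weights uniform on [−1, 1] (so $\sigma_w = 1/\sqrt{3}$), and M = 1.

```python
# A minimal sketch (not the authors' original simulator) of equation 2.12
# and a Monte Carlo check under the stated assumptions.
import math
import numpy as np

def pc_stuck_at_M(P, n, M=1.0, sigma_w=math.sqrt(1.0 / 3.0)):
    """Equation 2.12: Pc = 1 - integral_0^a phi(u) du with
    a = P * sqrt(n) * M / (sqrt(1 - P) * sigma_w)."""
    a = P * math.sqrt(n) * M / (math.sqrt(1.0 - P) * sigma_w)
    return 1.0 - 0.5 * math.erf(a / math.sqrt(2.0))

def pc_stuck_at_M_mc(P, n, M=1.0, runs=10_000, seed=0):
    """Empirical Pc: force n_f = P*n of the products w_i x_i to +M."""
    rng = np.random.default_rng(seed)
    n_f = int(round(P * n))
    correct = 0
    for _ in range(runs):
        wx = rng.uniform(-1.0, 1.0, n) * rng.choice([-1.0, 1.0], n)
        y = 1.0 if wx.sum() >= 0 else -1.0                          # eq 2.1
        y_faulty = 1.0 if wx[n_f:].sum() + n_f * M >= 0 else -1.0   # eq 2.2
        correct += (y == y_faulty)
    return correct / runs

for P in (0.02, 0.05, 0.10):
    print(f"P={P:.2f}  theory={pc_stuck_at_M(P, 100):.3f}"
          f"  simulation={pc_stuck_at_M_mc(P, 100):.3f}")
```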
We now consider the effects of s-a-0 interconnection faults. When such faults occur, the output error is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in S} w_i x_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in S} w_i x_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.13) $$
Figure 1: Performance of individual neurons under s-a-M faults.
Let

$$ s = \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w}, \qquad t = \frac{\sum_{i \in F} w_i x_i}{\sqrt{n_f}\,\sigma_w}; $$

then

$$ \Pr\{\Delta y = 2\} = \Pr\{\Delta y = -2\} = \Pr\left\{ \sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0,\ \sqrt{n - n_f}\,\sigma_w s \ge 0 \right\}. \qquad (2.14) $$

According to the central limit theorem, if both $n - n_f$ and $n_f$ are sufficiently large (in this context, four or more can be considered sufficiently large), both s and t approximate the normal distribution N(0, 1). Therefore equation 2.14 can be reduced to:

$$ \Pr\{\Delta y = 2\} \approx \frac{1}{2\pi} \iint_{D} e^{-\frac{s^2 + t^2}{2}}\, ds\, dt, \qquad (2.15) $$

where D is the region satisfying $\sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0$ and $\sqrt{n - n_f}\,\sigma_w s \ge 0$. Evaluating equation 2.15 in polar coordinates, we get:

$$ \Pr\{\Delta y = 2\} = \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}, \qquad (2.16) $$

where tan⁻¹ stands for the arc tangent.
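The wedge argument behind equation 2.16 is also easy to validate empirically; the sketch below is ours, not part of the paper.

```python
# Sketch (ours): equation 2.16 versus a Monte Carlo estimate of the
# probability that n_f stuck-at-0 faults flip a -1 output to +1.
import math
import numpy as np

def pr_flip_sa0(n_f, n):
    """Equation 2.16: Pr{dy = 2} = arctan(sqrt(n_f / (n - n_f))) / (2 pi)."""
    return math.atan(math.sqrt(n_f / (n - n_f))) / (2.0 * math.pi)

def pr_flip_sa0_mc(n_f, n, runs=200_000, seed=1):
    rng = np.random.default_rng(seed)
    wx = rng.uniform(-1.0, 1.0, (runs, n)) * rng.choice([-1.0, 1.0], (runs, n))
    full = wx.sum(axis=1)                 # fault-free pre-activation
    surviving = wx[:, n_f:].sum(axis=1)   # s-a-0 zeroes out n_f products
    return np.mean((full < 0) & (surviving >= 0))

print(pr_flip_sa0(10, 100), pr_flip_sa0_mc(10, 100))  # both near 0.051
```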
Figure 2: Performance of individual neurons under s-a-0 faults.
Hence, the probability that a neuron will produce a correct output is

$$ P_c = 1 - \Pr\{\Delta y = 2\} - \Pr\{\Delta y = -2\} = 1 - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}} = 1 - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{P}{1 - P}}, \qquad (2.17) $$

with $P = n_f / n$.
This is plotted against the Monte Carlo simulation in Figure 2. Although $P_c$ is a monotonically decreasing function of P, it is never smaller than 0.5. This is because a neuron with a hard-limiting activation function has only two possible outputs, +1 and −1. Even if all the input interconnections are faulty, the output still assumes one of the two values; thus, there is a 50 percent chance that the erroneous output coincides with the correct output.

3 The Effect of Erroneous Inputs on the Performance of an Individual Neuron

To study the effect of erroneous inputs on the performance of an individual neuron, we assume that there are $n_e$ erroneous inputs among all the n inputs of a neuron. Let R be the set of correct inputs and E be the set of erroneous inputs. The output error of the neuron under erroneous inputs
is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.1) $$

Let

$$ s = \frac{\sum_{i \in R} w_i x_i}{\sqrt{n - n_e}\,\sigma_w}, \qquad t = \frac{\sum_{i \in E} w_i x_i}{\sqrt{n_e}\,\sigma_w}; $$
when $n - n_e$ and $n_e$ are sufficiently large, both s and t approximate the normal distribution N(0, 1), and the probability of an incorrect output is

$$ \Pr\{\Delta y = 2\} = \Pr\{\Delta y = -2\} = \Pr\left\{ \sqrt{n_e}\, t \le \sqrt{n - n_e}\, s < -\sqrt{n_e}\, t \right\} = \frac{1}{2\pi} \iint_{D} e^{-\frac{s^2 + t^2}{2}}\, ds\, dt, \qquad (3.2) $$

where D is the region satisfying $\sqrt{n_e}\, t \le \sqrt{n - n_e}\, s < -\sqrt{n_e}\, t$. Through double integration we get:

$$ \Pr\{\Delta y = 2\} = \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (3.3) $$

Hence,

$$ P_c = 1 - \Pr\{\Delta y = 2\} - \Pr\{\Delta y = -2\} = 1 - \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} = 1 - \frac{2}{\pi} \tan^{-1} \sqrt{\frac{P}{1 - P}}, \qquad (3.4) $$
with $P = n_e / n$. Figure 3 compares equation 3.4 with the Monte Carlo simulation, with $w_i$ (i = 1, ..., n) uniformly distributed in the range [−1, 1] and 10,000 runs for each point. It is observed that $P_c$ is a monotonically decreasing function of P.

4 The Combined Effect of Both Erroneous Inputs and Faulty Interconnections on the Performance of an Individual Neuron
Figure 3: Performance of individual neurons under input errors.

We now consider the case in which there are $n_e$ erroneous inputs (the set of correct inputs is R and the set of incorrect inputs is E) and $n_f$ interconnection faults (the set of normal connections is S, while the set of faulty connections is F) among the n inputs to a neuron. This situation can be considered a combination of two sequential events: $n_e$ erroneous inputs appear, and then $n_f$ of the input connections suffer stuck-at faults. Based on equation 3.1, let

Event A be: $\sum_{i=1}^{n} w_i x_i < 0$.
Event B be: $\sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i < 0$. $\qquad (4.1)$
Event C be: when the $n_f$ stuck-at faults occur after the erroneous inputs appear, the output $y' < 0$.

The output error caused by both faulty inputs and interconnections is:

$$ \Delta y = \begin{cases} 2 & \text{on } A\bar{C}, \\ -2 & \text{on } \bar{A}C, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.2) $$

Then

$$ \Pr\{\Delta y = 2\} = \Pr\{A\bar{C}\} = \Pr\{AB\bar{C}\} + \Pr\{A\bar{B}\bar{C}\} = \Pr\{AB\}\Pr\{\bar{C}/AB\} + \Pr\{A\bar{B}\}\Pr\{\bar{C}/A\bar{B}\}. \qquad (4.3) $$

Event C occurs primarily due to B or B̄, and its relationship with A is negligible; thus we use the following approximations:

$$ \Pr\{\bar{C}/AB\} \approx \Pr\{\bar{C}/B\} = \frac{\Pr\{B\bar{C}\}}{\Pr\{B\}} = 2\Pr\{B\bar{C}\} \quad \text{(since } \Pr\{B\} = 1/2 \text{ by symmetry)}, \qquad (4.4) $$

$$ \Pr\{\bar{C}/A\bar{B}\} \approx \Pr\{\bar{C}/\bar{B}\} = 2\Pr\{\bar{B}\bar{C}\}. \qquad (4.5) $$
From equations 2.5 and 3.3 we have:

$$ \Pr\{AB\} = \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}, \qquad (4.6) $$

$$ \Pr\{A\bar{B}\} = \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.7) $$

If the input connection faults are of type s-a-M, then according to equations 2.10 and 2.6,

$$ \Pr\{B\bar{C}\} = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du, \qquad (4.8) $$

$$ \Pr\{\bar{B}\bar{C}\} = \Pr\{\bar{B}\} - \Pr\{\bar{B}C\} = \Pr\{\bar{B}\} = \frac{1}{2}. \qquad (4.9) $$
Substituting equations 4.4 through 4.9 into 4.3, we derive the following equation:

$$ \Pr\{\Delta y = 2\} = 2\left( \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right) \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du + \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.10) $$

Similarly,

$$ \Pr\{\Delta y = -2\} = \Pr\{\bar{A}C\} = \Pr\{\bar{A}B\}\Pr\{C/B\} + \Pr\{\bar{A}\bar{B}\}\Pr\{C/\bar{B}\} = \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \left( \frac{1}{2} - \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du \right). \qquad (4.11) $$

Hence

$$ \Pr\{\Delta y \ne 0\} = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du - \frac{4}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du + \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.12) $$
If the interconnection faults are s-a-0 faults, equations 4.8 and 4.9 become

$$ \Pr\{B\bar{C}\} = \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}, \qquad (4.13) $$

$$ \Pr\{\bar{B}\bar{C}\} = \Pr\{\bar{B}\} - \Pr\{\bar{B}C\} = \frac{1}{2} - \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}. \qquad (4.14) $$

Following a method similar to that used in deriving equation 4.12, we get:

$$ \Pr\{\Delta y \ne 0\} = \frac{1}{\pi} \left( \tan^{-1} \sqrt{\frac{n_f}{n - n_f}} + 2 \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} - \frac{4}{\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}\, \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right). \qquad (4.15) $$
5 The Faulty Behavior of Feedforward Networks with Hard-Limiting Activation Function

We now study the faulty behavior of feedforward networks with hard-limiting activation function. If a fraction $P_f$ of the input connections to all the nodes in the hidden layers and the output layer is faulty, then for every node in the ℓth layer (ℓ = 2, ..., L):

$$ n = n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.1) $$

$$ n_f = P_f \cdot n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.2) $$

$$ \Pr\{n_e = k\} = \binom{n_{\ell-1}}{k} P_{E_{\ell-1}}^{k} (1 - P_{E_{\ell-1}})^{n_{\ell-1} - k}, \qquad \ell = 2, \ldots, L, \qquad (5.3) $$
where $P_{E_\ell}$ is the probability that the output of a node in the ℓth layer is incorrect. For the first hidden layer, $n_e = 0$; for the layers that follow, $n_e \ne 0$, and we must use equation 4.12 or 4.15 to compute Pr{Δy ≠ 0}. When we examine equations 4.12 through 4.15, we find that $\alpha_1 = \tan^{-1}\sqrt{n_e / (n - n_e)}$ is present in both equations. Since $n_e$ is a random variable taking the value k with the probability expressed in equation 5.3, $\alpha_1$ is also a random variable. Averaging over $n_e = 0, 1, \ldots, n$, the mean of $\alpha_1$ is

$$ E[\alpha_1] = \sum_{k=0}^{n} \tan^{-1} \sqrt{\frac{k}{n - k}} \cdot \Pr\{n_e = k\}. \qquad (5.4) $$
Figure 4: Faulty performance of the network when every node has the same fraction of faulty interconnections.
If we substitute $\alpha_1$ with its mean in equation 4.12 or 4.15, $P_{E_\ell} = \Pr\{\Delta y \ne 0\}$ can be derived. This procedure can be repeated until ℓ = L to derive $P_{E_L}$ for the output layer. However, if $n_{\ell-1}$ is large, the calculation of $\alpha_1$ can be tedious for each $n_e = k$ (k = 0, 1, ..., $n_{\ell-1}$). We can then approximate this calculation by substituting the expected value of $n_e$,

$$ \bar{n}_e = P_{E_{\ell-1}}\, n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.5) $$

for $n_e$. Based on this approximation, we obtained the performance curves of $P_c$ and compared them to the simulation results of two types of three-layered networks (100, 100, 1 and 200, 200, 1) with s-a-M faults in Figure 4. Two thousand simulations are performed for each point in this and later figures. The error is within an acceptable range.

In practice, only the percentage of faulty interconnections is known a priori, while their distribution among all network interconnections is arbitrary. In that case, we use the expected value of $n_f$,

$$ \bar{n}_f = P_f \cdot n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.6) $$
for $n_f$ in equation 4.12. Both the theoretical and the simulation results of $P_c$ for three types of three-layered networks with s-a-M faults are shown in Figure 5.
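The layer-by-layer procedure is compact in code. The sketch below is ours; it reuses pr_err_combined from the previous listing and applies the mean-value approximations of equations 5.5 and 5.6.

```python
# Sketch (ours) of the section 5 procedure: propagate the per-node error
# probability PE_l through the layers, replacing n_e and n_f by their
# expected values (equations 5.5 and 5.6).
def network_pc(layer_sizes, P_f, fault="saM", M=1.0):
    """layer_sizes lists nodes per layer, input layer first, e.g. (100, 100, 1).
    Returns Pc = 1 - PE_L, the probability of a correct network output."""
    PE = 0.0                   # the first hidden layer sees no input errors
    for n_prev in layer_sizes[:-1]:
        n_f = P_f * n_prev     # expected number of faulty connections (5.6)
        n_e = PE * n_prev      # expected number of erroneous inputs (5.5)
        PE = pr_err_combined(n_prev, n_e, n_f, fault, M)   # eq 4.12 or 4.15
    return 1.0 - PE

for sizes in ((100, 100, 1), (200, 200, 1)):
    print(sizes, network_pc(sizes, P_f=0.05, fault="saM"))
```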
Figure 5: Performance of three-layered networks under s-a-M faults.
Furthermore, we generated the performance curves for the three-layered and five-layered networks under s-a-0 faults and compared them to the simulation results in Figure 6. Although the error is larger than that in Figure 4, it gradually decreases as the network size increases. In addition, both the theoretical and the simulation results follow the same curvature.

6 The Network Performance When All Three Types of Stuck-At Faults Exist

In the previous sections, we analyzed network performance in the presence of only one type of stuck-at fault. In reality, more than one type of fault may exist simultaneously. We study the case of multiple faults given that a fault is an s-a-M or s-a-(−M) fault, each with probability q (0 < q ≤ 1/2), or an s-a-0 fault with probability 1 − 2q. Thus $w_i x_i$ on a faulty interconnection is replaced by a random variable $z_i$ (i ∈ F), which takes the values M, −M, and 0 with probabilities q, q, and 1 − 2q, respectively. It follows that $E[z_i] = 0$ and $\mathrm{Var}[z_i] = 2M^2 q$. The output of an individual neuron with $n_f$ faulty interconnections is:
$$ y' = \begin{cases} 1 & \text{if } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0, \\ -1 & \text{if } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i < 0. \end{cases} \qquad (6.1) $$
Figure 6: Performance of networks under s-a-0 faults.
The output error is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (6.2) $$
Therefore,

$$ \Pr\{\Delta y = 2\} = \Pr\left\{ \sum_{i=1}^{n} w_i x_i < 0,\ \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0 \right\}. \qquad (6.3) $$
Let

$$ s = \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w}, \qquad t = \frac{\sum_{i \in F} w_i x_i}{\sqrt{n_f}\,\sigma_w}, \qquad u = \frac{\sum_{i \in F} z_i}{\sqrt{2 n_f q}\, M}; $$

according to the central limit theorem, when $n - n_f$ and $n_f$ are sufficiently large, s, t, and u all follow the normal distribution N(0, 1). Therefore:

$$ \Pr\{\Delta y = 2\} = \Pr\left\{ \sqrt{n - n_f}\,\sigma_w s + \sqrt{2 n_f q}\, M u \ge 0,\ \sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0 \right\} = \iiint_{\Omega} \frac{1}{(2\pi)^{3/2}}\, e^{-\frac{1}{2}(s^2 + t^2 + u^2)}\, ds\, dt\, du, \qquad (6.4) $$

where Ω is the three-dimensional region satisfying $\sqrt{n - n_f}\,\sigma_w s + \sqrt{2 n_f q}\, M u \ge 0$ and $\sqrt{n - n_f}\, s + \sqrt{n_f}\, t < 0$. Imagine this region inside a huge sphere centered at the origin with radius R → ∞. Because the two boundary planes intersect at the origin, the region Ω can be regarded as a lune whose angle is the dihedral angle between the two planes, which is:

$$ \alpha = \tan^{-1} \sqrt{\frac{n_f}{n - n_f} + \frac{2 n\, n_f\, q M^2}{(n - n_f)^2 \sigma_w^2}} = \tan^{-1} \sqrt{\frac{P_f}{1 - P_f} + \frac{2 P_f\, q M^2}{(1 - P_f)^2 \sigma_w^2}}. \qquad (6.5) $$

We can rotate the coordinates so that the u axis coincides with the intersection line of the two planes. Employing cylindrical coordinates, we then have:

$$ \Pr\{\Delta y = 2\} = \int_{-\infty}^{\infty} \int_{0}^{\alpha} \int_{0}^{\infty} \frac{1}{(2\pi)^{3/2}}\, e^{-\frac{u^2 + r^2}{2}}\, r\, dr\, d\theta\, du = \frac{\alpha}{2\pi}. \qquad (6.6) $$

Hence the probability of an incorrect output is

$$ \Pr\{\Delta y \ne 0\} = \Pr\{\Delta y = 2\} + \Pr\{\Delta y = -2\} = \frac{\alpha}{\pi}. \qquad (6.7) $$
Figure 7 shows the comparison between the theoretical and the simulation results for a neuron's performance under such mixed faults. The simulation follows equation 6.7 closely.

We now proceed to study the combined effect of erroneous inputs and the mixed stuck-at faults on the performance of a neuron. In the case of mixed faults, equations 4.8 and 4.9 become:

$$ \Pr\{B\bar{C}\} = \frac{\alpha}{2\pi}, \qquad (6.8) $$

$$ \Pr\{\bar{B}\bar{C}\} = \frac{1}{2} - \frac{\alpha}{2\pi}. \qquad (6.9) $$
Figure 7: Performance of individual neurons under mixed faults.
Therefore, the probability of an incorrect output is:

$$ \Pr\{\Delta y \ne 0\} = 2\Pr\{\Delta y = 2\} = \frac{1}{\pi} \left( \alpha + 2 \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} - \frac{4}{\pi}\, \alpha \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right). \qquad (6.10) $$

Following the steps described in section 5, we can calculate the faulty performance of the network. Figure 8 shows the performance curves for a three-layered network (100, 100, 1), a four-layered network (100, 100, 100, 1), and a five-layered network (100, 100, 100, 100, 1), with q = 1/3. From Figure 7, we see that the performance of an individual neuron degrades as q increases. This confirms that s-a-M (s-a-(−M)) faults have a larger effect on the network's performance than s-a-0 faults.

7 Discussion

It has been believed that the fault-tolerance ability of ANNs improves with the growth of network size (Nijhuis & Spaanenburg, 1989). We have shown the opposite for feedforward ANNs with hard-limiting activation functions when the fraction (instead of the number) of faulty interconnections is considered. This was demonstrated initially in Figure 5 for networks with the same number of layers: the more hidden nodes there are, the faster the performance degrades with an increased percentage of s-a-M interconnections.
Figure 8: Performance of ANNs under mixed stuck-at faults.
Furthermore, Figure 9 shows the performance curves for networks with different numbers of layers but the same number of nodes in every layer: tolerance for s-a-M faults weakens as layers are added. When comparing two networks with the same total number of hidden nodes, as shown in Figure 10, the network with fewer layers (three layers: 100, 100, 1) is more fault tolerant than the one with more layers (100, 50, 50, 1). The same conclusion can be drawn for s-a-0 and mixed faults, as shown in Figures 6 and 8. This suggests that, provided other performance requirements are satisfied, ANN designers should limit the number of layers and the number of nodes per layer to obtain better fault tolerance in feedforward neural networks with hard-limiting activation functions.

8 Conclusion

In this article, we have analyzed the behavior of feedforward neural networks with hard-limiting activation function under stuck-at faults. The relationship between the network performance and the number of faulty interconnections, together with the method for analyzing the faulty behavior, is presented. Through comparison with Monte Carlo simulation results, the correctness of our formulation for individual neurons is verified. Furthermore, our results demonstrate that increasing the size of a feedforward neural network will lower its ability to tolerate stuck-at faults when
Figure 9: s-a-M faulty performance of networks with different numbers of layers.

Figure 10: Comparison between the faulty performance of networks with the same total number of hidden nodes but different numbers of hidden layers.
the fraction of faults is considered and that s-a-M (s-a-(−M)) faults have a larger effect on the network than s-a-0 faults. These conclusions are extremely useful in the design of feedforward networks.
References

Belfore, L. A., & Johnson, B. W. (1991). The analysis of the faulty behavior of synchronous neural networks. IEEE Trans. on Computers, 40, 1424–1429.

Choi, J. Y., & Choi, C.-H. (1992). Sensitivity analysis of multilayer perceptron with differentiable activation functions. IEEE Trans. on Neural Networks, 3, 101–107.

Chung, P.-C., & Krile, T. F. (1992). Characteristics of Hebbian-type associative memories having faulty interconnections. IEEE Trans. on Neural Networks, 3, 969–980.

Hecht-Nielsen, R. (1988). Neurocomputing: Picking the human brain. IEEE Spectrum, 25, 36–41.

Lisboa, P. G. J. (1992). Neural networks. London: Chapman Hall.

Nijhuis, J. A. G., & Spaanenburg, L. (1989). Fault tolerance of neural associative memories. IEEE Proceedings, 136 (pt. E), 389–394.

Piche, S. (1992). Robustness of feedforward neural networks. In Proceedings of IJCNN (pp. 346–351). Baltimore, MD.

Protzel, P. W., Palumbo, D. L., & Arras, M. K. (1993). Performance and fault-tolerance of neural networks for optimization. IEEE Trans. on Neural Networks, 4, 600–614.

Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity of feedforward neural networks to weight errors. IEEE Trans. on Neural Networks, 1, 71–80.
Received February 24, 1994; accepted September 4, 1996.
Communicated by Bill Horne
Analysis of Dynamical Recognizers

Alan D. Blair, Jordan B. Pollack
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254-9110, USA
Pollack (1991) demonstrated that second-order recurrent neural networks can act as dynamical recognizers for formal languages when trained on positive and negative examples, and observed both phase transitions in learning and iterated function system–like fractal state sets. Follow-on work focused mainly on the extraction and minimization of a finite state automaton (FSA) from the trained network. However, such networks are capable of inducing languages that are not regular and therefore not equivalent to any FSA. Indeed, it may be simpler for a small network to fit its training data by inducing such a nonregular language. But when is the network's language not regular? In this article, using a low-dimensional network capable of learning all the Tomita data sets, we present an empirical method for testing whether the language induced by the network is regular. We also provide a detailed ε-machine analysis of trained networks for both regular and nonregular languages.

Neural Computation 9, 1127–1142 (1997)

1 Introduction

Pollack (1991) showed one way a recurrent network may be trained to recognize formal languages from examples. The resulting networks often displayed complex limit dynamics, which were fractal in nature (Kolen, 1993). Alternative architectures had been employed earlier for related tasks (Jordan, 1986; Pollack, 1987; Cleeremans, Servan-Schreiber, & McClelland, 1989). Others have been proposed since (Watrous & Kuhn, 1992; Frasconi, Gori, & Soda, 1995; Zeng, Goodman, & Smyth, 1994; Das & Mozer, 1994; Forcada & Carrasco, 1995), and a number of approaches to analyzing recurrent networks have been developed. One of the principal themes has been the use of clustering and minimization techniques to extract a finite state automaton (FSA) that approximates the network's behavior (Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Sajda, 1995). Casey (1996) showed that if the network robustly models an FSA, the method proposed in Giles et al. (1992) will successfully extract the FSA given a fine enough resolution. In many cases, however, the language induced by the network is not regular, and therefore it cannot be modeled exactly by any FSA. In order to analyze the behavior of these networks, we need a
Figure 1: System architecture.
method for testing empirically whether the network's behavior is regular (i.e., whether it is modeling an FSA). In this article, we show how to perform an analysis of the trained network at high resolution by constructing a series of finite state approximations that describe its behavior at progressively more refined levels of detail. With this analysis we are able to measure empirically when the network has crossed the boundary between regular and nonregular behavior.

2 Architecture

Our system comprises a gated recurrent neural network with two subnetworks W₀ and W₁, a gating device, and a perceptron P (see Figure 1). For σ = 0, 1 we denote by $W_\sigma^{jk}$ the real-valued weights of subnetwork $W_\sigma$ and by $w_\sigma$ the function it implements. To operate the network, we first feed the initial vector x₀ into the input layer. Then, as each symbol of a binary string Σ = σ₁, ..., σₙ is read in, the gating device chooses one of the two subnetworks W₀ or W₁, depending on whether the next symbol σₜ is equal to 0 or 1. The current real-valued
hidden unit state vector $x_{t-1}$ is then fed through the chosen subnetwork $W_{\sigma_t}$ to produce the next state $x_t = w_{\sigma_t}(x_{t-1})$, given by

$$ x_t^j = \tanh\left( W_{\sigma_t}^{j0} + \sum_{k=1}^{d} W_{\sigma_t}^{jk}\, x_{t-1}^k \right), \qquad 1 \le j \le d,\ 1 \le t \le n. $$

This part of the architecture is equivalent to the second-order recurrent networks used in Pollack (1991) and Giles et al. (1992), with a slightly different notation. After the whole string has been read, the final state $x_n$ is fed through a perceptron P, producing output

$$ z = \tanh\left( P_0 + \sum_{j=1}^{d} P_j\, x_n^j \right), $$
where each $P_j$ is a real-valued weight and z is the real-valued output. To make the geometry more accessible, we use the interval [−1, 1] for network activations and adopt the hyperbolic tangent as our transfer function, which is equivalent by rescaling to the usual sigmoid function. The object of training is that the output z should be close to +1 for accept strings and close to −1 for reject strings ("close" meaning within, say, 0.4). The class of languages recognizable by such networks lies somewhere between the regular and recursive language families and is known to contain some languages that are not context free (Siegelmann & Sontag, 1992). In the work reported here the state-space dimension d was always equal to 2, which gives the model a total of 17 free parameters: 3 for the perceptron, 6 for each subnetwork, and 2 for the initial point. As outlined below, this architecture was adequate to the task of learning any of the data sets from Tomita (1982) within a few hundred training epochs.
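A minimal sketch of this recognizer in NumPy follows; the class name and the random initialization are ours, and training (backpropagation through time) is omitted. For d = 2 it has exactly the 17 free parameters counted above.

```python
# Sketch (ours) of the gated second-order recognizer: two subnetworks
# W_0 and W_1, a perceptron P, and a trainable initial point x_0.
import numpy as np

class DynamicalRecognizer:
    def __init__(self, d=2, seed=0):
        rng = np.random.default_rng(seed)
        # Column 0 of W[sigma] holds the biases W^{j0}; the rest are W^{jk}.
        self.W = [rng.normal(0.0, 0.5, (d, d + 1)) for _ in (0, 1)]
        self.P = rng.normal(0.0, 0.5, d + 1)   # P[0] is the perceptron bias
        self.x0 = rng.uniform(-1.0, 1.0, d)    # initial state, also trained

    def state(self, string):
        x = self.x0
        for ch in string:                      # gate on each symbol in turn
            W = self.W[int(ch)]
            x = np.tanh(W[:, 0] + W[:, 1:] @ x)
        return x

    def output(self, string):
        x = self.state(string)                 # z close to +1 means accept,
        return np.tanh(self.P[0] + self.P[1:] @ x)  # close to -1 means reject

net = DynamicalRecognizer()
print(net.output("1011"))
```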
3 Analysis

Our purpose here is not only to train the network but also to gain some insight into how it accomplishes its assigned task (see Casey, 1996, for another, related approach to this problem). If the recurrent layer has d nodes, the state space of the system is the d-dimensional hypercube X = [−1, 1]^d. However, we need not be concerned with the whole state space, but only with that portion of it, which we call A₀, that is accessible from the initial point x₀ via repeated applications of the maps w₀ and w₁. Formally, we may define A₀ thus:

1. Y₀ = {x₀}.
2. For i ≥ 0, $Y_{i+1} = Y_i \cup w_0(Y_i) \cup w_1(Y_i)$ (note that $Y_{i+1} \supseteq Y_i$). (3.1)
3. $A_0 = \lim_{i \to \infty} Y_i$.
Once we have identified the space A₀ on which the dynamics take place, the next step is to find an appropriate subdivision of this space (a finite collection of disjoint nonempty subsets covering A₀) that is compatible with the maps w₀ and w₁. The coarsest possible subdivision is provided by the perceptron, which divides A₀ into the subset $U_{acc}$ of "accept" points and the subset $U_{rej}$ of "reject" points. So we take as our first subdivision $\mathcal{S}_1 = \{U_{acc}, U_{rej}\}$. If $\mathcal{S} = \{U_1, \ldots, U_m\}$ is a subdivision, the preimage of $\mathcal{S}$ under w₀ is

$$ w_0^{-1}(\mathcal{S}) = \{ w_0^{-1} U_1, \ldots, w_0^{-1} U_m \} \setminus \{\emptyset\}, $$

where $w_0^{-1} U_j = \{ x \in A_0 \mid w_0(x) \in U_j \}$ (and similarly for $w_1^{-1}(\mathcal{S})$). The join $\mathcal{R} \vee \mathcal{S}$ of two partitions $\mathcal{R}$ and $\mathcal{S}$ is

$$ \mathcal{R} \vee \mathcal{S} = \{ U \cap V \mid U \in \mathcal{R},\ V \in \mathcal{S} \} \setminus \{\emptyset\}. $$

Using these tools, we define a series of subdivisions $\{\mathcal{S}_i\}_{i \ge 1}$ by

1. $\mathcal{S}_1 = \{U_{acc}, U_{rej}\}$.
2. For i ≥ 1, $\mathcal{S}_{i+1} = \mathcal{S}_i \vee w_0^{-1}(\mathcal{S}_i) \vee w_1^{-1}(\mathcal{S}_i)$.
We are now ready to construct a series of nondeterministic FSAs $\{M_i\}_{i \ge 1}$ that approximate the behavior of the network. For each i, let $M_i$ be as follows:

1. The states of $M_i$ are indexed by the elements $U_j$ of the subdivision $\mathcal{S}_i$.
2. The initial state is the $U_j$ that contains x₀.
3. The accept states are those $U_j$ contained in $U_{acc}$.
4. An edge labeled 0 connects $U_j$ to $U_k$ exactly when $U_j \cap w_0^{-1} U_k \ne \emptyset$.
5. An edge labeled 1 connects $U_j$ to $U_k$ exactly when $U_j \cap w_1^{-1} U_k \ne \emptyset$.

Each FSA in the series is a refinement of the previous one, which sketches out some of the details of its nondeterminism.

Proposition. If $M_i$ is deterministic for some i, then $M_j$ will equal $M_i$ for all j > i, and $M_i$ will exactly model the network's behavior.

Proof. Suppose $M_i$ is deterministic for some i and consider the input string Σ = σ₁, ..., σₙ, with σⱼ ∈ {0, 1} for 1 ≤ j ≤ n. We must show that Σ is accepted by $M_i$ if and only if it is accepted by the network. Recall that x₀ denotes the initial point, and let $x_j = w_{\sigma_j}(x_{j-1})$ for 1 ≤ j ≤ n. Recall that the states of $M_i$ are indexed by the elements $U_k$ of subdivision $\mathcal{S}_i$, and for 0 ≤ j ≤ n let $U^{(j)}$ be the subset $U_k$ that contains $x_j$. Then in particular $U^{(0)}$
contains x₀ and is therefore the initial state of $M_i$. Since $x_j = w_{\sigma_j}(x_{j-1})$, we have

$$ U^{(j-1)} \cap w_{\sigma_j}^{-1} U^{(j)} \ne \emptyset, $$

so $U^{(j-1)}$ is connected to $U^{(j)}$ by an edge labeled $\sigma_j$. As $M_i$ is deterministic, this must be the only such edge emanating from $U^{(j-1)}$. By induction on j, it follows that $M_i$ must be put into state $U^{(j)}$ when the string σ₁, ..., σⱼ is input. In particular, once the whole string has been input, $M_i$ will be in state $U^{(n)}$, so

Σ is accepted by $M_i$ ⇔ $U^{(n)} \subseteq U_{acc}$ ⇔ $x_n \in U_{acc}$ ⇔ Σ is accepted by the network.

To show that $M_{i+1} = M_i$ and no further subdivisions are made, suppose to the contrary that there are two points x₁ and x₂ that lie in the same element $U_j$ of the partition $\mathcal{S}_i$, but in different elements of $\mathcal{S}_{i+1} = \mathcal{S}_i \vee w_0^{-1}(\mathcal{S}_i) \vee w_1^{-1}(\mathcal{S}_i)$. Then it must be that for σ = 0 or 1, $w_\sigma(x_1)$ lies in some $U_k \in \mathcal{S}_i$ while $w_\sigma(x_2)$ lies in $U_l$ with $U_k \ne U_l$. But then $U_j$ would be connected to both $U_k$ and $U_l$ by edges labeled σ, contradicting the assumption that $M_i$ was deterministic.

If the induced language is not regular, the machines $M_i$ will in theory continue to grow indefinitely in complexity. Of course, in practice we cannot implement the above procedure exactly for our trained networks but must instead use a discrete approximation. Following the analysis of Giles et al. (1992), we can approximate A₀ computationally along the lines of equation 3.1 as follows. First, "discretize" the system at resolution r by dividing the state space into $r^d$ points in a regular lattice and approximating w₀ and w₁ with functions ŵ₀ and ŵ₁ that act on these points such that ŵ_σ(x̂) is the nearest lattice point to w_σ(x̂). Each $Y_i$ will then be represented by a finite set $\hat{Y}_i$, and the condition $\hat{Y}_{i+1} \supseteq \hat{Y}_i$ guarantees that the procedure will terminate and produce a discrete approximation Â₀ of A₀.

In discrete form, the above procedure may be equated with the Hopcroft minimization algorithm (Hopcroft & Ullman, 1979), or the method of ε-machines (Crutchfield & Young, 1990; Crutchfield, 1994), and was first used in the present context by Giles et al. (1992). Using small values of r, their aim was to extract an FSA that might not model the network's behavior exactly but would model it closely enough to classify the training data faithfully. We had in mind a different purpose: to test empirically whether the network is regular (modeling an FSA) and to analyze its behavior in more fine-grained detail. The minimization procedure described above is bound to terminate in finite time due to the finite resolution of the representation.
Table 1: Epochs to Learning (L) and Regularity (R) for the Seven Tomita Languages.

Network               N1 (L/R)  N2 (L)  N3 (L)  N4 (L)  N4 (R)  N5 (L/R)  N6 (L)  N6 (R)  N7 (L)
Epochs to learning    200       600     400     200     —       800       200     —       600
Epochs to regularity  200       —       —       —       400     800       —       600     —
200 × 200 FSA size    2         341     272     2052    4       21        7123    3       264
500 × 500 FSA size    2         881     808     7248    4       7         39200   3       806
Tomita's FSA size     2         3       5       —       4       4         —       3       5
However, as the resolution is scaled up, the size of the largest FSA generated should in principle stabilize in the case of regular networks and grow rapidly in the case of nonregular ones. Therefore, as an empirical test of regularity, we perform these computations at two different resolutions (200 × 200 and 500 × 500) and report (in rows 4 and 5 of Table 1) the size (number of states) of the largest FSA generated. If the largest FSAs generated at the different resolutions are the same, we take this as an indication that the network is regular; if they are vastly different, that it is not regular. This allows us to probe the nature of the network's behavior as the training progresses. In some cases the network undergoes a phase transition to regularity at some point in its training, and we are able to measure with considerable precision the time at which this phase transition occurs. It should be stressed, however, that these discretized computations do not constitute a formal proof but merely provide strong empirical evidence as to the regularity or otherwise of the network and its induced language. More rigorous results may be found in Casey (1996, 1993), who described the general dynamical properties that a neural network or any dynamical system has to have in order to model an FSA robustly. Tiňo, Horne, Giles, and Collingwood (1995) also made a detailed analysis of networks with a two-dimensional state space under certain assumptions about the sign and magnitude of the weights.

It is important also to note that the regularity we are testing for is a property of the trained network and not an intrinsic property of the input strings, since the (necessarily finite) training data may be generalized in an infinite number of ways, each producing an equally valid induced language that may be regular or nonregular. Good symbolic algorithms already exist for finding a regular language compatible with given input data (Trakhtenbrot & Barzdin', 1973; Lang, 1992). Our purpose is rather to analyze the kinds of languages that a dynamical system such as a neural network is likely to come up with when trained on those data. We do not claim that our methods are efficient when scaled up to higher dimensions. Rather, we hope that a detailed analysis of networks in low dimensions will lead to a better understanding of the phenomenon of regular versus nonregular network behavior in general.
Finally, we remark that the functions w₀ and w₁ map X continuously into itself and as such define an iterated function system (IFS) (Barnsley, 1988), as noted in Kolen (1994). The attractor A of this IFS may be defined as follows:

1. Z₀ = X.
2. For i ≥ 0, $Z_{i+1} = w_0(Z_i) \cup w_1(Z_i)$ (by induction, $Z_{i+1} \subseteq Z_i$).
3. $A = \bigcap_{i \ge 0} Z_i$.

As with A₀, we can find a discrete approximation Â of A at any given resolution. We have found empirically that the convergence to the attractor was very rapid for our trained networks, so that Â₀ was equal to a subset of Â plus a very small number of "transient" points. For this reason we call A₀ a subattractor of the IFS. If w₀ and w₁ were contractive maps, A₀ would contain the whole of A (Barnsley, 1988), but in our case they are generally not contractive, so Â₀ may contain only part of Â. The close relationship between Â₀ and Â should make it possible to analyze the general family of languages accepted by the network as x₀ varies, though we do not pursue this line of inquiry here.

4 Results

Networks with the architecture described in section 2 were trained to recognize formal languages using backpropagation through time (Williams & Zipser, 1989; Rumelhart, Hinton, & Williams, 1986), with a modification similar to Quickprop (Fahlman, 1989)¹ and a learning rate of 0.03. The weights were updated in batch mode, and the perceptron weights $P_j$ were constrained by rescaling to the region where $\sum_{j=1}^{d} P_j^2 \le 1$. In contrast to Pollack (1991), where the backpropagation was truncated, we backpropagated through all levels of recurrence, as in Williams and Zipser (1989). In addition, we allowed the initial point x₀ to vary as part of the backpropagation, a modification that was also developed in independent work by Forcada and Carrasco (1995). Our seven groups of training strings (see the appendix) were copied exactly from Tomita (1982), except that we did not include the empty string in our training sets. Rows 4 and 5 of Table 1 show the size (number of states) of the largest FSA generated from the trained networks at two different resolutions
¹ Specifically, the cost function we used was

$$ E = -\frac{1}{2}(1+s)^2 \log\left( \frac{1+z}{1+s} \right) - \frac{1}{2}(1-s)^2 \log\left( \frac{1-z}{1-s} \right) + s(z - s) $$

(where z is the actual output and s the desired output), which leads to a delta rule of δ = (1 − sz)(s − z).
Figure 2: Subattractor, weights, and equivalent FSA for network N1.
by the methods described in section 3. For comparison, row 6 shows the size of the minimal FSAs that Tomita found by exhaustive search. A network can be said to have learned all its training data once the output is positive for all accept strings and negative for all reject strings (a max error of 1.0), but it makes sense to aim for a lower maximum error in order to provide a safety margin. For network N5, the maximum error reached its lowest value of 0.46 after 800 epochs. The other networks were trained until they achieved a maximum error of less than 0.4, the number of epochs required being shown in row 2 (epochs to learning). Since the test for regularity is computationally intensive, we did not test our networks at every epoch but only at intervals of 200 epochs. For network N1, the derived FSA is the same for both resolutions, providing evidence that the induced language is regular. For networks N2, N3, N4, N6, and N7 on the other hand, the maximum FSA grows dramatically as we increase the resolution, suggesting that the induced language is not regular. We continued to train these networks to see if they would later become regular and found that networks N4 and N6 became regular after 400 and 600 epochs, respectively, as indicated in row 3 (epochs to regularity), while networks N2, N3, and N7 remained nonregular even after 10,000 epochs. For N5, the size of the maximum FSA actually decreased with higher resolution from 21 to 7, suggesting that the induced language is regular but that high resolution is required to detect its regularity. As an illustration, Figure 2 shows the subattractor for N1, which learned its training data after 200 epochs. The axes measure the activations of hidden nodes x1 and x2 , respectively. The initial point x0 is indicated by a cross (upper-right corner). Also shown is the dividing line between accept points and reject points. Roughly speaking, w0 squashes the entire state space into
Figure 3: The phase transition of N4.
the lower-left corner, while w1 flattens it out onto the line x1 = x2 . The dividing line separates A0 into two pieces: Uacc in the upper-right corner and Urej in the lower left. w1 maps each piece back into itself, while w0 maps both pieces into Urej . The ε-machine method thus produces the two-state FSA shown at the right. (Final states are indicated by a double circle, the initial state by an open arrow.) In this simple case the FSA can be shown to model the network’s behavior exactly. Of course, N1 was trained on an extremely easy task that could have been learned with even fewer weights. Figure 3 shows network N4 captured at its phase transition. After 371 epochs (left) it has learned the training set, but the induced language is not regular. At the next epoch (right), it has become regular. Although the weights have been shifted by only 0.004 units in euclidean space, the dynamics of the two networks are quite different. Applied to the left network at 500 × 500 resolution, our analysis produced a series of FSAs of maximum size 2564. Applied to the right network, it terminated after three steps to produce the same four-state FSA that Tomita (1982) found by exhaustive search. The states of the FSA correspond to the following four subsets of A0 : U1 in the upper-left corner, U2 around (−0.6, −0.9), U3 barely visible at (0.4, −1.0), and U4 in the lower-right corner. The details of the successive subdivisions are outlined in Figure 4.
Figure 4: The analysis of N4.
Figure 5: N2, which is not regular, and N5, which is regular but with no obvious clusters.
The above examples would also be amenable to previous, clustering-based approaches because of the way they "partition [their] state space into fairly well-separated, distinct regions or clusters," as hypothesized by Giles et al. (1992). Those shown in Figure 5 seem to be trickier. The subattractor for N5 (right) appears to be a bunch of scattered points with no obvious clustering, yet our fine-grained analysis was able to extract from it a seven-node FSA, a little larger than the minimal FSA of four nodes Tomita found. Network N2 (left) seems to have induced a nonregular language. Figure 6 shows the first four iterations of analysis applied to it. Note that each FSA refines the previous one, bringing to light more details. Much can be learned about the induced language by examining these finite state approximations. For example, we can infer from M₂ that the network rejects all strings ending in 1, and from M₃ that it accepts all nonempty strings of the form (10)* but rejects any string ending in 110.
Figure 6: The first four steps of analysis applied to N2.
5 Conclusion

By allowing the decision boundary and the initial point to vary, our networks with two hidden nodes were able to induce languages from all the data sets of Tomita (1982) within a few hundred epochs.

Many researchers implicitly regard an extracted FSA as superior to the trained network from which it was extracted (Omlin & Giles, 1996) with regard to predictability, compactness of description, and the particular way each generalizes to classify new, unseen input strings. For this reason, earlier work in the field focused on extracting an FSA that approximates the
behavior of the network. However, that approach is imprecise if the network has induced a nonregular language and does not exactly model an FSA. We have provided a fine-grained analysis for a number of trained networks, both regular and nonregular, using an approach similar to the method of ε-machines that Crutchfield and Young (1990) used to analyze certain handcrafted dynamical systems. In particular, we were able to measure empirically whether the induced language was regular. The fact that several of the networks induced nonregular languages suggests a discrepancy between languages that are "simple" for dynamical recognizers and those that are "simple" from the point of view of automata theory (the regular languages). It is easier for these learning systems to fit the sparse data of the Tomita training sets by inducing a nonregular language than by inducing the expected minimal regular language. The use of comprehensive training sets, or intentional heuristics, might constrain networks away from these interesting dynamics.

It could be argued that the network and FSA ought to be seen on a more equal footing, since the 17 parameters of the network provide a compactness of description comparable to that of the FSA, and the language induced by the network is in principle on a par with that of the FSA in the sense that both generalize the same training data. We hope that further work in this direction may lead to a better understanding of network dynamics and help to clarify, compare, and contrast the relative merits of symbolic and dynamical systems.
Appendix: Tomita's Data Sets

N1 Accept: 1, 11, 111, 1111, 11111, 111111, 1111111, 11111111
N1 Reject: 0, 10, 01, 00, 011, 110, 11111110, 10111111

N2 Accept: 10, 1010, 101010, 10101010, 10101010101010
N2 Reject: 1, 0, 11, 00, 01, 101, 100, 1001010, 10110, 110101010

N3 Accept: 1, 0, 01, 11, 00, 100, 110, 111, 000, 100100, 110000011100001, 111101100010011100
N3 Reject: 10, 101, 010, 1010, 1110, 1011, 10001, 111010, 1001000, 11111000, 0111001101, 11011100110

N4 Accept: 1, 0, 10, 01, 00, 100100, 001111110100, 0100100100, 11100, 0010
N4 Reject: 000, 11000, 0001, 000000000, 11111000011, 1101010000010111, 1010010001, 0000, 00000

N5 Accept: 11, 00, 1001, 0101, 1010, 1000111101, 1001100001111010, 111111, 0000
N5 Reject: 0, 111, 010, 000000000, 1000, 01, 10, 1110010100, 010111111110, 0001, 011

N6 Accept: 10, 01, 1100, 101010, 111, 000000, 10111, 0111101111, 100100100
N6 Reject: 1, 0, 11, 00, 101, 011, 11001, 1111, 00000000, 010111, 101111011111, 1001001001

N7 Accept: 1, 0, 10, 01, 11111, 000, 00110011, 0101, 0000100001111, 00100, 011111011111, 00
N7 Reject: 1010, 00110011000, 0101010101, 1011010, 10101, 010100, 101001, 100100110101
Acknowledgments

This research was funded by a Krasnow Foundation Postdoctoral Fellowship, by ONR grant N00014-95-0173, and by NSF grant IRI-95-29298. We are indebted to David Wittenberg for helping to improve its presentation and to Mike Casey for stimulating discussions.

References

Barnsley, M. F. (1988). Fractals everywhere. San Diego, CA: Academic Press.

Casey, M. (1993). Computation dynamics in discrete-time recurrent neural networks. Proceedings of the Annual Research Symposium of UCSD Institute for Neural Computation, 78–95.

Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6).

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.

Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics and induction. Physica D, 75, 11–54.

Crutchfield, J. P., & Young, K. (1990). Computation at the onset of chaos. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information. Reading, MA: Addison-Wesley.

Das, S., & Mozer, M. C. (1994). A unified gradient-descent/clustering architecture for finite state machine induction. Neural Information Processing Systems, 6, 19–26.
Fahlman, S. E. (1989). Fast-learning variations on back-propagation: An empirical study. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38–51). San Mateo, CA: Morgan Kaufmann.

Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.

Frasconi, P., Gori, M., & Soda, G. (1995). Recurrent neural networks and prior knowledge for sequence processing: A constrained nondeterministic approach. Knowledge Based Systems, 8(6), 313–332.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.

Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the Eighth Conference of the Cognitive Science Society, Amherst, MA, 531–546.

Kolen, J. F. (1993). Fool's gold: Extracting finite state machines from recurrent network dynamics. Neural Information Processing Systems, 6, 501–508.

Kolen, J. F. (1994). Exploring the computational capabilities of recurrent neural networks. Unpublished doctoral dissertation, Ohio State University.

Lang, K. J. (1992). Random DFA's can be approximately learned from sparse uniform examples. Proc. Fifth ACM Workshop on Computational Learning Theory, 45–52.

Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1155–1173.

Omlin, C. W., & Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41.

Pollack, J. B. (1987). Cascaded back propagation on dynamic connectionist networks. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, 391–404.

Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural networks. Proceedings of the Fifth ACM Workshop on Computational Learning Theory, Pittsburgh, PA.

Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1995). Finite state machines and recurrent neural networks: Automata and dynamical systems approaches (Tech. Rep. No. UMIACS-TR-95-1). College Park, MD: Institute for Advanced Computer Studies, University of Maryland.

Tiňo, P., & Sajda, J. (1995). Learning and extracting initial mealy automata with a modular neural network model. Neural Computation, 7(4), 822.

Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. Proceedings of the Fourth Annual Cognitive Science Conference, Ann Arbor, MI, 105–108.
Trakhtenbrot, B. A., & Barzdin', Ya. M. (1973). Finite automata: Behavior and synthesis. Amsterdam: North-Holland.

Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite state languages using second-order recurrent networks. Neural Computation, 4(3), 406–414.

Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270.

Zeng, Z., Goodman, R. M., & Smyth, P. (1994). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990.
Received September 19, 1995; accepted July 30, 1996.
Communicated by David Haussler
A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

Michael Kearns
AT&T Research Labs, Murray Hill, NJ 07974, U.S.A.
We give a theoretical and experimental analysis of the generalization error of cross validation using two natural measures of the problem under consideration. The approximation rate measures the accuracy to which the target function can be ideally approximated as a function of the number of parameters, and thus captures the complexity of the target function with respect to the hypothesis model. The estimation rate measures the deviation between the training and generalization errors as a function of the number of parameters, and thus captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of the simplest form of cross validation. The bound clearly shows the dangers of making γ, the fraction of data saved for testing, too large or too small. By optimizing the bound with respect to γ, we then argue that the following qualitative properties of cross-validation behavior should be quite robust to significant changes in the underlying model selection problem:

• When the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of γ.

• The importance of choosing γ optimally increases, and the optimal value for γ decreases, as the target function becomes more complex relative to the sample size.

• There is nevertheless a single fixed value for γ that works nearly optimally for a wide range of target function complexity.

1 Introduction

In this article, we analyze the performance of cross validation in the context of model selection and complexity regularization.¹ We work in a setting

¹ Perhaps in conflict with accepted usage in statistics, where the method is often referred to as keeping a "hold-out" set, here we use the term cross validation to mean the simple method of saving out an independent test set to perform model selection. Precise definitions will be stated shortly.
Neural Computation 9, 1143–1161 (1997)
in which we must choose the right number of parameters for a hypothesis function in response to a finite training sample, with the goal of minimizing the resulting generalization error. There is a large and interesting literature on cross validation methods, which often emphasizes asymptotic statistical properties, or the exact calculation of the generalization error for simple models. (The literature is too large to survey here; foundational papers include Stone, 1974, 1977.) Our approach here is somewhat different and is primarily inspired by two sources. The first is the work of Barron and Cover (1991), who introduced the idea of bounding the error of a model selection method (in their case, the minimum description length principle) in terms of a quantity known as the index of resolvability. The second is the work of Vapnik (1982), who provided extremely powerful and general tools for uniformly bounding the deviation between the training and generalization error. (For related methods and problems, we refer the reader to Devroye (1988).) We combine these methods to give a new and general analysis of cross-validation performance.

In the first and more formal part of the article, we give a rigorous bound on the error of cross validation in terms of two parameters of the underlying model selection problem: the approximation rate and the estimation rate. Taken together, these two problem parameters determine our analog of the index of resolvability, and the work of Vapnik and others yields estimation rates applicable to many natural problems. In the second and more experimental part of the article, we investigate the implications of our bound for choosing γ, the fraction of data withheld for testing in cross validation. The most interesting aspect of this analysis is the identification of several qualitative properties of the optimal γ that appear to be invariant over a wide class of model selection problems.

2 The Formalism

We consider model selection as a two-part problem: choosing the appropriate number of parameters for the hypothesis function and tuning these parameters. The training sample is used in both steps of this process. In many settings, the tuning of the parameters is determined by a fixed learning algorithm such as backpropagation, and then model selection reduces to the problem of choosing the architecture (which determines the number of weights, the connectivity pattern, and so on). For concreteness, we adopt an idealized version of this division of labor. We assume a nested sequence of function classes H₁ ⊂ ··· ⊂ H_d ···, called the structure (Vapnik, 1982), where H_d is a class of boolean functions of d parameters, each function being a mapping from some input space X into {0, 1}. For simplicity, in this article we assume that the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971; Vapnik, 1982) of the class H_d is O(d). To remove this assumption, one simply replaces all occurrences of d in our bounds by the VC dimension of H_d. We assume that we have in our possession a learning
algorithm L that on input any training sample S and any value d will output a hypothesis function $h_d \in H_d$ that minimizes the training error over $H_d$; that is, $\epsilon_t(h_d) = \min_{h \in H_d} \{\epsilon_t(h)\}$, where $\epsilon_t(h)$ is the fraction of the examples in S on which h disagrees with the given label. In many situations, training error minimization is known to be computationally intractable, leading researchers to investigate heuristics such as backpropagation. The extent to which the theory presented here applies to such heuristics will depend in part on the extent to which they approximate training error minimization for the problem under consideration.

Model selection is thus the problem of choosing the best value of d. More precisely, we assume an arbitrary target function f (which may or may not reside in one of the function classes in the structure H₁ ⊂ ··· ⊂ H_d ···) and an input distribution P; f and P together define the generalization error function $\epsilon_g(h) = \Pr_{x \in P}[h(x) \ne f(x)]$. We are given a training sample S of f, consisting of m random examples drawn according to P and labeled by f (with the labels possibly corrupted by a noise process that randomly complements each label independently with probability η < 1/2). In many model selection methods, such as Rissanen's minimum description length (MDL) principle (Rissanen, 1989) and Vapnik's guaranteed risk minimization (Vapnik, 1982), for each value of d = 1, 2, 3, ... we give the entire sample S and d to the learning algorithm L to obtain the function $h_d$ minimizing the training error in $H_d$. A "penalty" for complexity (a function of d and m) is then added to the training errors $\epsilon_t(h_d)$ to choose among the sequence h₁, h₂, h₃, .... Whatever the method, we interpret the goal to be that of minimizing the generalization error of the hypothesis selected.

In this article, we will make the rather mild but very useful assumption that the structure has the property that for any sample size m, there is a value $d_{max}(m)$ such that $\epsilon_t(h_{d_{max}(m)}) = 0$ for any labeled sample S of m examples.² We call the function $d_{max}(m)$ the fitting number of the structure. The fitting number formalizes the simple notion that with enough parameters, we can always fit the training data perfectly, a property held by most sufficiently powerful function classes (including multilayer neural networks). We typically expect the fitting number to be a linear function of m, or at worst a polynomial in m. The significance of the fitting number for us is that no reasonable model selection method should choose $h_d$ for $d \ge d_{max}(m)$, since doing so simply adds complexity without reducing the training error.

In this article we concentrate on the simplest version of cross validation. Unlike the methods mentioned above, which use the entire sample for training the $h_d$, in cross validation we choose a parameter γ ∈ [0, 1], which determines the split between training and test data. Given the input sample S of m examples, let S′ be the subsample consisting of the first (1 − γ)m examples in S, and S″ the subsample consisting of the last γm examples. In
2
Weaker definitions for dmax (m) also suffice, but this is the simplest.
1146
Michael Kearns
cross validation, rather than giving the entire sample S to L, we give only the smaller sample S0 , resulting in the sequence h1 , . . . , hdmax ((1−γ )m) of increasingly complex hypotheses. Each hypothesis is now obtained by training on only (1 − γ )m examples, which implies that we will only consider values of d smaller than the corresponding fitting number dmax ((1 − γ )m); let us γ introduce the shorthand dmax for dmax ((1 − γ )m). Cross validation chooses the hd satisfying hd = mini∈{1,...,dγmax } {²t00 (hi )} where ²t00 (hi ) is the error of hi on the subsample S00 . Notice that we are not considering multifold cross validation, or other variants that make more efficient use of the sample, because our analyses will require the independence of the test set. However, we believe that many of the themes that emerge here may apply to these more sophisticated variants as well. We use ²cv (m) to denote the generalization error ²g (hd ) of the hypothesis hd chosen by cross validation when given as input a sample S of m random examples of the target function. Obviously, ²cv (m) depends on S, the structure, f , P, and the noise rate. When bounding ²cv (m), we will use the expression with high probability to mean with probability 1 − δ over the sample S, for some small fixed constant δ > 0. All of our results can also be stated with δ as a parameter at the cost of a log(1/δ) factor in the bounds, or in terms of the expected value of ²cv (m). 3 The Approximation Rate It is apparent that any nontrivial bound on ²cv (m) must take account of some measure of the complexity of the unknown target function f . The correct measure of this complexity is less obvious. Following the example of Barron and Cover’s (1991) analysis of MDL performance in the context of density estimation, we propose the approximation rate as a natural measure of the complexity of f and P in relation to the chosen structure H1 ⊂ · · · ⊂ Hd · · ·. Thus we define the approximation rate function ²g (d) to be ²g (d) = minh∈Hd {²g (h)}. The function ²g (d) tells us the best generalization error that can be achieved in the class Hd , and it is a nonincreasing function of d. If ²g (s) = 0 for some sufficiently large s, this means that the target function f , at least with respect to the input distribution, is realizable in the class Hs , and thus s is a coarse measure of how complex f is. More generally, even if ²g (d) > 0 for all d, the rate of decay of ²g (d) still gives a nice indication of how much representational power we gain with respect to f and P by increasing the complexity of our models. Still missing, of course, is some means of determining the extent to which this representational power can be realized by training on a finite sample of a given size, but this will be added shortly. First we give examples of the approximation rate that we will examine in some detail following the general bound on ²cv (m). 3.1 The Intervals Problem. In this problem, the input space X is the real interval [0, 1], and the class Hd of the structure consists of all boolean step
Bound on the Error of Cross Validation
1147
Three Approximation Rates 1
0.8
0.6
y
0.4
0.2
0
50
100
150 d
200
250
300
Figure 1: Plots of three approximation rates. For the intervals problem with target complexity s = 250 intervals (linear plot intersecting d-axis at 250), for the Perceptron problem with target complexity s = 150 nonzero weights (nonlinear plot intersecting d-axis at 150), and for power law decay asymptoting at ²min = 0.05.
functions over [0, 1] of at most d steps; thus, each function partitions the interval [0, 1] into at most d disjoint segments (not necessarily of equal width) and assigns alternating positive and negative labels to these segments. The input space is one-dimensional, but the structure contains arbitrarily complex functions over [0, 1]. It is easily verified that our assumption that the VC dimension of Hd is O(d) holds here, and that the fitting number obeys dmax (m) ≤ m. Now suppose that the input density P is uniform, and suppose that the target function f is the function of s alternating segments of equal width 1/s, for some s (thus, f lies in the class Hs ). We will refer to these settings as the intervals problem. Then the approximation rate is ²g (d) = (1/2)(1 − d/s) for 1 ≤ d < s and ²g (d) = 0 for d ≥ s (see Figure 1). Thus, as long as d < s, increasing the complexity d gives linear payoff in terms of decreasing the optimal generalization error. For d ≥ s, there is no payoff for increasing d, since we can already realize the target function. The reader can easily verify that if f lies in Hs but does not have equal width intervals, ²g (d) is still piecewise linear, but for d < s is concave up: the gain in approximation obtained by incrementally increasing d diminishes as d becomes larger. Although the intervals problem is rather simple and artifi-
1148
Michael Kearns
cial, a precise analysis of cross-validation behavior can be given for it, and we will argue that this behavior is representative of much broader and more realistic settings. 3.2 The Perceptron Problem. In this problem, the input space X is 0 are parameters capturing the rate of decay to ²min (see Figure 1). Our analysis suggests that the qualitative phenomena we identify are invariant to wide ranges of choices for these parameters. Note that for all three cases, there is a natural measure of target function complexity captured by ²g (d). In the intervals problem it is the number of target intervals, in the Perceptron problem it is the number of nonzero weights in the target, and in the more general power law case it is captured by the parameters of the power law. In the later part of the article, we will study cross-validation performance as a function of these complexity measures, and obtain remarkably similar predictions.
3 Since the bounds we will give have straightforward generalizations to real-valued function learning under squared error, examining behavior for ²g (d) in this setting seems reasonable.
Bound on the Error of Cross Validation
1149
4 The Estimation Rate For a fixed f , P and H1 ⊂ · · · ⊂ Hd · · ·, we say that a function ρ(d, m) is an estimation rate bound if for all d and m, with high probability over the sample S we have |²t (hd ) − ²g (hd )| ≤ ρ(d, m), where as usual hd is the result of training error minimization on S within Hd . Thus ρ(d, m) simply bounds the deviation between the training error and the generalization error of hd .4 Note that the best such bound may depend in a complicated way on all of the elements of the problem: f , P, and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume (Seung et al., 1992; Haussler, Kearns, Seung, & Tishby, 1994). However, for many natural problems it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: Namely, for any f , P, and any structure, the function ¶ µq (d/m) log(m/d) (4.1) ρ(d, m) = O is an estimation rate bound (Vapnik, 1982).5 Depending on the details of the problem, it is sometimes appropriate to omit the log(m/d) factor and p often appropriate to refine the d/m behavior topa function that interpolates smoothly between d/m behavior for small ²t to d/m for large ²t . Although such refinements are both interesting and important, many of the qualitative claims and predictions we will make are invariant to them as long as the deviation |²t (hd ) − ²g (hd )| is well approximated by a power law (d/m)α (α > 0); it will be more important to recognize and model the cases in which power law behavior is grossly violated. Note that this universal estimation rate bound holds only under the assumption that the training sample is noise free, but straightforward generalizations exist. For instance, if the training datapare corrupted by random label noise at rate 0 ≤ η < 1/2, then ρ(d, m) = O( (d/(1 − 2η)2 m) log(m/d)) is again a universal estimation rate bound. It is important to realize that once we have an estimation rate bound, it is a straightforward matter to derive a natural approach to the problem of model selection. Namely, since we have ²g (hd ) ≤ ²t (hd ) + ρ(d, m)
(4.2)
with high probability, if we assume equality in equation 4.2, we might be led to choose the value of d minimizing the quantity ²t (hd ) + ρ(d, m). This is 4 In the later part of the article, we will often assume that specific forms of ρ(d, m) are not merely upper bounds on this deviation but accurate approximations to it. 5 The results of Vapnik actually show the stronger result that |² (h) − ² (h)| = t g
p
O( (d/m) log(m/d)) for all h ∈ Hd , not only for the training error minimizer hd .
1150
Michael Kearns
exactly the approach taken in Vapnik’s guaranteed risk minimization (Vapnik, 1982), and it leads to very general and powerful bounds on generalization error for the approach just described. Our motivation here, however, is slightly different; we are interested in providing similar bounds for cross validation, primarily because it is an approach in wide experimental use. Thus, rather than assuming an estimation rate bound and examining the model selection method suggested by the minimization of equation 4.2, we are instead examining what can be said about cross validation if we assume certain approximation and estimation rates. After giving a general bound on ²cv (m) in which the approximation and estimation rate functions are parameters, we investigate the behavior of ²cv (m) (and more specifically, of the parameter γ ) for specific choices of these parameters. 5 The Bound Theorem 1. Let H1 ⊂ · · · ⊂ Hd · · · be any structure, where the VC dimension of Hd is O(d). Let f and P be any target function and input distribution, let ²g (d) be the approximation rate function for the structure with respect to f and P, and let ρ(d, m) be an estimation rate bound for the structure with respect to f and P. Then for any m, with high probability
²cv (m) ≤
minγ
γ log(d ) max , (5.1) ²g (d) + ρ(d, (1 − γ )m) + O γm ª
©
1≤d≤dmax
s
γ
where γ is the fraction of the training sample used for testing, and dmax is the fitting number dmax ((1−γ )m). Using the universal estimation bound rate of equation 4.1 and the rather weak assumption that dmax (m) is polynomial in m, we obtain that with high probability Ãs
( ²cv (m) ≤
minγ
1≤d≤dmax
Ãs
+O
²g (d) + O
!)
³m´ d log (1 − γ )m d !
(5.2)
log((1 − γ )m) . γm
Straightforward generalizations of these bounds for the case where the data are corrupted by classification noise can be obtained using the modified estimation rate bound given in section 4.6 6 The main effect of classification noise at rate η is the replacement of occurrences in the bound of the sample size m by the smaller effective sample size (1 − η)2 m.
Bound on the Error of Cross Validation
1151
Proof Sketch. We have space only to highlight the main ideas of the proof. γ For each d from 1 to dmax , fix a function fd ∈ Hd satisfying ²g ( fd ) = ²g (d); thus, fd is the best possible approximation to the target f within the class it can be shown that with high Hd . By a standard Chernoff bound argument q γ
γ
probability we have |²t ( fd ) − ²g ( fd )| ≤ log(dmax )/m for all 1 ≤ d ≤ dmax . This means q that within each Hd , the minimum training error ²t (hd ) is at most ²g (d) +
γ
log(dmax )/m. Since ρ(d, m) is an estimation rate bound, we have γ
that with high probability, for all 1 ≤ d ≤ dmax , ²g (hd ) ≤ ²t (hd ) + ρ(d, m) q γ ≤ ²g (d) + log(dmax )/m + ρ(d, m).
(5.3) (5.4) γ
Thus we have bounded the generalization error of the dmax hypotheses hd , only one of which will be selected. If we knew the actual values of these generalization errors (equivalent to having an infinite test sample in cross γ validation), we could bound our error by the minimum over all 1 ≤ d ≤ dmax of the expression 5.4. However, in cross validation we do not know the exact values of these generalization errors but must instead use the γ m testing examples to estimate them. Again q by standard Chernoff bound arguments, γ
this introduces an additional log(dmax )/(γ m) error term, resulting in our final bound. This concludes the proof sketch. In the bounds given by equations 5.1 and 5.2, the min{·} expression is analogous to Barron and Cover’s (1991) index of resolvability. The final term in the bounds represents the error introduced by the testing phase of cross validation. These bounds exhibit trade-off behavior with respect to the parameter γ . As we let γ approach 0, we are devoting more of the sample to training the hd , and the estimation q rate term ρ(d, (1 − γ )m) is decreasing. γ
However, the test error term O( log(dmax )/(γ m)) is increasing, since we have fewer data to estimate the ²g (hd ) accurately. The reverse phenomenon occurs as we let γ approach 1. While we believe theorem 1 to be enlightening and potentially useful in its own right, we would now like to take its interpretation a step further. More precisely, suppose we assume that the bound is an approximation to the actual behavior of ²cv (m). Then in principle we can optimize the bound to obtain the best value for γ . Of course, in addition to the assumptions involved—the main one being that ρ(d, m) is a good approximation to the training generalization error deviations of the hd —this analysis can only be carried out given information that we should not expect to have in practice (at least in exact form)—in particular, the approximation rate function ²g (d), which depends on f and P. However, we argue in the coming sections that several interesting qualitative phenomena regarding the choice of γ are largely invariant to a wide range of natural behaviors for ²g (d).
1152
Michael Kearns
6 A Case Study: The Intervals Problem We begin by performing the suggested optimization of γ for the intervals problem. Recall that the approximation rate here is ²g (d) = (1/2)(1−d/s) for d < s and ²g (d) = 0 for d ≥ s, where s is the complexity of the target function. Here we analyze the behavior obtainedpby assuming that the estimation rate ρ(d, m) actually behaves as ρ(d, m) = d/(1 − γ )m (so we are omitting the log factor from the universal bound),7 and to simplify the formal analysis apbit (but without changing the qualitative behavior) we replace the term p log((1 − γ )m)/(γ m) by the weaker log(m)/m. Thus, if we define the function q p (6.1) F(d, m, γ ) = ²g (d) + d/(1 − γ )m + log(m)/(γ m) then following equation 5.1, we are approximating ²cv (m) by ²cv (m) ≈ min1≤d≤dγmax {F(d, m, γ )}.8 The first step of the analysis is to fix a value for γ and differentiate F(d, m, γ ) with respect to d to discover the minimizing value of d. This differentiation must have two regimes due to the discontinuitypat d = s in ²g (d). It is easily verified that the derivative is −(1/2s) + 1/(2 d(1 − γ )m) p for d < s and 1/(2 d(1 − γ )m) for d ≥ s. It can be shown that provided that (1 − γ )m ≥ 4s then d = s is a global minimum of F(d, m, γ ), and if this condition is violated, then the value we obtain for ²cv (m) is vacuously large anyway (meaning that this fixed choice of γ cannot end up being the optimal one, or, if m < 4s, that our analysis claims we simply do not have enough data for nontrivial generalization, regardless of how we split it between training and p p testing). Plugging in d = s yields ²cv (m) ≈ F(s, m, γ ) = s/(1 − γ )m + log(m)/(γ m) for this fixed choice of γ . Now by differentiating F(s, m, γ ) with respect to γ , it can be shown that the optimal choice of γ under the assumptions is γopt =
(log(m)/s)1/3 . 1 + (log(m)/s)1/3
(6.2)
It is important to remember at this point that despite the fact that we have derived a precise expression for γopt , due to the assumptions and approxima7 It can be argued that this power law estimation rate is actually a rather accurate approximation for the true behavior of the training generalization deviations of the hd for this particular problem at moderate classification noise levels. For very small noise rates, a better approximation would replace the two square root terms by linear terms. 8 Note that although there are hidden constants in the O(·) notation of theorem 1, it is the relative weights of the estimation and test error terms that are important when we optimize equation 5.1 for γ . In the definition of F(d, m, γ ) we have chosen both terms to have constant multipliers equal to 1, which is reasonable because both terms have the same Chernoff bound origins.
Bound on the Error of Cross Validation
1153
cv error bound, intervals, d = s = 1000 slice, m=10000 0.5
0.4
0.3
y
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 2: Plot of the predicted generalization error of cross validation for the intervals model selection problem, as a function of the fraction 1 − γ of data used for training. (In the plot, the fraction of training data is 0 on the left (γ = 1) and 1 on the right (γ = 0).) The fixed sample size m = 10, 000 was used, and the six plots show the error predicted by the theory for target function complexity values s = 10 (bottom plot), 50, 100, 250, 500, and 1000 (top plot).
tions we have made in the various constants, any quantitative interpretation of this expression is meaningless. However, we can reasonably expect that this expression captures the qualitative way in which the optimal γ changes as the amount of data m changes in relation to the target function complexity s. On this score, the situation initially appears rather bleak, as the function (log(m)/s)1/3 /(1 + (log(m)/s)1/3 ) is quite sensitive to the ratio log(m)/s. For example, for m =10,000, if s = 10 we obtain γopt = 0.524 · · ·, if s = 100 we obtain γopt = 0.338 · · ·, and if s = 1000 we obtain γopt = 0.191 · · ·. Thus γopt is becoming smaller as log(m)/s becomes small, and the analysis suggests vastly different choices for γopt depending on the target function complexity, which is something we do not expect to have the luxury of knowing in practice. However, it is both fortunate and interesting that γopt does not tell the entire story. In Figure 2, we plot the function F(s, m, γ )9 as a function of γ 9
In the plots, we now use the more accurate test penalty term since we are no longer concerned with simplifying the calculation.
p
log((1 − γ )m)/(γ m)
1154
Michael Kearns
for m = 10,000 and for several different values of s (note that for consistency with the later experimental plots, the x-axis of the plot is actually the training fraction 1 − γ ). Here we can observe four important qualitative phenomena, which we list in order of increasing subtlety: 1. When s is small compared to m, the predicted error is relatively insensitive to the choice of γ . As a function of γ , F(s, m, γ ) has a wide, flat bowl, indicating a wide range of γ yielding essentially the same near-optimal error. 2. As s becomes larger in comparison to the fixed sample size m, the relative superiority of γopt over other values for γ becomes more pronounced. In particular, large values for γ become progressively worse as s increases. For example, the plots indicate that for s = 10 (again, m = 10,000), even though γopt = 0.524 · · · the choice γ = 0.75 will result in error quite near that achieved using γopt . However, for s = 500, γ = 0.75 is predicted to yield greatly suboptimal error. Note that for very large s, the bound predicts vacuously large error for all values of γ , so that the choice of γ again becomes irrelevant. 3. Because of the insensitivity to γ for when s is small compared to m, there is a fixed value of γ that seems to yield reasonably good performance for a wide range of values for s. This value is essentially the value of γopt for the case where s is large but nontrivial generalization is still possible, since choosing the best value for γ is more important there than for the small s case. 4. The value of γopt is decreasing as s increases. This is slightly difficult to confirm from the plot, but can be seen clearly from the precise expression for γopt . Despite the fact that the analysis so far has been rather specialized (addressing the behavior for a fixed structure, target function, and input distribution), it is our belief that the four phenomena just listed hold for many other model selection problems. For instance, we shall shortly demonstrate that the theory again predicts that these phenomena hold for the Perceptron problem and for the case of power law decay of ²g (d) described earlier. First we give an experimental demonstration that at least the predicted properties 1, 2, and 3 truly do hold for the intervals problem. In Figure 3, we plot the results of experiments in which labeled random samples of size m = 5000 were generated for a target function of s equal width intervals, for s =10,100 and 500. The samples were corrupted by random label noise at rate η = 0.3. For each value of γ and each value of d, (1 − γ )m of the sample was given to a program performing training error minimization within
Bound on the Error of Cross Validation
1155
err vs train set size, s=10,100,500, 30% noise, m=5000 0.5
0.4
0.3
0.2
0.1
00
1000
2000
3000
4000
5000
trainsize
Figure 3: Experimental plots of cross-validation generalization error in the intervals problem as a function of training set size (1 − γ )m. Experiments with the three target complexity values s =10,100, and 500 (bottom plot to top plot) are shown. Each point represents performance averaged over 10 trials.
Hd .10 The remaining γ m examples were used to select the best hd according to cross validation. The plots show the true generalization error of the hd selected by cross validation as a function of γ (the generalization error can be computed exactly for this problem). Each point in the plots represents an average over 10 trials. While there are obvious and significant quantitative differences between these experimental plots and the theoretical predictions of Figure 2 (indeed, even apparently systematic differences), the properties 1, 2, and 3 are rather clearly borne out by the data: 1. In Figure 3, when s is small compared to m, there is a wide range of acceptable γ ; it appears that any choice of γ between 0.10 and 0.50 yields nearly optimal generalization error. 2. By the time s = 100, the sensitivity to γ is considerably more pronounced. For example, the choice γ = 0.50 now results in clearly 10 A nice feature of the intervals problem is that training error minimization can be performed in almost linear time using a dynamic programming approach (Kearns, Mansour, Ng, & Ron, 1995).
1156
Michael Kearns
suboptimal performance, and it is more important to have γ close to 0.10. 3. Despite these complexities, there does indeed appear to be a single value of γ —approximately 0.10—that performs nearly optimally for the entire range of s examined. The fourth property—that the optimal γ decreases as the target function complexity is increased relative to a fixed m—is certainly not refuted by the experimental results, but any such effect is simply too small to be verified. It would be interesting to verify this prediction experimentally, perhaps on a different problem where the predicted effect is more pronounced.
7 Power Law Decay and the Perceptron Problem For the cases where the approximation rate ²g (d) obeys either power law decay or is that derived for the Perceptron problem discussed in section 3, the behavior of ²cv (m) as a function of γ predicted by our theory is largely the same. For example, if ²g (d) = (c/d), and we use the standard estimation p rate ρ(d, m) = d/(1 − γ )m, then an analysis similar to that performed for the intervals problem reveals that for fixed γ , the minimizing choice of d is d = (4c2 (1 − γ )m)1/3 ; plugging this value of d back into the bound on ²cv (m) and plotting for various values of c yields Figure 4. Similar to Figure 2 for the intervals problem, Figure 4 shows the predicted behavior of ²cv (m) as a function of γ for a fixed m and several different choices for the complexity parameter c. We again see that properties 1 through 4 hold strongly despite the change in ²g (d) from the intervals problem, although quantitative aspects of the prediction, which already must be taken lightly for reasons previously stated, have obviously changed, such as the interesting values of the ratio of sample size to target function complexity (that is, the values of this ratio for which the generalization error is most sensitive to the choice of γ ). Through a combination of formal analysis and plotting, it is possible to demonstrate that the properties 1 through 4 are robust to wide variations in the parameters α and ²min in the parametric form ²g (d) = (c/d)α + ²min , as well as wide variations in the form of the estimation rate ρ(d, m). For example, if ²g (d) = (c/d)2 (faster than the approximation rate examined above) and ρ = d/m (faster than the estimation rated examined above), then for the interesting ratios of m to c (that is, where the generalization error predicted is bounded away from 0 and the trivial value of 1/2), a figure quite similar to Figure 4 is obtained. Similar predictions can be derived for the Perceptron problem using the universal estimation rate bound or any similar power law form.
Bound on the Error of Cross Validation
1157
cv bound, (c/d) for c from 1.0 to 150.0, m=25000 0.5
0.4
0.3
y
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 4: Plot of the predicted generalization error of cross validation for the power law case ²g (d) = (c/d), as a function of the fraction 1 − γ of data used for training. The fixed sample size m =25,000 was used, and the six plots show the error predicted by the theory for target function complexity values c = 1 (bottom plot), 25, 50, 75, 100, and 150 (top plot).
8 Backpropagation Experiments As a final test of the predictions made by the theory presented here, we describe some experiments using the backpropagation learning algorithm for neural networks (see e.g., Rumelhart, Hinton, & Williams, 1986). In the first set of experiments (see Figure 5), backpropagation was used to train a neural network with two input units, eight hidden units, and a single output unit. The data were generated according to a nearest-neighbor function over the unit square in the real plane. More precisely, the target function was defined by s randomly chosen points (exemplars) in the unit square, along with randomly chosen labels for these exemplars. Any input in the unit square is then assigned the label of the nearest exemplar (with respect to Euclidean distance). Thus s is a measure of the complexity of the target function analogous to those for the intervals and Perceptron problems. For each such target function, 100 examples were generated and corrupted by random classification noise at rate η = 0.15. These examples were then divided into a training set and a testing set, which in this case was used to determine the apparent best number of backpropagation training epochs
1158
Michael Kearns err vs train fraction, backprop with increasing s 0.5
0.4
0.3
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 5: Experimental plots of cross-validation generalization error as a function of training set fraction (1 − γ ) for backpropagation, where m = 100 and the test set is used to determine the number of training epochs. Experiments with many different nearest-neighbor target functions are shown, with the number of exemplars s in the target function increasing from the bottom plot (s = 1, plot lies along horizontal axis) to the top plot (s = 128). Each point represents performance averaged over 300 trials.
to perform. For the purposes of the experiment, the generalization error of the chosen network was then estimated using an additional set of 8000 examples. Note that we are deviating here from the formal framework in which we began. Rather than explicitly searching for a good value of complexity in a nested sequence of increasingly complex hypothesis classes, here we are training within a single, fixed architecture. Intuitively, we are assuming that increased training time within this fixed architecture leads to increased effective complexity, which must be regulated by cross validation. This intuition, while still in need of rigorous theoretical and experimental examination, appears to have the status of a folk result within the neural network research community. Despite the deviation from our initial framework, the results shown in Figure 5 seem to confirm properties 1 through 3: sensitivity to the trainingtest split is relatively low for small s but increases as s becomes larger (until, as usual, s is sufficiently large to preclude nontrivial generalization for any split). Nevertheless, a fixed value for the split—in this case, somewhere
Bound on the Error of Cross Validation
1159
err vs train fraction, backprop with increasing noise 0.5
0.4
0.3
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 6: Experimental plots of cross-validation generalization error as a function of training set fraction (1 − γ ) for backpropagation, where m = 100 and the test set is used to determine the number of training epochs. Experiments with a fixed nearest-neighbor target function with s = 2 exemplars are shown, but with the noise rate increasing from the bottom plot (η = 0.0) to the top plot (η = 0.45). Each point represents performance averaged over 300 trials.
between 20 and 30 percent devoted for testing—seems to give good performance for all s. Similar results held in the experiments depicted by Figure 6. Here the nearest-neighbor target function was fixed to have s = 2 exemplars (with different labels), but we varied the classification noise rate η.11 Again the cross-validation test set was used to determine the number of backpropagation training epochs, and in Figure 6 we plot the resulting generalization error as a function of the training-test split for many different values of the η. Thus, in this case we are increasing the complexity of the data by adding noise. Despite the philosophical differences between adding deterministic complexity and adding noise, we see that the outcome is similar and again provides coarse confirmation of the theory.
11
Note that for the s = 2 case, the target function is actually linearly separable.
1160
Michael Kearns
9 Conclusions In summary, our theory predicts that although significant quantitative differences in the behavior of cross validation may arise for different model selection problems, the properties 1 through 4 should be present in a wide range of problems. At the very least, the behavior of our bounds exhibits these properties for a wide range of problems. It would be interesting to try to identify natural problems for which one or more of these properties is strongly violated; a potential source for such problems may be those for which the underlying learning curve deviates from classical power law behavior (Seung et al., 1992; Haussler et al., 1994). Perhaps the greatest weakness of the results presented here is in our failure to treat multifold cross validation. The difficulty lies in our apparent need for independence between the training and test sets to apply our methods. We leave the extension to multifold approaches for future research.
Acknowledgments I give warm thanks to Yishay Mansour, Andrew Ng, and Dana Ron for many enlightening conversations on cross validation and model selection. Additional thanks to Andrew Ng for his help in conducting the experiments.
References Barron, A. (1990). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 19:930–944. Barron, A. R., & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory 37:1034–1054. Devroye, L. (1988). Automatic pattern recognition: A study of the probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 10(4):530–543. Haussler, D., Kearns, M., Seung, H. S., & Tishby, N. (1994). Rigourous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Confernce on Computational Learning Theory (pp. 76–87). Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature 323:533–536. Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A 45:6056–6091. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B 36:111–147.
Bound on the Error of Cross Validation
1161
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika 64(1):29–35. Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag. Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16(2):264–280.
Received January 4, 1996; accepted October 1, 1996.
Communicated by Leo Breiman
Averaging Regularized Estimators Michiaki Taniguchi Volker Tresp Siemens AG, Central Research, 81730 Munich, Germany
We compare the performance of averaged regularized estimators. We show that the improvement in performance that can be achieved by averaging depends critically on the degree of regularization which is used in training the individual estimators. We compare four different averaging approaches: simple averaging, bagging, variance-based weighting, and variance-based bagging. In any of the averaging methods, the greatest degree of improvement—if compared to the individual estimators—is achieved if no or only a small degree of regularization is used. Here, variance-based weighting and variance-based bagging are superior to simple averaging or bagging. Our experiments indicate that better performance for both individual estimators and for averaging is achieved in combination with regularization. With increasing degrees of regularization, the two bagging-based approaches (bagging and variance-based bagging) outperform the individual estimators, simple averaging, and variance-based weighting. Bagging and variance-based bagging seem to be the overall best combining methods over a wide range of degrees of regularization. 1 Introduction Several authors have noted the advantages of averaging estimators that were trained on either identical training data (Perrone, 1993) or bootstrap samples of the training data (a procedure termed bagging predictors by Breiman, 1994). Both theory and experiments show that averaging helps most if the errors in the individual estimators are not positively correlated and if the estimators have only small bias. On the other hand, it is well known from theory and experiment that the best performance of a single predictor is typically achieved if some form of regularization (weight decay), early stopping, or pruning is used. All three methods tend to decrease variance and increase bias of the estimator. Therefore, we expect that the optimal degrees of regularization for a single estimator and for averaging would not necessarily be the same. In this article, we investigate the effect of regularization on averaging. In addition to simple averaging and bagging, we also perform experiments using combining principles where the weighting functions are dependent on the input. The weighting functions Neural Computation 9, 1163–1178 (1997)
c 1997 Massachusetts Institute of Technology °
1164
Michiaki Taniguchi and Volker Tresp
can be derived by estimating the variance of each estimator for a given input (variance-based weighting, Tresp & Taniguchi, 1995; variance-based bagging, Taniguchi & Tresp, 1995). In the next section we derive some fundamental equations for averaging biased and unbiased estimators. In section 3 we show how the theory can be applied to regression problems, and we introduce the different averaging methods used in the experiments described in section 4. In section 5 we discuss the results, and in section 6 we present conclusions. 2 A Theory of Combining Biased Estimators 2.1 Optimal Weights for Combining Biased Estimators. We would like to estimate the unknown variable t based on the realizations of a set of M . The expected squared error between fi and t, random variables { fi }i=1 E( fi − t)2
=
E( fi − mi + mi − t)2
=
E( fi − mi )2 + E(mi − t)2 + 2E(( fi − mi )(mi − t))
=
vari + b2i ,
decomposes into the variance vari = E( fi − mi )2 , and the square of the bias bi = mi − t with mi = E( fi ). E(.) stands for the expected value. Note that E[( fi − mi )(mi − t)] = (mi − t)E( fi − mi ) = 0. In the following we are interested in estimating t by forming a linear combination of the fi , tˆ =
M X
gi fi = g0 f,
i=1
where f = ( f1 , . . . , fM )0 and the weighting vector g = (g1 , . . . , gM )0 . The expected error of the combined system is (Meir, 1995) E(tˆ − t)2
=
E(g0 f − E(g0 f ))2 + E(E(g0 f ) − t)2
=
E(g0 ( f − E( f )))2 + E(g0 m − t)2
=
g0 Äg
+
(g0 m
−
(2.1)
t)2 ,
where Ä is an M × M covariance matrix with Äij = E[( fi − mi )( fj − mj )] and with m = (m1 , . . . , mM )0 . The expected error of the combined system is minimized for1 g∗ = (mm0 + Ä)−1 tm. 1 Interestingly, even if the estimators are unbiased, that is, m = t ∀i = 1, . . . , M, the i minimum error estimator is biased, which confirms that a biased estimator can have a
Averaging Regularized Estimators
1165
2.2 Constraints. A commonly used constraint, which we also use in our experiments, is that M X
gi = 1,
gi ≥ 0, i = 1, . . . , M.
i=1
In the following, g can be written as g = (u0 h)−1 h,
(2.2)
where u = (1, . . . , 1)0 is an M-dimensional vector of ones, h = (h1 , . . . , hM )0 , and hi > 0, ∀i = 1, . . . , M. The constraint can be enforced in minimizing equation 2.1 by using the Lagrangian function, L = g0 Äg + (g0 m − t)2 + µ(g0 u − 1), with Lagrange multiplier µ. The optimum is achieved if we set (Tresp & Taniguchi, 1995) h∗ = [Ä + (m − tu)(m − tu)0 ]−1 u. Now the individual biases (mi − t) appear explicitly in the weights. For the optimal weights, E(tˆ − t)2 =
1 . u0 (Ä + (m − tu)(m − tu)0 )−1 u
Note that by using the constraint in equation 2.2, the combined estimator is unbiased if the individual estimators are unbiased, which is the main reason for employing the constraint. With unbiased estimators we obtain h∗ = Ä−1 u, and for the optimal weights, E(tˆ − t)2 = (u0 Ä−1 u)−1 . If in addition the individual estimators are uncorrelated, we obtain h∗i =
1 . vari
smaller expected error than an unbiased estimator. For example, consider the case that M = 1 and m1 = t (no bias). Then g∗ = t2 /(t2 + var1 ). This term is smaller than one if t 6= 0 and var1 > 0. Then, E(tˆ) = E(g∗ f1 ) < m1 ; that is, the minimum expected error estimator is biased.
1166
Michiaki Taniguchi and Volker Tresp
3 Averaging Regularized Estimators 3.1 Training. The previous theory can be applied to the problem of function estimation. Let us assume we have a training data set L = {(xk , yk )}Kk=1 , xk ∈
K X
(yk − fi (xk ))2 + λ
k=1
J X
w2ij ,
i = 1, . . . , M,
(3.1)
j=1
J
where {wij }j=1 are the weights in the ith neural estimator and J is the number of weights in each network. The first term is the squared error between the prediction of the neural network and the target in the training data L, and the second term is a weight-decay penalty weighted by the regularization parameter λ ≥ 0. Weight decay is commonly used to improve network performance by decreasing variance in prediction for the cost of introducing bias. It is obvious that averaging is useful only if the individual estimators differ in their prediction. Neural networks trained on an identical data set vary only because the optimization was initialized with different random initial weights and the optimization procedure terminates in different local minima. To decorrelate the individual estimators further, Breiman (1994) suggested training the estimators using bootstrap replicates LB1 , . . . , LBM , a procedure Breiman calls bagging predictors. The bootstrap replicate LBi is generated by randomly sampling K-times from the original training data set L with replacement. For background on bootstrap techniques, see Efron and Tibshirani (1993). In the experiment we considered combined estimators, which can be written as tˆ(x) =
M X i=1
gi (x) fi (x) =
M 1 X hi (x) fi (x), n(x) i=1
(3.2)
PM where n(x) = i=1 hi (x) is the normalizing factor and hi (x) ≥ 0, ∀i = 1, . . . , M. Note that we allow for the possibility that the weighting func2 In the experiments, we used standard multilayer Perceptrons with one hidden layer. For details, see section 4.
Averaging Regularized Estimators
1167
tions gi (x) depend on the input x. Also, note that it follows that (compare section 2.2) M X
gi (x) = 1,
gi (x) ≥ 0, i = 1, . . . , M.
i=1
In the experiments, we compare the performances of the combined systems for different choices of gi (x). Since in equation 3.2 averaging is performed for a fixed input x, we can apply the theory of section 2 to calculate the optimal weighting functions. Unfortunately, it is not straightforward to obtain reliable estimates of the covariance matrices and the biases. We therefore have to rely on experiments to decide if averaging is useful and in which cases which averaging method should be employed. In the experiments, we use four different combining methods, motivated by the theoretical results in section 2. The four combining methods are described in the following sections. 3.2 Simple Averaging (AV). In simple averaging, we set hi (x) = 1,
i = 1, . . . , M for all x.
It is easy to come up with examples where simple averaging is optimal— for example, if all estimators are unbiased and uncorrelated with identical variances or in general when symmetry indicates that no single estimator should be preferred. Past experiments have shown that simple averaging can improve performance considerably. For experimental results and theoretical background of simple averaging, see Perrone (1993), Jacobs (1995), Krogh and Vedelsby (1995), and Wolpert (1992). 3.3 Bagging (BA). The only difference to AV is that the individual networks are trained on bootstrap replicates. Set again hi (x) = 1,
i = 1, . . . , M for all x.
Although the individual estimators are less correlated, we expect that each individual estimator contains more variance (and possibly more bias) because the training set contains fewer distinct data if compared to the case where each estimator is trained on the complete data set. 3.4 Variance-Based Weighting (VW). In variance-based weighting we set hi (x) =
1 , var( fi (x))
i = 1, . . . , M.
The intuitive idea of the variance-based approach is that when an estimator is uncertain about its own prediction for a certain input, then this estimator is not competent for this input and obtains a low weight. From our
1168
Michiaki Taniguchi and Volker Tresp
theoretical considerations, it is clear that this is optimal if the errors in the individual estimators are uncorrelated and the estimators are unbiased. The errors of estimators trained on identical data or on bootstrap replicates are very likely correlated and weight decay introduces bias such that— from a theoretical point of view—variance-based weighting is not optimal. Again, experimental results have to show under which conditions variance-based weighting is useful. Also, for calculation of the variance, it is important to be clear about which random process we average. If the expected value is taken over all random initial weights, the variances of all estimators are identical and we would obtain simple averaging. On the other hand, if we consider that each estimator has found a different local minimum, we can consider each estimator as a distinct model. Now we only average over the noise on the targets. We consider that the targets were generated by the random process yk = t(xk ) + γ , where t(xk ) = E(yk |xk ) and γ is independent zero-mean noise with variance σ 2. The variance of an individual predictor for input x can be estimated by a number of different methods (Tibshirani, 1994). We use var( fi (x)) ≈ σ 2 θi (x)T Hi−1 θi (x), where θi (x) = ∂ fi (x)/∂wi is the output sensitivity of the neural estimator fi (x) with respect to the weights wi at the input x, and where wi = (wi1 , . . . , wiJ )0 are the weights in the ith estimator. Hi is the Hessian, which can be approximated (for the unregularized estimator) as Hi ≈
T K X ∂ fi (xk ) ∂ fi (xk ) k=1
∂wi
∂wi
.
(3.3)
xk is the kth sample in the training data set L. 3.5 Variance-Based Bagging (VB). The only difference to VW is that M of the the neural networks are trained on the bootstrap replicates {LBi }i=1 original training set L. The Hessian matrix in equation 3.3 is now calculated using the bootstrap samples. 4 Experiments In this section, we present experimental results using two real-world data sets: the Breast Cancer data and the DAX data.3 The first data set can be obtained from the University of California, Irvine repository (ftp:ics.uci.edu/ 3 The DAX (Deutschen Aktien Index) represents the average value of the stock of a set of representative companies, analogous to the Dow Jones Index in the United States.
Averaging Regularized Estimators
1169
pub/machine-learning-databases). The second data set can be obtained by contacting the authors. We compare the four different combining methods described in the previous sections using these two databases with varying degrees of regularization. In the experiments M = 25 neural networks were combined. All neural networks had the same fixed architecture: a Multilayer Perceptron with a single hidden layer of 10 hidden units. The initial weights of the estimators were chosen randomly out of a uniform distribution between −0.2 and +0.2. We divided the databases randomly in two independent sets: the training set L and the test set T. For AV and VW, each neural estimator was trained on the whole training set. For the bagging-based approaches BA and VB, each estimator was trained on the bootstrap replicates LBi of the training set L. For training, a quasi-Newton-method with a fixed number of iterations (400 for the Breast Cancer data and 300 for the DAX data) was used. To obtain statistically significant results, we repeated each experiment 10 times (R = 10 runs) for both data sets. For the Breast Cancer data with relatively few data, we chose a different division into test data and training data for each run. 4.1 Performance Criteria. The averaged summed squared error ASSEC (λ) is the squared test set error of the individual estimators trained on complete data and with weight decay parameter λ averaged over all M estimators and averaged over all R runs. ASSEB (λ) is the equivalent measure for networks trained on bootstrap samples. ASSEcomb (λ) with comb ∈ {AV, BA, VW, VB} is the squared test set error of the combining methods with weight decay parameter λ averaged over all R runs. Furthermore, we define ASTDC (λ) as the averaged standard deviation of the prediction of the neural networks trained with complete data (averaged over all estimators and all runs). ASTDB (λ) is the equivalent measure for networks trained on bootstrap samples. The mathematical formulas describing the different measures can be found in appendix A. 4.2 Breast Cancer Data. The database has been recorded at the University of Wisconsin Hospitals, Madison. The data set contains 699 samples with 9 input variables consisting of cellular characteristics and one binary output with 458 benign and 241 malignant cases. All input variables were normalized to zero mean and a standard deviation of one. For every run we divided the database randomly in K = 599 training samples and P = 100 test samples. Figure 1 (top) shows the averaged performance of the individual estimators as a function of the regularization parameter λ. The large values of both ASSEC (λ) and ASSEB (λ) for λ = 0 indicate that neural networks without regularization extremely overfit the data. For λ ≈ 1.25 the networks trained
1170
Michiaki Taniguchi and Volker Tresp
8
7
6
ASSE
5
4
3
2
1 0
0.5
1
1.5 lambda
2
2.5
3
0.14
0.12
ASTD
0.1
0.08
0.06
0.04
0.02
0 0
0.5
1
1.5 lambda
2
2.5
3
Figure 1: (Top) ASSEC (solid line) and ASSEB (dash-dotted line) as a function of λ for the Breast Cancer data. For λ = 0, ASSEC = 746 and ASSEB = 133 are out of scale. (Bottom) ASTDC (solid line) and ASTDB (dash-dotted line) as a function of λ for the Breast Cancer data. For λ = 0, ASTDC = 0.49 and ASTDB = 0.55 are out of scale.
on complete data obtain best performance. ASSEB (λ) reaches a minimum at λ = 2. As expected, ASSEB (λ) is always larger than ASSEC (λ) since the networks trained on bootstrap replicates have seen fewer distinct data. The difference between ASSEB (λ) and ASSEC (λ) decreases with increasing λ. Figure 1 (bottom) shows ASTDC (λ) and ASTDB (λ). Both decrease monotonously with growing λ, and ASTDB (λ) is consistently larger. Even for large λ, ASTDB (λ) is still greater than 0.02, whereas ASTDC (λ) is close to
Averaging Regularized Estimators
1171
zero. The figure clearly demonstrates that bagging increases variance in the prediction. The test set performances of the different combination methods are plotted in Figure 2 (top). With no regularization, all averaging methods show dramatically better performance if compared to the individual estimators. The variance-based approaches VW and VB are both better than the other averaging methods if λ is close to zero. This result seems to indicate that the variance-based methods successfully recognize the large variance in networks due to local overtraining. With increasing λ, the individual networks, as well as all averaging methods, improve performance. With increasing λ, the performances of AV and VW—the approaches in which the networks were trained on complete data—converge to ASSEC (λ). In particular for 0.3 < λ < 1 the performance of bagging is impressive and is superior to AV and VW. Most striking, however, VB shows very good performance for any λ and seems to combine the advantages of both VW and BA. Note that the optimum for the individual estimators is at a larger degree of regularization than the optimum of the bagging approaches BA and VB. For AV and VW, best performance coincides with the best performance of the averaged individual networks trained on complete data. Figure 2 (bottom) shows the number of single estimators that were worse than the averaging methods in percentages. The impressive performance of VB is also apparent: In the large majority of settings for λ, VB is better than all individual networks. Bagging shows excellent performance up to λ ≈ 1, and AV and VW are best for small values of λ <≈ 0.5. 4.3 DAX Data. The DAX data consist of 2564 samples from March 5, 1984, to December 30, 1993. The goal is to predict the value of the DAX of the following day based on past measurements of the DAX. As input variables, 12 indicators were calculated (see appendix B). All inputs and the output were normalized with zero mean and variance of one. The training data set consisted of K = 2150 samples (from March 5, 1984 to May 29, 1992), and the test set consisted of P = 414 test samples (from June 1, 1992 to December 30, 1993). Figure 3 (top) shows ASSEC (λ) and ASSEB (λ) as a function of the regularization parameter λ. The graph shows qualitatively the same characteristics as in the previous experiment. Again, for small λ, we see overfitting. ASSEC (λ) is optimal at λ ≈ 30, and ASSEB (λ) is optimal at λ ≈ 50. Again, ASSEB (λ) is always larger than ASSEC (λ) since the networks trained on bootstrap replicates have seen fewer distinct data. Figure 3 (bottom) shows the average standard deviation of the prediction (ASTDC (λ), ASTDB (λ)). Again, for increasing λ, ASTDC (λ) quickly approaches zero, whereas ASTDB (λ) still assumes relative large values. The test set performances of the different combination methods is plotted in Figure 4 (top and center). With no regularization, all averaging methods
1172
Michiaki Taniguchi and Volker Tresp
3
2.9
AV BA
2.8
VW
ASSE^B
VB 2.7
ASSE
ASSE^C 2.6
2.5
2.4
2.3
2.2 0
0.5
1
1.5 lambda
2
2.5
3
0.5
1
1.5 lambda
2
2.5
3
100
90
Percentage
80
70
60
50
40 0
Figure 2: (Top) ASSEcomb (λ) of the different averaging approaches for the Breast Cancer data. Displayed is AV (dashed line), BA (dotted line), VW (dash-dotted line), and VB (lower solid line). The highest continuous line shows the average performance of the individual bagging networks, and the second highest line shows the average performance of the networks trained on the complete data. (Bottom) The number of single estimators trained on complete data, which are worse than the averaging methods for the Breast Cancer data (in percentages).
show much better performance if compared to the individual estimators. The variance-based approaches VW and VB are slightly better than the other averaging methods at λ = 0. Interestingly, with increasing λ, the performances of the averaging methods first decrease but have a global optimum for intermediate values of λ. This can be explained by the drastic improvement of the individual estimators with optimal λ. All averaging methods show excellent performance for 30 < λ < 40. Best performance
Averaging Regularized Estimators
1173
280
270
260
ASSE
250
240
230
220
210 0
10
20
30
40
50 lambda
60
70
80
90
100
30
40
50 lambda
60
70
80
90
100
0.4
0.35
0.3
ASTD
0.25
0.2
0.15
0.1
0.05
0 0
10
20
Figure 3: (Top) ASSEC (solid line) and ASSEB (dash-dotted line) as a function of λ for the DAX data. (Bottom) ASTDC (solid line) and ASTDB (dash-dotted line) as a function of λ for the DAX data.
is achieved for VB at λ = 40. Note that for λ > 50, AV and VW are not noticably better than the average of the individual networks, whereas BA and VB are still considerably better than the mean of the networks trained on bootstrap samples. Figure 4 (bottom) shows the number of single estimators that were worse than the averaging methods in percentage. Apparent is the impressive performance of all averaging methods if no regularization is used. For λ < 50, all averaging methods are better than 65 percent of all individual estimators. For λ > 50, BA and VB are considerably better than the individual estimators, whereas the improvement achieved with AV and VW decreases quickly.
1174
Michiaki Taniguchi and Volker Tresp
219
218
ASSE
217
216
215
214
213 0
0.5
1
1.5
2
2.5
lambda
219
AV
218
BA VW
217
VB
ASSE
216
215
ASSE^B
214 ASSE^C 213
212
211 0
10
20
30
40
50 lambda
60
70
80
90
100
60
70
80
90
100
95 90 85
Percentage
80 75 70 65 60 55 50 45 0
10
20
30
40
50 lambda
Figure 4: (Top and center) ASSE of the different averaging approaches for the DAX data for different scales of λ. Displayed are AV (dashed line), BA (dotted line), VW (dash-dotted line), and VB (lower solid line). The highest continuous line shows the average performance of the individual bagging networks, and the second highest line shows the average performance of the networks trained on the complete data. (Bottom) The figure shows the number of single estimators trained on complete data, which are worse than the averaging methods for the DAX data (in percentage).
Averaging Regularized Estimators
1175
5 Discussion The experiments confirmed the theory but also showed some unexpected results. As predicted, the variance in the networks decreases rapidly with increasing weight-decay parameter λ if networks are trained on complete data, as shown in Figures 1 and 3. On the other hand, if networks are trained on bootstrap replicates, we obtain a large variance in the networks even at relatively large values of λ. The performance of the individual networks is worse for networks trained on bootstrap replicates since each estimator has seen a smaller number of distinct data. The relative improvement in performance by averaging increases with increasing variance in the estimates and bias hurts. Therefore for all averaging methods, the relative improvement is maximum if no weight decay is used. On the other hand, the performances of the networks improve with weight decay. So, as confirmed by experiment, regularization also improves the averaged systems. Simple averaging (AV) shows good performance at small values of λ. Even better at small λ is variance-based weighting (VW). The reason is that local overtraining is reflected in large variance. The corresponding estimator consequently obtains a small weight. With increasing λ, the performance of both AV and VW becomes comparable, and both approach the performance of an average individual estimator trained on complete data for large λ. This confirms that all estimators are highly correlated for large λ, as shown in Figures 1 and 3. Bagging (BA) displays better performance than AV and VW up to intermediate values of λ except when λ is extremely small or zero. This can be explained by the fact that training on bootstrap samples results in considerable variance in the networks even for large λ. Variance-based bagging (VB) seems to combine the advantages of both variance-based weighting and bagging. If networks are overtrained, they locally have large variance and obtain a small weight locally. Training on bootstrap replicates introduces additional variance in the networks, which is particularly useful for large λ. In our first experiment (Breast Cancer data), variance-based bagging was the overall best combining method over a wide range of degrees of regularization. In the second experiment (DAX data) BA and VB show similar performance for intermediate and large values of λ, but VB shows superior performance for small λ. 6 Conclusions Based on our experiments we can conclude that in comparison with the individual estimators, averaging improves performance at all levels of regularization. In particular we also obtain improvements with respect to optimally regularized estimators, although the degree of improvement is application specific. Averaging is less sensitive with respect to the regularization parameter λ if compared to the individual estimators. Especially if the individual estimators overfit, averaging still gives excellent performance. Overall, bag-
bagging and variance-based bagging, which both use networks trained on bootstrap replicates, work well for a wide range of values of λ. At extremely small values of λ, variance-based weighting and variance-based bagging are clearly superior to the other averaging approaches.

Appendix A: Performance Criteria

Let $f^{C,\lambda}_{i,j}$ be the $i$th estimator in the $j$th run, trained on the complete training data with regularization parameter $\lambda$ and evaluated on the test set. The summed squared error for $f^{C,\lambda}_{i,j}$ is defined as
$$ \mathrm{SSE}^{C}_{i,j}(\lambda) = \sum_{p=1}^{P} \left( y^{p}_{j} - f^{C,\lambda}_{i,j}(x^{p}_{j}) \right)^{2}, $$
where $T_j = \{(x^{p}_{j}, y^{p}_{j})\}_{p=1}^{P}$ are the test data in the $j$th run. As performance criterion for the individual estimators, we use the averaged summed squared error, where we average over all $M$ estimators and all $R$ runs:
$$ \mathrm{ASSE}^{C}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \frac{1}{M} \sum_{i=1}^{M} \mathrm{SSE}^{C}_{i,j}(\lambda). $$
The performance measures for networks trained on bootstrap replicates, $\mathrm{SSE}^{B}_{i,j}(\lambda)$ and $\mathrm{ASSE}^{B}(\lambda)$, are defined analogously. To measure the performance of the combining methods we define, similarly,
$$ \mathrm{SSE}^{\mathrm{comb}}_{j}(\lambda) = \sum_{p=1}^{P} \left( y^{p}_{j} - \hat{t}^{\lambda}_{j}(x^{p}_{j}) \right)^{2}, $$
where $\hat{t}^{\lambda}_{j}$ is the response of the combined system from equation 3.2 in the $j$th run for regularization parameter $\lambda$. The averaged SSE for the combining systems is defined as
$$ \mathrm{ASSE}^{\mathrm{comb}}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \mathrm{SSE}^{\mathrm{comb}}_{j}(\lambda), $$
where $\mathrm{comb} \in \{\mathrm{AV}, \mathrm{BA}, \mathrm{VW}, \mathrm{VB}\}$. The averaged standard deviation of the predictions of the neural networks trained on complete data is defined as
$$ \mathrm{ASTD}^{C}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \sqrt{ \frac{1}{P} \sum_{p=1}^{P} \frac{1}{M} \sum_{i=1}^{M} \left( f^{C,\lambda}_{i,j}(x^{p}_{j}) - \bar{f}^{C,\lambda}_{j}(x^{p}_{j}) \right)^{2} } $$
with
$$ \bar{f}^{C,\lambda}_{j}(x^{p}_{j}) = \frac{1}{M} \sum_{i=1}^{M} f^{C,\lambda}_{i,j}(x^{p}_{j}). $$
Table A.1: Input Variables for Predicting the DAX at Day t + 1.

1. $y(t)/y(t-1)$
2. $\log(y(t)) - \frac{1}{5} \sum_{i=0}^{4} \log(y(t-i))$
3. $\log(y(t)) - \frac{1}{10} \sum_{i=0}^{9} \log(y(t-i))$
4. $\bigl(\log(y(t)) - \log(y(t-5))\bigr) - \bigl(\log(y(t)) - \log(y(t-6))\bigr)$
5. $\bigl(\log(y(t)) - \log(y(t-10))\bigr) - \bigl(\log(y(t)) - \log(y(t-11))\bigr)$
6. $\mathrm{rsi}(\log(y(t)), 5)$
7. $\mathrm{rsi}(\log(y(t)), 10)$
8. $\dfrac{\log(y(t)) - \min_{i=1,\dots,4} \log(y(t-i))}{\max_{i=1,\dots,4} \log(y(t-i)) - \min_{i=1,\dots,4} \log(y(t-i))}$
9. $\dfrac{\log(y(t)) - \min_{i=1,\dots,9} \log(y(t-i))}{\max_{i=1,\dots,9} \log(y(t-i)) - \min_{i=1,\dots,9} \log(y(t-i))}$
10. $\dfrac{\frac{1}{3} \sum_{k=0}^{2} \log(y(t-k)) - \min_{i=1,\dots,4} \log(y(t-i))}{\max_{i=1,\dots,4} \log(y(t-i)) - \min_{i=1,\dots,4} \log(y(t-i))}$
11. $\dfrac{\frac{1}{3} \sum_{k=0}^{2} \log(y(t-k)) - \min_{i=1,\dots,9} \log(y(t-i))}{\max_{i=1,\dots,9} \log(y(t-i)) - \min_{i=1,\dots,9} \log(y(t-i))}$
12. $\mathrm{vol}(t)/\mathrm{vol}(t-1)$
$\mathrm{ASTD}^{C}$ is a measure of the degree of variance among the individual networks. The corresponding measure for networks trained on bootstrap replicates, $\mathrm{ASTD}^{B}(\lambda)$, is defined analogously.

Appendix B: The DAX Inputs

Table A.1 describes the input variables used in the DAX data. $y(t)$ is the DAX at day $t$, and $\mathrm{vol}(t)$ is the total volume of the transactions at the German stock market at day $t$. Furthermore,
$$ \mathrm{rsi}(y(t), n) = \frac{ \sum_{i=0}^{n-1} b\bigl( y(t-i) - y(t-i-1) \bigr) }{ \sum_{i=0}^{n-1} \bigl| y(t-i) - y(t-i-1) \bigr| } $$
with
$$ b(z) = \begin{cases} 0 & \text{if } z \le 0, \\ z & \text{if } z > 0. \end{cases} $$
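As a quick numerical illustration (our own toy numbers, not from the experiments): take $n = 2$ with $y(t-2) = 101$, $y(t-1) = 100$, and $y(t) = 102$, so the two successive differences are $+2$ and $-1$. Then
$$ \mathrm{rsi}(y(t), 2) = \frac{b(2) + b(-1)}{|2| + |{-1}|} = \frac{2 + 0}{3} = \frac{2}{3}, $$
that is, two-thirds of the recent absolute movement of the series was upward.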
For a motivation of the preprocessing and further details, see Dichtl (1995).

Acknowledgments

This research was partially supported by the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie, grant number 01 IN 505 A.
References

Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 421). Berkeley: Department of Statistics, University of California at Berkeley.
Dichtl, H. (1995). Zur Prognose des Deutschen Aktienindex DAX mit Hilfe von Neuro-Fuzzy-Systemen. Beiträge zur Theorie der Finanzmärkte, 12. Frankfurt: Institut für Kapitalmarktforschung, J. W. Goethe-Universität.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments. Neural Computation, 7, 867–888.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Meir, R. (1995). Bias, variance and the combination of least squares estimators. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Perrone, M. P. (1993). Improving regression estimates: Averaging methods for variance reduction with extensions to general convex measure optimization. Doctoral dissertation, Brown University, Providence, RI.
Taniguchi, M., & Tresp, V. (1995). Variance-based combination of estimators trained by bootstrap replicates. In Proc. International Symposium on Artificial Neural Networks. Hsinchu, Taiwan.
Tibshirani, R. (1994). A comparison of some error estimates for neural network models (Tech. Rep.). Toronto: Department of Statistics, University of Toronto.
Tresp, V., & Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Received March 27, 1996; accepted August 8, 1996.
REVIEW
Communicated by Dan Johnston
The NEURON Simulation Environment

M. L. Hines, Department of Computer Science and Neuroengineering and Neuroscience Center, Yale University, New Haven, CT 06520, U.S.A.
N. T. Carnevale Psychology Department, Yale University, New Haven, CT 06520, U.S.A.
The moment-to-moment processing of information by the nervous system involves the propagation and interaction of electrical and chemical signals that are distributed in space and time. Biologically realistic modeling is needed to test hypotheses about the mechanisms that govern these signals and how nervous system function emerges from the operation of these mechanisms. The NEURON simulation program provides a powerful and flexible environment for implementing such models of individual neurons and small networks of neurons. It is particularly useful when membrane potential is nonuniform and membrane currents are complex. We present the basic ideas that would help informed users make the most efficient use of NEURON.

1 Introduction

NEURON (Hines, 1984, 1989, 1993, 1994) provides a powerful and flexible environment for implementing biologically realistic models of electrical and chemical signaling in neurons and networks of neurons. This article describes the concepts and strategies that have guided the design and implementation of this simulator, with emphasis on those features that are particularly relevant to its most efficient use.
1.1 The Problem Domain. Information processing in the brain results from the spread and interaction of electrical and chemical signals within and among neurons. This involves nonlinear mechanisms that span a wide range of spatial and temporal scales (Carnevale & Rosenthal, 1992) and are constrained to operate within the intricate anatomy of neurons and their interconnections. Consequently the equations that describe brain mechanisms generally do not have analytical solutions, and intuition is not a reliable guide to understanding the working of the cells and circuits of the brain. Furthermore, these nonlinearities and spatiotemporal complexities are quite unlike those that are encountered in most nonbiological systems,
so the utility of many quantitative and qualitative modeling tools that were developed without taking these features into consideration is severely limited. NEURON is designed to address these problems by enabling both the convenient creation of biologically realistic quantitative models of brain mechanisms and the efficient simulation of the operation of these mechanisms. In this context the term biological realism does not mean "infinitely detailed." Instead, it means that the choice of which details to include in the model and which to omit is at the discretion of the investigator who constructs the model, and not forced by the simulation program.

To the experimentalist, NEURON offers a tool for cross-validating data, estimating experimentally inaccessible parameters, and deciding whether known facts account for experimental observations. To the theoretician, it is a means for testing hypotheses and determining the smallest subset of anatomical and biophysical properties that is necessary and sufficient to account for particular phenomena. To the student in a laboratory course, it provides a vehicle for illustrating and exploring the operation of brain mechanisms in a simplified form that is more robust than the typical "wet lab" experiment. For the experimentalist, theoretician, and student alike, a powerful simulation tool such as NEURON can be an indispensable aid to developing the insight and intuition that is needed if one is to discover the order hidden within the intricacy of biological phenomena, the order that transcends the complexity of accident and evolution.

1.2 Experimental Advances and Quantitative Modeling. Experimental advances drive and support quantitative modeling. Over the past two decades, the field of neuroscience has seen striking developments in experimental techniques, among them the following:

• High-quality electrical recording from neurons in vitro and in vivo using patch clamp.
• Multiple impalements of visually identified cells.
• Simultaneous intracellular recording from paired pre- and postsynaptic neurons.
• Simultaneous measurement of electrical and chemical signals.
• Multisite electrical and optical recording.
• Quantitative analysis of anatomical and biophysical properties from the same neuron.
• Photolesioning of cells.
• Photorelease of caged compounds for spatially precise chemical stimulation.
• New drugs such as channel blockers and receptor agonists and antagonists.
• Genetic engineering of ion channels and receptors.
• Analysis of messenger RNA and biophysical properties from the same neuron.
• "Knockout" mutations.

These and other advances are responsible for impressive progress in the definition of the molecular biology and biophysics of receptors and channels; the construction of libraries of identified neurons and neuronal classes that have been characterized anatomically, pharmacologically, and biophysically; and the analysis of neuronal circuits involved in perception, learning, and sensorimotor integration. The result is a data avalanche that catalyzes the formulation of new hypotheses of brain function, while at the same time serving as the empirical basis for the biologically realistic quantitative models that must be used to test these hypotheses. Following are some examples from the large list of topics that have been investigated through the use of such models:

• The cellular mechanisms that generate and regulate chemical and electrical signals (Destexhe, Contreras, Steriade, Sejnowski, & Huguenard, 1996; Jaffe et al., 1994).
• Drug effects on neuronal function (Lytton & Sejnowski, 1992).
• Presynaptic (Lindgren & Moore, 1989) and postsynaptic (Destexhe & Sejnowski, 1995; Traynelis, Silver, & Cull-Candy, 1993) mechanisms underlying communication between neurons.
• Integration of synaptic inputs (Bernander, Douglas, Martin, & Koch, 1991; Cauller & Connors, 1992).
• Action potential initiation and conduction (Häusser, Stuart, Racca, & Sakmann, 1995; Hines & Shrager, 1991; Mainen, Joerges, Huguenard, & Sejnowski, 1995).
• Cellular mechanisms of learning (Brown, Zador, Mainen, & Claiborne, 1992; Tsai, Carnevale, & Brown, 1994a).
• Cellular oscillations (Destexhe, Babloyantz, & Sejnowski, 1993a; Lytton, Destexhe, & Sejnowski, 1996).
• Thalamic networks (Destexhe, McCormick, & Sejnowski, 1993b; Destexhe, Contreras, Sejnowski, & Steriade, 1994).
• Neural information encoding (Hsu et al., 1993; Mainen & Sejnowski, 1995; Softky, 1994).
2 Overview of NEURON

NEURON is intended to be a flexible framework for handling problems in which membrane properties are spatially inhomogeneous and where membrane currents are complex. Since it was designed specifically to simulate the equations that describe nerve cells, NEURON has three important advantages over general-purpose simulation programs. First, the user is not required to translate the problem into another domain, but instead is able to deal directly with concepts that are familiar at the neuroscience level. Second, NEURON contains functions that are tailored specifically for controlling the simulation and graphing the results of real neurophysiological problems. Third, its computational engine is particularly efficient because of the use of special methods and tricks that take advantage of the structure of nerve equations (Hines, 1984; Mascagni, 1989).

The general domain of nerve simulation, however, is still too large for any single program to deal optimally with every problem. In practice, each program has its origin in a focused attempt to solve a restricted class of problems. Both speed of simulation and the ability of the user to maintain conceptual control degrade when any program is applied to problems outside the class for which it is best suited.

NEURON is computationally most efficient for problems that range from parts of single cells to small numbers of cells in which cable properties play a crucial role. In terms of conceptual control, it is best suited to tree-shaped structures in which the membrane channel parameters are approximated by piecewise linear functions of position. Two classes of problems for which it is particularly useful are those in which it is important to calculate ionic concentrations and those where one needs to compute the extracellular potential just next to the nerve membrane. It is especially capable for investigating new kinds of membrane channels since they are described using a high-level neuron model description language (NMODL) (Moore & Hines, 1996), which allows the expression of models in terms of kinetic schemes or sets of simultaneous differential and algebraic equations. To maintain efficiency, user-defined mechanisms in NMODL are automatically translated into C, compiled, and linked into the rest of NEURON.

The flexibility of NEURON comes from a built-in object-oriented interpreter, which is used to define the morphology and membrane properties of neurons, control the simulation, and establish the appearance of a graphical interface. The default graphical interface is suitable for exploratory simulations involving the setting of parameters, control of voltage and current stimuli, and graphing variables as a function of time and position.

Simulation speed is excellent since membrane voltage is computed by an implicit integration method optimized for branched structures (Hines, 1984). The performance of NEURON degrades very slowly with increased complexity of morphology and membrane mechanisms, and it has been applied to very large network models: 10^4 cells with six compartments each,
for a total of 10^6 synapses in the net (T. Sejnowski, personal communication, March 29, 1996).

3 Mathematical Basis

Strategies for numerical solution of the equations that describe chemical and electrical signaling in neurons have been discussed in many places. Elsewhere we have briefly presented an intuitive rationale for the most commonly used methods (Hines & Carnevale, 1995). Here we start from this base and proceed to address those aspects that are most pertinent to the design and application of NEURON.

3.1 The Cable Equation. The application of cable theory to the study of electrical signaling in neurons has a long history, which is briefly summarized elsewhere (Rall, 1989). The basic computational task is to solve numerically the cable equation
$$ \frac{\partial V}{\partial t} + I(V, t) = \frac{\partial^2 V}{\partial x^2}, \qquad (3.1) $$
which describes the relationship between current and voltage in a one-dimensional cable. The branched architecture typical of most neurons is incorporated by combining equations of this form with appropriate boundary conditions. Spatial discretization of this partial differential equation is equivalent to reducing the spatially distributed neuron to a set of connected compartments. The earliest example of a multicompartmental approach to the analysis of dendritic electrotonus was provided by Rall (1964). Spatial discretization produces a family of ordinary differential equations of the form
$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion}_j} = \sum_k \frac{v_k - v_j}{r_{jk}}. \qquad (3.2) $$
Equation 3.2 is a statement of Kirchhoff's current law, which asserts that net transmembrane current leaving the jth compartment must equal the sum of axial currents entering this compartment from all sources (see Figure 1). The left-hand side of equation 3.2 is the total membrane current, which is the sum of capacitive and ionic components. The capacitive component is $c_j\,dv_j/dt$, where $c_j$ is the membrane capacitance of the compartment. The ionic component $i_{\mathrm{ion}_j}$ includes all currents through ionic channel conductances. The right-hand side of equation 3.2 is the sum of axial currents that enter this compartment from its adjacent neighbors. Currents injected through a microelectrode would be added to the right-hand side. The sign conventions for current are as follows: outward transmembrane current is positive; axial current flow into a region is positive; positive injected current drives $v_j$ in a positive direction.
Figure 1: The net current entering a region must equal zero.
Equation 3.2 involves two approximations. First, axial current is specified in terms of the voltage drop between the centers of adjacent compartments. The second approximation is that spatially varying membrane current is represented by its value at the center of each compartment. This is much less drastic than the often-heard statement that a compartment is assumed to be "isopotential." It is far better to picture the approximation in terms of voltage varying linearly between the centers of adjacent compartments. Indeed, the linear variation in voltage is implicit in the usual description of a cable in terms of discrete electrical equivalent circuits. If the compartments are of equal size, it is easy to use Taylor's series to show that both of these approximations have errors proportional to the square of compartment length. Thus, replacing the second partial derivative by its central difference approximation introduces errors proportional to $\Delta x^2$, and doubling the number of compartments reduces the error by a factor of four.
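The Taylor-series step behind that claim is worth spelling out (a standard argument, stated here in our notation for convenience). Expanding $v$ about $x$,
$$ v(x \pm \Delta x) = v(x) \pm \Delta x\, v'(x) + \frac{\Delta x^2}{2}\, v''(x) \pm \frac{\Delta x^3}{6}\, v'''(x) + O(\Delta x^4), $$
the odd-order terms cancel in the sum, so
$$ \frac{v(x + \Delta x) - 2v(x) + v(x - \Delta x)}{\Delta x^2} = v''(x) + O(\Delta x^2). $$
Halving $\Delta x$ (doubling the number of compartments) therefore reduces the leading error term by a factor of four.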
It is often not convenient for the size of all compartments to be equal. Unequal compartment size might be expected to yield simulations that are only first-order accurate. However, comparison of simulations in which unequal compartments are halved or quartered in size generally reveals a second-order reduction of error. A rough rule of thumb is that simulation error is proportional to the square of the size of the largest compartment.

The first of two special cases of equation 3.2 that we wish to discuss allows us to recover the usual parabolic differential form of the cable equation. Consider the interior of an unbranched cable with constant diameter. The axial current consists of two terms involving compartments with the natural indices $j - 1$ and $j + 1$, that is,
$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion}_j} = \frac{v_{j-1} - v_j}{r_{j-1,j}} + \frac{v_{j+1} - v_j}{r_{j,j+1}}. $$
If the compartments have the same length $\Delta x$ and diameter $d$, then the capacitance of a compartment is $C_m \pi d \Delta x$, and the axial resistance is $R_a \Delta x / \pi (d/2)^2$. $C_m$ is called the specific capacitance of the membrane, which is generally taken to be 1 µF/cm². $R_a$ is the axial resistivity, which has different reported values for different cell classes (e.g., 35.4 Ω cm for squid axon). Equation 3.2 then becomes
$$ C_m \frac{dv_j}{dt} + i_j = \frac{d}{4 R_a}\, \frac{v_{j+1} - 2v_j + v_{j-1}}{\Delta x^2}, $$
where we have replaced the total ionic current $i_{\mathrm{ion}_j}$ with the current density $i_j$. The right-hand term, as $\Delta x \to 0$, is just $\partial^2 v / \partial x^2$ at the location of the now-infinitesimal compartment $j$.

The second special case of equation 3.2 allows us to recover the boundary condition. This is an important exercise since naive discretizations at the ends of the cable have destroyed the second-order accuracy of many simulations. Nerve boundary conditions are that no axial current flows at the end of the cable; the end is sealed. This is implicit in equation 3.2, where the right-hand side consists of only the single term $(v_{j-1} - v_j)/r_{j-1,j}$ when compartment $j$ lies at the end of an unbranched cable.

3.2 Spatial Discretization in a Biological Context: Sections and Segments. Every nerve simulation program solves for the longitudinal spread of voltage and current by approximating the cable equation as a series of compartments connected by resistors (see Figure 4 and equation 3.2). The sum of all the compartment areas is the total membrane area of the whole nerve. Unfortunately, it is usually not clear at the outset how many compartments should be used. Both the accuracy of the approximation and the computation time increase as the number of compartments used to represent the cable increases. When the cable is "short," a single compartment can be made to represent the entire cable adequately. For long cables or highly branched structures, it may be necessary to use a large number of compartments.

This raises the question of how best to manage all the parameters that exist within these compartments. Consider membrane capacitance, which
has a different value in each compartment. Rather than specify the capacitance of each compartment individually, it is better to deal in terms of a single specific membrane capacitance that is constant over the entire cell and have the program compute the values of the individual capacitances from the areas of the compartments. Other parameters such as diameter or channel density may vary widely over short distances, so the graininess of their representation may have little to do with numerically adequate compartmentalization. Although NEURON is a compartmental modeling program, the specification of biological properties (neuron shape and physiology) has been separated from the numerical issue of compartment size. What makes this possible is the notion of a section, which is a continuous length of unbranched cable. Although each section is ultimately discretized into compartments, values that can vary with position along the length of a section are specified in terms of a continuous parameter that ranges from 0 to 1 (normalized distance). In this way, section properties are discussed without regard to the number of segments used to represent it. This makes it easy to trade off between accuracy and speed, and enables convenient verification of the numerical correctness of simulations. Sections are connected together to form any kind of branched tree structure. Figure 2 illustrates how sections are used to represent biologically significant anatomical features. The top of this figure is a cartoon of a neuron with a soma that gives rise to a branched dendritic tree and an axon hillock connected to a myelinated axon. Each biologically significant component of this cell has its counterpart in one of the sections of the NEURON model, as shown in the bottom of Figure 2: the cell body (Soma), axon hillock (AH), myelinated internodes (Ii ), nodes of Ranvier (Ni ), and dendrites (Di ). Sections allow this kind of functional and anatomical parcellation of the cell to remain foremost in the mind of the person who constructs and uses a NEURON model. To accommodate requirements for numerical accuracy, NEURON represents each section by one or more segments of equal length (see Figures 3 and 4). The number of segments is specified by the parameter nseg, which can have a different value for each section. At the center of each segment is a node, the location where the internal voltage of the segment is defined. The transmembrane currents over the entire surface area of a segment are associated with its node. The nodes of adjacent segments are connected by resistors. It is crucial to realize that the location of the second-order correct voltage is not at the edge of a segment but rather at its center, that is, at its node. This is the discretization method employed by NEURON. To allow branching and injection of current at the precise ends of a section while maintaining second-order correctness, extra voltage nodes that represent compartments with zero area are defined at the section ends. It is possible to achieve second-order accuracy with sections whose end nodes have
Figure 2: (Top) Cartoon of a neuron indicating the approximate boundaries between biologically significant structures. The left-hand side of the cell body (Soma) is attached to an axon hillock (AH) that drives a myelinated axon (myelinated internodes Ii alternating with nodes of Ranvier Ni ). From the right-hand side of the cell body originates a branched dendritic tree (Di ). (Bottom) How sections would be employed in a NEURON model to represent these structures.
nonzero area compartments. However, the areas of these terminal compartments would have to be exactly half that of the internal compartments, and extra complexity would be imposed on administration of channel density at branch points. Based on the position of the nodes, NEURON calculates the values of internal model parameters such as the average diameter, axial resistance, and compartment area that are assigned to each segment. Figures 3 and 4 show how an unbranched portion of a neuron, called a neurite (see Figure 3A), is represented by a section with one or more segments. Morphometric analysis generates a series of diameter measurements whose centers lie on the midline of the neurite (the thin axial line in Figure 3B). These measurements and the path lengths between their centers are the dimensions of the section, which can be regarded as a chain of truncated cones or frusta (see Figure 3C). Distance along the length of a section is discussed in terms of the normalized position parameter x. That is, one end of the section corresponds to
Figure 3: (A) Cartoon of an unbranched neurite (thick lines) that is to be represented by a section in a NEURON model. Computer-assisted morphometry generates a file that stores successive measurements of diameter (thin circles) centered at x, y, z coordinates (thin crosses). (B) Each adjacent pair of diameter measurements (thick circles) becomes the parallel faces of a truncated cone or frustum, the height of which is the distance between the measurement locations. The outline of each frustum is shown with thin lines, and a thin centerline passes through the central axis of the chain of solids. (C) The centerline has been straightened so the faces of adjacent frusta are flush with each other. The scale underneath the figure shows the distance along the midline of the section in terms of the normalized position parameter x. The vertical dashed line at x = 0.5 divides the section into two halves of equal length. (D) Electrical equivalent circuit of the section as represented by a single segment (nseg = 1). The open rectangle includes all mechanisms for ionic (noncapacitive) transmembrane currents.
x = 0 and the other end to x = 1. In Figure 3C these locations are depicted as being on the left- and right-hand ends of the section. The locations of the nodes and the boundaries between segments are conveniently specified
Figure 4: How the neurite of Figure 3 would be represented by a section with two segments (nseg = 2). Now the electrical equivalent circuit (bottom) has two nodes. The membrane properties attached to the first and second nodes are based on neurite dimensions and biophysical parameters over the x intervals [0, 0.5] and [0.5, 1], respectively. The three axial resistances are computed from the cytoplasmic resistivity and neurite dimensions over the x intervals [0, 0.25], [0.25, 0.75], and [0.75, 1].
in terms of this normalized position parameter. In general, a section has nseg segments that are demarcated by evenly spaced boundaries at intervals of 1/nseg. The nodes at the centers of these segments are located at x = (2i − 1)/(2 nseg), where i is an integer in the range [1, nseg]. As we shall see later, x is also used in specifying model parameters or retrieving state variables that are a function of position along a section (see section 4.4). The special importance of x and nseg lies in the fact that they free the user from having to keep track of the correspondence between segment number and position on the nerve. In early versions of NEURON, all nerve properties were stored in vector variables where the vector index was the segment number. Changing the number of segments was an error-prone and laborious process that demanded a remapping of the relationship between the user's mental image of the biologically important features of the model, on the one hand, and the implementation of this model in a digital computer, on the other. The use of x and nseg insulates the user from the most inconvenient aspects of such low-level details.
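The node placement rule is easy to verify from the interpreter. The following is a minimal sketch (our own check, not one of the article's listings; the variable name n_demo is ours):

n_demo = 5  // node centers for a five-segment section
for i = 1, n_demo {
    print (2*i - 1) / (2*n_demo)  // prints 0.1, 0.3, 0.5, 0.7, 0.9
}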
When nseg = 1 the entire section is lumped into a single compartment. This compartment has only one node, which is located midway along its length, at x = 0.5 (see Figures 3C and D). The integral of the surface area over the entire length of the section (0 ≤ x ≤ 1) is used to calculate the membrane properties associated with this node. The values of the axial resistors are determined by integrating the cytoplasmic resistivity along the paths from the ends of the section to its midpoint (dashed line in Figure 3C). The left- and right-hand axial resistances of Figure 3D are evaluated over the x intervals [0, 0.5] and [0.5, 1], respectively.

Figure 4 shows what happens when nseg = 2. Now NEURON breaks the section into two segments of equal length that correspond to x intervals [0, 0.5] and [0.5, 1]. The membrane properties over these intervals are attached to the nodes at 0.25 and 0.75, respectively. The three axial resistors Ri1, Ri2, and Ri3 are determined by integrating the path resistance over the x intervals [0, 0.25], [0.25, 0.75], and [0.75, 1].

3.3 Integration Methods. Spatial discretization reduced the cable equation, a partial differential equation with derivatives in space and time, to a set of ordinary differential equations with first-order derivatives in time. Selection of an integration method to solve these equations is guided by concerns of stability, accuracy, and efficiency (Hines & Carnevale, 1995). NEURON offers the user a choice of two stable implicit integration methods: backward Euler and a variant of Crank-Nicholson (C-N).

Backward Euler is the default because of its robust numerical stability properties. Backward Euler can be used with extremely large time steps in order to find the steady-state solution for a linear ("passive") system. It produces good qualitative results even with large time steps, and it works even if some or all of the equations are strictly algebraic relations among states. A more accurate method for small time steps is available by setting the global parameter secondorder to 2. NEURON then uses a variant of the C-N method, in which numerical error is proportional to Δt².

Both of these are implicit integration methods, in which all current balance equations must be solved simultaneously. The backward Euler algorithm does not resort to iteration to deal with nonlinearities, since its numerical error is proportional to Δt anyway. The special feature of the C-N variant is its use of a staggered time step algorithm to avoid iteration of nonlinear equations (see section 3.3.1). This converts the current balance part of the problem to one that requires only the solution of simultaneous linear equations.

Although the C-N method is formally stable, it is sometimes plagued by spurious large-amplitude oscillations (see Hines and Carnevale, 1995, Figure 7). This occurs when Δt is too large, as may occur in models that involve fast voltage clamps or have compartments coupled by very small resistances. However, C-N is safe in most situations, and it can be much more efficient than backward Euler for a given accuracy.
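As a concrete sketch (our notation applied to equation 3.2, not reproduced from NEURON's internals), backward Euler solves the simultaneous system
$$ c_j\,\frac{v_j^{n+1} - v_j^n}{\Delta t} + i_{\mathrm{ion}_j}(v^{n+1}) = \sum_k \frac{v_k^{n+1} - v_j^{n+1}}{r_{jk}} $$
for all voltages at $t + \Delta t$, with local error $O(\Delta t)$, whereas the C-N variant effectively evaluates the currents at the midpoint $t + \Delta t/2$ (e.g., using $\tfrac{1}{2}(v^n + v^{n+1})$), which raises the accuracy to $O(\Delta t^2)$ but admits the poorly damped oscillations mentioned above when $\Delta t$ is large.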
These two methods are almost identical in terms of computational cost per time step (see section 3.3.1). Since the current balance equations have the structure of a tree (there are no current loops), direct gaussian elimination is optimal for their solution (Hines, 1984). This takes exactly the same number of computer operations as would be required for an unbranched cable with the same number of compartments.

The best way to determine the method of choice for any particular problem is to compare both methods with several values of Δt to see which allows the largest Δt consistent with the desired accuracy. In performing such trials, one must remember that the stability properties of a simulation depend on the entire system that is being modeled. Because of interactions between the "biological" components and any "nonbiological" elements, such as stimulators or voltage clamps, the time constants of the entire system may be different from those of the biological components alone. A current source (perfect current clamp) does not affect stability because it does not change the time constants. Any other signal source imposes a load on the compartment to which it is attached, changing the time constants and potentially requiring use of a smaller time step to avoid numerical oscillations in the C-N method. The more closely a signal source approximates a voltage source (perfect voltage clamp), the greater this effect will be.
3.3.1 Efficiency. Nonlinear equations generally need to be solved iteratively to maintain second-order correctness. However, voltage-dependent membrane properties, which are typically formulated in analogy to Hodgkin-Huxley (HH) type channels, allow the cable equation to be cast in a linear form, still second order correct, that can be solved without iterations. A direct solution of the voltage equations at each time step t → t + Δt using the linearized membrane current I(V, t) = G · (V − E) is sufficient as long as the slope conductance G and the effective reversal potential E are known to second order at time t + 0.5Δt. HH type channels are easy to solve at t + 0.5Δt since the conductance is a function of state variables, which can be computed using a separate time step that is offset by 0.5Δt with respect to the voltage equation time step. That is, to integrate a state from t − 0.5Δt to t + 0.5Δt, we require only a second-order correct value for the voltage-dependent rates at the midpoint time t. Figure 5 contrasts this approach with the common technique of replacing nonlinear coefficients by their values at the beginning of a time step. For HH equations in a single compartment, the staggered time grid approach converts four simultaneous nonlinear equations at each time step to four independent linear equations that have the same order of accuracy at each time step. Since the voltage-dependent rates use the voltage at the midpoint of the integration step, integration of channel states can be done analytically in just a single addition and multiplication operation and two table-lookup operations.
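A concrete sketch of that analytic update (our formula, consistent with the description above; $m_\infty$ and $\tau_m$ are the usual HH steady-state value and time constant): integrating a gating state $m$ from $t - \Delta t/2$ to $t + \Delta t/2$ with the voltage-dependent rates frozen at $v(t)$ gives
$$ m\!\left(t + \tfrac{\Delta t}{2}\right) = m_\infty(v(t)) + \left[ m\!\left(t - \tfrac{\Delta t}{2}\right) - m_\infty(v(t)) \right] e^{-\Delta t / \tau_m(v(t))}, $$
where $m_\infty(v)$ and the factor $e^{-\Delta t/\tau_m(v)}$ are the two table lookups, leaving roughly one multiplication and one addition per state per step.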
Figure 5: The equations shown at the top of the figure are computed using the Crank-Nicholson method. (Top) x(t + Δt) and y(t + Δt) are determined using their values at time t. (Bottom) Staggered time steps yield decoupled linear equations. y(t + Δt/2) is determined using x(t), after which x(t + Δt) is determined using y(t + Δt/2).
Although this efficient scheme achieves second-order accuracy, the trade-off is that the tables depend on the value of the time step and must be recomputed whenever the time step changes.

Neuronal architecture can also be exploited to increase computational efficiency. Since neurons generally have a branched tree structure with no loops, the number of arithmetic operations required to solve the cable equation by gaussian elimination is exactly the same as for an unbranched cable with the same number of compartments. That is, we need only O(N)
arithmetic operations for the equations that describe N compartments connected in the form of a tree, even though standard gaussian elimination generally takes O(N³) operations to solve N equations in N unknowns. The tremendous efficiency increase results from the fact that in a tree, one can always find a leaf compartment i that is connected to only one other compartment j, so that the equation for compartment i (see equation 3.3a) involves only the voltages in compartments i and j, and the voltage in leaf compartment i is involved only in the equations for compartments i and j (see equations 3.3a and 3.3b):
$$ a_{ii} V_i + a_{ij} V_j = b_i \qquad (3.3a) $$
$$ a_{ji} V_i + a_{jj} V_j + [\text{terms from other compartments}] = b_j. \qquad (3.3b) $$
Using equation 3.3a to eliminate the $V_i$ term from equation 3.3b, which requires O(1) (instead of N) operations, gives equation 3.4 and leaves N − 1 equations in N − 1 unknowns:
$$ a'_{jj} V_j + [\text{terms from other compartments}] = b'_j, \qquad (3.4) $$
where $a'_{jj} = a_{jj} - a_{ij} a_{ji} / a_{ii}$ and $b'_j = b_j - b_i a_{ji} / a_{ii}$. This strategy can be applied until there is only one equation in one unknown. Assume that we know the solution to these N − 1 equations, and in particular that we know $V_j$. Then we can find $V_i$ from equation 3.3a in O(1) operations. Therefore, the effort to solve these N equations is O(1) plus the effort needed to solve N − 1 equations. The number of operations required is independent of the branching structure, so a tree of N compartments uses exactly the same number of arithmetic operations as a one-dimensional cable of N compartments.

Efficient gaussian elimination requires an ordering that can be found by a simple algorithm: choose the equation with the current minimum number of terms as the equation to use in the elimination step. This minimum degree ordering algorithm is commonly employed in standard sparse matrix solver packages. For example, NEURON's Matrix class uses the matrix library written by Stewart and Leyk (1994). This and many other sparse matrix packages are freely available on the Internet at http://www.netlib.org.
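A worked miniature (our own numbers) may help. Take a Y-shaped cell in which leaf compartments 1 and 2 are each coupled only to compartment 3:
$$ 2V_1 - V_3 = 1, \qquad 2V_2 - V_3 = 1, \qquad -V_1 - V_2 + 4V_3 = 0. $$
Eliminating leaf 1 updates only the third equation: $a'_{33} = 4 - \frac{(-1)(-1)}{2} = 3.5$ and $b'_3 = 0 - \frac{(1)(-1)}{2} = 0.5$, giving $-V_2 + 3.5\,V_3 = 0.5$. Eliminating leaf 2 the same way gives $3V_3 = 1$, hence $V_3 = 1/3$, and back-substitution yields $V_1 = V_2 = 2/3$. Each elimination and each back-substitution touched a constant number of coefficients, which is exactly the O(N) behavior described above.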
4 The NEURON Simulation Environment

No matter how powerful and robust its computational engine may be, the real utility of any software tool depends largely on its ease of use. Therefore a great deal of effort has been invested in the design of the simulation environment provided by NEURON. In this section we first briefly consider general aspects of the high-level language used for writing NEURON programs. Then we turn to an example of a model of a nerve cell to introduce specific aspects of the user environment, after which we cover these features more thoroughly.

4.1 The hoc Interpreter. NEURON incorporates a programming language based on hoc, a floating-point calculator with C-like syntax described by Kernighan and Pike (1984). This interpreter has been extended by the addition of object-oriented syntax (not including polymorphism or inheritance) that can be used to implement abstract data types and data encapsulation. Other extensions include functions that are specific to the domain of neural simulations and functions that implement a graphical user interface (see below). With hoc one can quickly write short programs that meet most problem-specific needs. The interpreter is used to execute simulations, customize the user interface, optimize parameters, analyze experimental data, calculate new variables such as impulse propagation velocity, and so forth.

NEURON simulations are not subject to the performance penalty often associated with interpreted (as opposed to compiled) languages because computationally intensive tasks are carried out by highly efficient precompiled code. Some of these tasks are related to integration of the cable equation, and others are involved in the emulation of biological mechanisms that generate and regulate chemical and electrical signals.

NEURON provides a built-in implementation of the microemacs text editor. Since the choice of a programming editor is highly personal, NEURON will also accept hoc code in the form of straight ASCII files created with any other editor.

4.2 A Specific Example. In the following example we show how NEURON might be used to model the cell in the top of Figure 6. Comments in the hoc code are preceded by double slashes (//), and code blocks are enclosed in curly brackets ({}).

4.2.1 First Step: Establish Model Topology. One very important feature of NEURON is that it allows the user to think about models in terms that are familiar to the neurophysiologist, keeping numerical issues (e.g., number of spatial segments) entirely separate from the specification of morphology and biophysical properties. This separation is achieved through the use of one-dimensional cable "sections" as the basic building block from which model cells are constructed. These sections can be connected together to
Figure 6: (Top) Cartoon of a neuron with a soma, three dendrites, and an unmyelinated axon (not to scale). The diameter of the spherical soma is 50 µm. Each dendrite is 200 µm long and tapers uniformly along its length from 10 µm diameter at its site of origin on the soma, to 3 µm at its distal end. The unmyelinated cylindrical axon is 1000 µm long and has a diameter of 1 µm. An electrode (not shown) is inserted into the soma for intracellular injection of a stimulating current. (Bottom) Topology of a NEURON model that represents this cell (see text for details).
form any kind of branched cable and endowed with properties that may vary with position along their length. The idealized neuron in Figure 6 has several anatomical features whose existence and spatial relationships we want the model to include: a cell body (soma), three dendrites, and an unmyelinated axon. The following hoc code sets up the basic topology of the model:

create soma, axon, dendrite[3]
connect axon(0), soma(0)
for i=0,2 {
    connect dendrite[i](0), soma(1)
}
The program starts by creating named sections that correspond to the important anatomical features of the cell. These sections are attached to each other using connect statements. As noted previously, each section has a normalized position parameter x, which ranges from 0 at one end to 1 at the other. Because the axon and dendrites arise from opposite sides of the cell body, they are connected to the 0 and 1 ends of the soma section (see the bottom of Figure 6).
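At this point it can be reassuring to ask NEURON to echo the tree it has built. This optional check is not part of the original listing; topology() and psection() are built-in hoc functions:

topology()         // print an ASCII schematic of how the sections connect
forall psection()  // print each section's dimensions and mechanisms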
A child section can be attached to any location on the parent, but attachment at locations other than 0 or 1 is generally employed only in special cases such as spines on dendrites.

4.2.2 Second Step: Assign Anatomical and Biophysical Properties. Next we set the anatomical and biophysical properties of each section. Each section has its own segmentation, length, and diameter parameters, so it is necessary to indicate which section is being referenced. There are several ways to declare which is the currently accessed section, but here the most convenient is to precede blocks of statements with the appropriate section name:

// specify anatomical and biophysical properties
soma {
    nseg = 1    // compartmentalization parameter
    L = 50      // [um] length
    diam = 50   // [um] diameter
    insert hh   // standard Hodgkin-Huxley currents
    gnabar_hh = 0.5*0.120  // max HH sodium conductance
}
axon {
    nseg = 20
    L = 1000
    diam = 1
    insert hh
}
for i=0,2 dendrite[i] {
    nseg = 5
    L = 200
    diam(0:1) = 10:3  // dendritic diameter tapers along its length
    insert pas        // standard passive current
    e_pas = -65       // [mV] equilibrium potential for passive current
    g_pas = 0.001     // [siemens/cm2] conductance for passive current
}
The fineness of the spatial grid is determined by the compartmentalization parameter nseg. Here the soma is lumped into a single compartment (nseg = 1), while the axon and each of the dendrites are broken into several subcompartments (nseg = 20 and 5, respectively). In this example, we specify the geometry of each section by assigning values directly to section length and diameter. This creates a stylized model. Alternatively, one can use the 3D method, in which NEURON computes
section length and diameter from a list of (x, y, z, diam) measurements (see section 4.5). Since the axon is a cylinder, the corresponding section has a fixed diameter along its entire length. The spherical soma is represented by a cylinder with the same surface area as the sphere. The dimensions and electrical properties of the soma are such that its membrane will be nearly isopotential, so the cylinder approximation is not a significant source of error. If chemical signals such as intracellular ion concentrations were important in this model, it would be necessary to approximate not only the surface area but also the volume of the soma.

Unlike the axon, the dendrites become progressively narrower with distance from the soma. Furthermore, unlike the soma, they are too long to be lumped into a single compartment with constant diameter. The taper of the dendrites is accommodated by assigning a sequence of decreasing diameters to their segments. This is done through the use of range variables, which are discussed in section 4.4.

In this model the soma and axon contain HH sodium, potassium, and leak channels (Hodgkin & Huxley, 1952), while the dendrites have constant, linear ("passive") ionic conductances. The insert statement assigns the biophysical mechanisms that govern electrical signals in each section. Particular values are set for the density of sodium channels on the soma (gnabar_hh) and for the ionic conductance and equilibrium potential of the passive current in the dendrites (g_pas and e_pas). More information about membrane mechanisms is presented in section 4.6.

4.2.3 Third Step: Attach Stimulating Electrodes. This code emulates the use of an electrode to inject a stimulating current into the soma by placing a current pulse stimulus in the middle of the soma section. The stimulus starts at t = 1 ms, lasts for 0.1 ms, and has an amplitude of 60 nA.

objref stim
soma stim = new IClamp(0.5)  // put it in middle of soma
stim.del = 1                 // [ms] delay
stim.dur = 0.1               // [ms] duration
stim.amp = 60                // [nA] amplitude
The stimulating electrode is an example of a point process. Point processes are discussed in more detail in section 4.6.

4.2.4 Fourth Step: Control Simulation Time Course. At this point all model parameters have been specified. All that remains is to define the simulation parameters, which govern the time course of the simulation, and write some code that executes the simulation. This is generally done in two procedures. The first procedure initializes the membrane potential and the states of the inserted mechanisms (channel states, ionic concentrations, extracellular potential next to the membrane).
The second procedure repeatedly calls the built-in single-step integration function fadvance() and saves, plots, or computes functions of the desired output variables at each step. In this procedure it is possible to change the values of model parameters during a run.

dt = 0.05         // [ms] integration time step
tstop = 5         // [ms]
finitialize(-65)  // initialize membrane potential, state variables, and time

proc integrate() {
    print t, soma.v(0.5)  // show starting time and initial somatic membrane potential
    while (t < tstop) {
        fadvance()  // advance solution by dt
        // function calls to save or plot results would go here
        print t, soma.v(0.5)  // show present time and somatic membrane potential
        // statements that change model parameters would go here
    }
}
The built-in function finitialize() initializes time t to 0, membrane potential v to −65 mV throughout the model, and the HH state variables m, n, and h to their steady-state values at v = −65 mV. Initialization can also be performed with a user-written routine if there are special requirements that finitialize() cannot accommodate, such as nonuniform membrane potential.

Both the integration time step dt and the solution time t are global variables. For this example dt = 50 µs. The while(){} statement repeatedly calls fadvance(), which integrates the model equations over the interval dt and increments t by dt on each call. For this example, the time and somatic membrane potential are displayed at each step. This loop exits when t ≥ tstop.

When this program is first processed by the NEURON interpreter, the model is set up and initialized, but the integrate() procedure is not executed. When the user enters an integrate() statement in the NEURON interpreter window, the simulation advances for 5 ms using 50 µs time steps.
4.3 Section Variables. Three parameters apply to the section as a whole: cytoplasmic resistivity Ra (Ω cm), the section length L, and the compartmentalization parameter nseg. The first two are ordinary in the sense that they do not affect the structure of the equations that describe the model. Note that the hoc code specifies values for L but not for Ra. This is because each section in a model is likely to have a different length, whereas the cytoplasm (and therefore Ra) is usually assumed to be uniform throughout the cell. The default value of Ra is 35.4 Ω cm, which is appropriate for invertebrate neurons. Like L it can be assigned a new value in any or all sections (e.g., ~200 Ω cm for mammalian neurons).

The user can change the compartmentalization parameter nseg without having to modify any of the statements that set anatomical or biophysical properties. However, if parameters vary with position in a section, care must be taken to ensure that the model incorporates the spatial detail inherent in the parameter description.
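For instance, a mammalian value of Ra can be applied to every section with a single forall statement (a minimal sketch, not part of the original listing):

forall Ra = 200  // [ohm cm] cytoplasmic resistivity, set in all sections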
4.4 Range Variables. Like dendritic diameter in our example, most cellular properties are functions of the position parameter x. NEURON has special provisions for dealing with these properties, which are called range variables. Other examples of range variables are the membrane potential v and ionic conductance parameters such as the maximum HH sodium conductance gnabar_hh (siemens/cm²). Range variables enable the user to separate property specification from segment number.

A range variable is assigned a value in one of two ways. The simpler and more common is as a constant. For example, the statement axon.diam = 10 asserts that the diameter of the axon is uniform over its entire length. The syntax for a property that changes along the length of a section is rangevar(xmin : xmax) = e1 : e2. The four italicized symbols are expressions, with e1 and e2 being the values of the property at xmin and xmax, respectively. The position expressions must meet the constraint 0 ≤ xmin ≤ xmax ≤ 1. Linear interpolation is used to assign the values of the property at the segment centers that lie in the position range [xmin, xmax]. In this manner a continuously varying property can be approximated by a piecewise linear function. If the range variable is diameter, neither e1 nor e2 should be 0, or the corresponding axial resistance will be infinite. In our model neuron, the simple dendritic taper is specified by diam(0:1) = 10:3 and nseg = 5. This results in five segments that have centers at x = 0.1, 0.3, 0.5, 0.7, and 0.9 and diameters of 9.3, 7.9, 6.5, 5.1, and 3.7, respectively.

The value of a range variable at the center of a segment can appear in any expression using the syntax rangevar(x) in which 0 ≤ x ≤ 1. The value returned is the value at the center of the segment containing x, not the linear interpolation of the values stored at the centers of adjacent segments. If the parentheses are omitted, the position defaults to a value of 0.5 (middle of the section).

A special form of the for statement is available: for (var) stmt. For each value of the normalized position parameter x that defines the center
of each segment in the selected section (along with positions 0 and 1), this statement assigns var that value and executes the stmt. This hoc code would print the membrane potential as a function of physical position (in µm) along the axon:

axon for (x) print x*L, v(x)
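The segment-center rule can be seen directly by querying the tapered dendrite defined earlier (our own check, not one of the original listings):

dendrite[0] {
    print diam(0.1)   // 9.3, the value at the first node
    print diam(0.15)  // also 9.3: 0.15 lies in the same segment as 0.1
    print diam(0.3)   // 7.9, the value at the second node
}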
4.5 Specifying Geometry: Stylized versus 3D. There are two ways to specify section geometry. Our example uses the stylized method, which simply assigns values to section length and diameter. This is most appropriate when cable length and diameter are authoritative and 3D shape is irrelevant. If the model is based on anatomical reconstruction data (quantitative morphometry) or if 3D visualization is paramount, it is better to use the 3D method. This approach keeps the anatomical data in a list of (x, y, z, diam) "points." The first point is associated with the end of the section that is connected to the parent (this is not necessarily the zero end), and the last point is associated with the opposite end. There must be at least two points per section, and they should be ordered in terms of monotonically increasing arc length. This pt3d list, which is the authoritative definition of the shape of the section, automatically determines the length and diameter of the section.

When the pt3d list is nonempty, the shape model used for a section is a sequence of frusta. The pt3d points define the locations and diameters of the ends of these frusta. The effective area, diameter, and resistance of each segment are computed from this sequence of points by trapezoidal integration along the segment length. This takes into account the extra area introduced by diameter changes; even degenerate cones of 0 length can be specified (i.e., two points with same coordinates but different diameters), which add area but not length to the section. No attempt is made to deal with the effects of centroid curvature on surface area. The number of 3D points used to describe a shape has nothing to do with nseg and does not affect simulation speed.

4.6 Density Mechanisms and Point Processes. The insert statement assigns biophysical mechanisms, which govern electrical and (if present) chemical signals, to a section. Many sources of electrical and chemical signals are distributed over the membrane of the cell. These density mechanisms are described in terms of current per unit area and conductance per unit area; examples include voltage-gated ion channels such as the HH currents. However, density mechanisms are not the most appropriate representation of all signal sources. Synapses and electrodes are best described in terms of localized absolute current in nanoamperes and conductance in microsiemens. These are called point processes.
An object syntax is used to manage the creation, insertion, attributes, and destruction of point processes. For example, a current clamp (electrode for injecting a current) is created by declaring an object variable and assigning it a new instance of the IClamp object class. When a point process is no longer referenced by any object variable, the point process is removed from the section and destroyed. In our example, redeclaring stim with the statement objref stim would destroy the pulse stimulus, since no other object variable is referencing it.

The location of a point process can be changed with no effect on its other attributes. In our example, the statement dendrite[2] stim.loc(1) would move the current stimulus to the distal end of the third dendrite.

Many user-defined density mechanisms and point processes can be simultaneously present in each compartment of a neuron. One important difference between density mechanisms and point processes is that any number of the same kind of point process can exist at the same location.

User-defined density mechanisms and point processes can be linked into NEURON using the model description language NMODL. This lets the user focus on specifying the equations for a channel or ionic process without regard to its interactions with other mechanisms. The NMODL translator then constructs the appropriate C program, which is compiled and becomes available for use in NEURON. This program properly and efficiently computes the total current of each ionic species used, as well as the effect of that current on ionic concentration, reversal potential, and membrane potential. An extensive discussion of NMODL is beyond the scope of this article, but its major advantages can be listed succinctly.

1. Interface details to NEURON are handled automatically, and there are a great many such details. NEURON needs to know that model states are range variables and which model parameters can be assigned values and evaluated from the interpreter. Point processes need to be accessible via the interpreter object syntax, and density mechanisms need to be added to a section when the "insert" statement is executed. If two or more channels use the same ion at the same place, the individual current contributions need to be added together to calculate a total ionic current.

2. Consistency of units is ensured.

3. Mechanisms described by kinetic schemes are written with a syntax in which the reactions are clearly apparent. The translator provides tremendous leverage by generating a large block of C code that calculates the analytic Jacobian and the state fluxes.

4. There is often a great increase in clarity since statements are at the model level instead of the C programming level and are independent of the numerical method. For example, sets of differential and
User-defined density mechanisms and point processes can be linked into NEURON using the model description language NMODL. This lets the user focus on specifying the equations for a channel or ionic process without regard to its interactions with other mechanisms. The NMODL translator then constructs the appropriate C program, which is compiled and becomes available for use in NEURON. This program properly and efficiently computes the total current of each ionic species used, as well as the effect of that current on ionic concentration, reversal potential, and membrane potential. An extensive discussion of NMODL is beyond the scope of this article, but its major advantages can be listed succinctly.

1. Interface details to NEURON are handled automatically, and there are a great many such details. NEURON needs to know that model states are range variables and which model parameters can be assigned values and evaluated from the interpreter. Point processes need to be accessible via the interpreter object syntax, and density mechanisms need to be added to a section when the "insert" statement is executed. If two or more channels use the same ion at the same place, the individual current contributions need to be added together to calculate a total ionic current.

2. Consistency of units is ensured.

3. Mechanisms described by kinetic schemes are written with a syntax in which the reactions are clearly apparent (a fragment is sketched after this list). The translator provides tremendous leverage by generating a large block of C code that calculates the analytic Jacobian and the state fluxes.

4. There is often a great increase in clarity since statements are at the model level instead of the C programming level and are independent of the numerical method. For example, sets of differential and nonlinear simultaneous equations are written using an expression syntax such as

x' = f(x, y, t)
~ g(x, y) = h(x, y)
where the prime refers to the derivative with respect to time (multiple primes such as x'' refer to higher derivatives), and the tilde introduces an algebraic equation. The algebraic portion of such systems of equations is solved by Newton's method, and a variety of methods are available for solving the differential equations, such as Runge-Kutta or backward Euler.

5. Function tables can be generated automatically for efficient computation of complicated expressions.

6. Default initialization behavior of a channel can be specified.
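As a hedged illustration of point 3 (this is only a fragment, not a complete mechanism, and the state and rate names are invented), a reversible two-state scheme might be written in NMODL as:

PARAMETER {
    alpha = 1 (/ms)    : forward rate, illustrative value
    beta = 0.5 (/ms)   : backward rate, illustrative value
}
STATE { C O }          : closed and open states
BREAKPOINT {
    SOLVE scheme METHOD sparse
}
KINETIC scheme {
    ~ C <-> O (alpha, beta)
}

From the single reaction statement, the translator generates the C code for the corresponding state fluxes and Jacobian.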
4.7 Graphical Interface.

The user is not limited to operating within the traditional code-based command-mode environment. Among its many extensions to hoc, NEURON includes functions for implementing a fully graphical, windowed interface. Through this interface, and without having to write any code at all, the user can effortlessly create and arrange displays of menus, parameter value editors, graphs of parameters and state variables, and views of the model neuron. Anatomical views, called space plots, can be explored, revealing what mechanisms and point processes are present and where they are located.

The purpose of NEURON's graphical interface is to promote a match between what the user thinks is inside the computer and what is actually there. These visualization enhancements are a major aid to maintaining conceptual control over the simulation because they provide immediate answers to questions about what is being represented in the computer.

The interface has no provision for constructing neuronal topology, a conscious design choice based on the strong likelihood that a graphical toolbox for building neuronal topologies would find little use. Small models with simple topology are so easily created in hoc that a graphical topology editor is unnecessary. More complex models are too cumbersome to deal with using a graphical editor. It is best to express the topological specifications of complex stereotyped models through algorithms, written in hoc, that generate the topology automatically. Biologically realistic models often involve hundreds or thousands of sections, whose dimensions and interconnections are contained in large data tables generated by hours of painstaking quantitative morphometry. These tables are commonly read by hoc procedures that in turn create and connect the required sections without operator intervention.

The basic features of the graphical interface and how to use it to monitor and control simulations are discussed elsewhere (Moore & Hines, 1996).
However, several sophisticated analysis and simulation tools that have special utility for nerve simulation are worthy of mention:

• The Function Fitter optimizes a parameterized mathematical expression to minimize the least squared difference between the expression and data.

• The Run Fitter allows one to optimize several parameters of a complete neuron model to experimental data. This is most useful in the context of voltage clamp data that are contaminated by incomplete space clamp, or models that cannot be expressed in closed form, such as kinetic schemes for channel conductance.

• The Electrotonic Workbench plots small signal input and transfer impedance and voltage attenuation as functions of space and frequency (Carnevale, Tsai, & Hines, 1996). These plots include the neuromorphic (Carnevale, Tsai, Claiborne, & Brown, 1995) and L versus x (O'Boyle, Carnevale, Claiborne, & Brown, 1996) renderings of the electrotonic transformation (Brown et al., 1992; Tsai, Carnevale, Claiborne, & Brown, 1994; Zador, Agmon-Snir, & Segev, 1995). By revealing the effectiveness of signal transfer, the Workbench quickly provides insight into the "functional shape" of a neuron.

All interactions with these and other tools take place in the graphical interface, and no interpreter programming is needed to use them. However, they are constructed entirely within the interpreter and can be modified when special needs require.

4.8 Object-Oriented Syntax.

4.8.1 Neurons.

It is often convenient to deal with groups of sections that are related. Therefore NEURON provides a data class called a SectionList that can be used to identify subsets of sections. Section lists fit nicely with the "regular expression" method of selecting sections, used in earlier implementations of NEURON, in that a section list is easily constructed by using regular expressions to add and delete sections. After the list is constructed it is available for reuse, and it is much more efficient to loop over the sections in a section list than to pick out the sections accepted by a combination of regular expressions. This code

objref alldend
alldend = new SectionList()
forsec "dend" alldend.append()
forsec alldend print secname()
forms a list of all the sections whose names contain the string “dend” and then iterates over the list, printing the name of each section in it. For the example program presented here, this would generate the following output
in the NEURON interpreter window:

dendrite[0]
dendrite[1]
dendrite[2]
although in this very simple example it would clearly have been easy enough to loop over the array of dendrites directly, for example:

for i = 0,2 {
    dendrite[i] print secname()
}
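A section list built this way can also be used to operate on all of its members at once. As a hedged sketch (the conductance and reversal potential values are illustrative), one might give every dendrite a passive leak with a fragment like:

forsec alldend {
    insert pas        // NEURON's built-in passive channel
    g_pas = 0.001     // specific leak conductance (S/cm2)
    e_pas = -65       // leak reversal potential (mV)
}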
4.8.2 Networks.

To help the user manage very large simulations, the interpreter syntax has been extended to facilitate the construction of hierarchical objects. This is illustrated by the following code fragment, which specifies a pattern for a simple stylized neuron consisting of three dendrites connected to one end of a soma and an axon connected to the other end:

begintemplate Cell1
    public soma, dendrite, axon
    create soma, dendrite[3], axon
    proc init() {
        for i=0,2 connect dendrite[i](0), soma(0)
        connect axon(0), soma(1)
        axon insert hh
    }
endtemplate Cell1
Whenever a new instance of this pattern is created, the init() procedure automatically connects the soma, dendrite, and axon sections together. A complete pattern would also specify default membrane properties, as well as the number of segments for each section.

Names that can be referenced outside the pattern are listed in the public statement. In this case, since init is not in the list, the user could not reinitialize by calling the init() procedure. Public names are referenced through a dot notation.

The particular benefit of using templates ("classes" in standard object-oriented terminology) is the fact that they can be employed to create any number of instances of a pattern. For example,

objref cell[10][10]
for i=0,9 for j=0,9 cell[i][j] = new Cell1()
creates an array of one hundred objects of type Cell1 that can be referenced individually by the object variable cell. In this example, cell[4][5].axon.gnabar_hh(0.5) is the value of the maximum HH sodium conductance in the middle of the axon of cell[4][5].
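The same dot notation reaches any public name of any instance; as a hedged sketch (the dimensions are invented), one could set the soma geometry of every cell in the array:

for i=0,9 for j=0,9 {
    cell[i][j].soma.L = 20      // soma length (um)
    cell[i][j].soma.diam = 20   // soma diameter (um)
}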
As this example implies, templates offer a natural syntax for the creation of networks. However, it is entirely up to the user to organize the templates logically so that they appropriately reflect the structure of the problem. Generally, any given structural organization can be viewed as a hierarchy of container classes, such as cells, microcircuits, layers, or networks. The important issue is how much effort is required for the concrete network representation to support a range of logical views of the same abstract network. A logical view that organizes the cells differently may not be easy to compute if the network is built as an elaborate hierarchy. This kind of pressure tends to encourage relatively flat organizations that make it easier to implement functions that search for specific information. The bottom line is that network simulation design remains an ad hoc process that requires careful programming judgment.

One very important class of logical views that is not generally organizable as a hierarchy is that of synaptic organization. In connecting cells with synapses, one is often driven to deal with general graphs, which is to say no structure at all. In addition to the notions of classes and objects (a synapse is an object with a pre- and a postsynaptic logical connection), the interpreter offers one other fundamental language feature that can be useful in dealing with objects that are collections of other objects: the notion of iterators, taken from the Sather programming language (Murer, Omohundro, Stoutamire, & Szyperski, 1996). This is a separation of the process of iteration from that of "what is to be done for each item." If a programmer implements one or more iterators in a collection class, the user of the class does not need to know how the class indexes its items. Instead the class will return each item in turn for execution in the context of the loop body. This allows the user to write

for layer1.synapses(syn, type) {
    // statements that manipulate the object reference named "syn"
    // (the iterator causes "syn" to refer, in turn, to each synapse
    // of a certain type in the layer1 object)
}
without being aware of the possibly complicated process of picking out these synapses from the layer (that is the responsibility of the author of the class of which layer1 is an instance).

Sadly, these kinds of language features, though very useful, do not impose any policy with regard to the design decisions that users must make in building their networks. Different programmers express very different designs on the same language base, with the consequence that it is more often than not infeasible to reconcile slightly different representations of even very similar concepts.
An example of a useful way to deal uniformly with the issue of synaptic connectivity is the policy implemented in NEURON by Lytton (1996). This implementation uses the normal NMODL methodology to define a synaptic conductance model and encloses it within a framework that manages network connectivity.

5 Summary

The recent striking expansion in the use of simulation tools in the field of neuroscience has been encouraged by the rapid growth of quantitative observations, which both stimulate and constrain the formulation of new hypotheses of neuronal function, and enabled by the availability of ever-increasing computational power at low cost. These factors have motivated the design and implementation of NEURON, the goal of which is to provide a powerful and flexible environment for simulations of individual neurons and networks of neurons. NEURON has special features that accommodate the complex geometry and nonlinearities of biologically realistic models, without interfering with its ability to handle more speculative models that involve a high degree of abstraction.

As we note in this article, one particularly advantageous feature is that the user can specify the physical properties of a cell without regard for the strictly computational concern of how many compartments are employed to represent each of the cable sections. In a future publication, we will examine how the NMODL translator is used to define new membrane channels and calculate ionic concentration changes. Another will describe the Vector class. In addition to providing very efficient implementations of frequently needed operations on lists of numbers, the Vector class offers a great deal of programming leverage, especially in the management of network models.

NEURON source code, executables, and documents are available at http://neuron.duke.edu and http://www.neuron.yale.edu, and by ftp from ftp.neuron.yale.edu.

Acknowledgments

We thank John Moore, Zach Mainen, Bill Lytton, David Jaffe, and the many other users of NEURON for their encouragement, helpful suggestions, and other contributions. This work was supported by NIH grant NS 11613 (Computer Methods for Physiological Problems) to M.L.H. and by the Yale Neuroengineering and Neuroscience Center.

References

Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Nat. Acad. Sci., 88, 11569–11573.
Brown, T. H., Zador, A. M., Mainen, Z. F., & Claiborne, B. J. (1992). Hebbian computations in hippocampal dendrites and spines. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single Neuron Computation (pp. 81–116). San Diego: Academic Press.
Carnevale, N. T., & Rosenthal, S. (1992). Kinetics of diffusion in a spherical cell: I. No solute buffering. J. Neurosci. Meth., 41, 205–216.
Carnevale, N. T., Tsai, K. Y., Claiborne, B. J., & Brown, T. H. (1995). The electrotonic transformation: A tool for relating neuronal form to function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems (vol. 7, pp. 69–76). Cambridge, MA: MIT Press.
Carnevale, N. T., Tsai, K. Y., & Hines, M. L. (1996). The Electrotonic Workbench. Society for Neuroscience Abstracts, 22, 1741.
Cauller, L. J., & Connors, B. W. (1992). Functions of very distal dendrites: Experimental and computational studies of layer I synapses on neocortical pyramidal cells. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single Neuron Computation (pp. 199–229). San Diego: Academic Press.
Destexhe, A., Babloyantz, A., & Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay neurons. Biophys. J., 65, 1538–1552.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994). A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiol., 72, 803–818.
Destexhe, A., Contreras, D., Steriade, M., Sejnowski, T. J., & Huguenard, J. R. (1996). In vivo, in vitro and computational analysis of dendritic calcium currents in thalamic reticular neurons. J. Neurosci., 16, 169–185.
Destexhe, A., McCormick, D. A., & Sejnowski, T. J. (1993). A model for 8–10 Hz spindling in interconnected thalamic relay and reticularis neurons. Biophys. J., 65, 2474–2478.
Destexhe, A., & Sejnowski, T. J. (1995). G-protein activation kinetics and spillover of GABA may account for differences between inhibitory responses in the hippocampus and thalamus. Proc. Nat. Acad. Sci., 92, 9515–9519.
Häusser, M., Stuart, G., Racca, C., & Sakmann, B. (1995). Axonal initiation and active dendritic propagation of action potentials in substantia nigra neurons. Neuron, 15, 637–647.
Hines, M. (1984). Efficient computation of branched nerve equations. Int. J. Bio-Med. Comput., 15, 69–76.
Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Bio-Med. Comput., 24, 55–68.
Hines, M. (1993). NEURON—a program for simulation of nerve equations. In F. Eeckman (Ed.), Neural Systems: Analysis and Modeling (pp. 127–136). Norwell, MA: Kluwer.
Hines, M. (1994). The NEURON simulation program. In J. Skrzypek (Ed.), Neural Network Simulation Environments (pp. 147–163). Norwell, MA: Kluwer.
Hines, M., & Carnevale, N. T. (1995). Computer modeling methods for neurons. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (pp. 226–230). Cambridge, MA: MIT Press.
Hines, M., & Shrager, P. (1991). A computational test of the requirements for conduction in demyelinated axons. J. Restor. Neurol. Neurosci., 3, 81–93.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Hsu, H., Huang, E., Yang, X.-C., Karschin, A., Labarca, C., Figl, A., Ho, B., Davidson, N., & Lester, H. A. (1993). Slow and incomplete inactivations of voltage-gated channels dominate encoding in synthetic neurons. Biophys. J., 65, 1196–1206.
Jaffe, D. B., Ross, W. N., Lisman, J. E., Miyakawa, H., Lasser-Ross, N., & Johnston, D. (1994). A model of dendritic Ca2+ accumulation in hippocampal pyramidal neurons based on fluorescence imaging experiments. J. Neurophysiol., 71, 1065–1077.
Kernighan, B. W., & Pike, R. (1984). Appendix 2: Hoc manual. In B. W. Kernighan & R. Pike (Eds.), The UNIX Programming Environment (pp. 329–333). Englewood Cliffs, NJ: Prentice Hall.
Lindgren, C. A., & Moore, J. W. (1989). Identification of ionic currents at presynaptic nerve endings of the lizard. J. Physiol., 414, 210–222.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Computation, 8, 501–509.
Lytton, W. W., Destexhe, A., & Sejnowski, T. J. (1996). Control of slow oscillations in the thalamocortical neuron: A computer model. Neurosci., 70, 673–684.
Lytton, W. W., & Sejnowski, T. J. (1992). Computer model of ethosuximide's effect on a thalamic neuron. Ann. Neurol., 32, 131–139.
Mainen, Z. F., Joerges, J., Huguenard, J., & Sejnowski, T. J. (1995). A model of spike initiation in neocortical pyramidal neurons. Neuron, 15, 1427–1439.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mascagni, M. V. (1989). Numerical methods for neuronal modeling. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling (pp. 439–484). Cambridge, MA: MIT Press.
Moore, J. W., & Hines, M. (1996). Simulations with NEURON 3.1, on-line documentation in HTML format, available at http://neuron.duke.edu.
Murer, S., Omohundro, S. M., Stoutamire, D., & Szyperski, C. (1996). Iteration abstraction in Sather. ACM Transactions on Programming Languages and Systems, 18, 1–15.
O'Boyle, M. P., Carnevale, N. T., Claiborne, B. J., & Brown, T. H. (1996). A new graphical approach for visualizing the relationship between anatomical and electrotonic structure. In J. M. Bower (Ed.), Computational Neuroscience: Trends in Research 1995. San Diego: Academic Press.
Rall, W. (1964). Theoretical significance of dendritic tree for input-output relation. In R. F. Reiss (Ed.), Neural Theory and Modeling (pp. 73–97). Stanford: Stanford University Press.
Rall, W. (1989). Cable theory for dendritic neurons. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling (pp. 8–62). Cambridge, MA: MIT Press.
Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neurosci., 58, 13–41.
Stewart, D., & Leyk, Z. (1994). Meschach: Matrix computations in C. In Proceedings of the Centre for Mathematics and Its Applications (Vol. 32). Canberra, Australia: School of Mathematical Sciences, Australian National University.
Traynelis, S. F., Silver, R. A., & Cull-Candy, S. G. (1993). Estimated conductance of glutamate receptor channels activated during EPSCs at the cerebellar mossy fiber-granule cell synapse. Neuron, 11, 279–289.
Tsai, K. Y., Carnevale, N. T., & Brown, T. H. (1994). Hebbian learning is jointly controlled by electrotonic and input structure. Network, 5, 1–19.
Tsai, K. Y., Carnevale, N. T., Claiborne, B. J., & Brown, T. H. (1994). Efficient mapping from neuroanatomical to electrotonic space. Network, 5, 21–46.
Zador, A. M., Agmon-Snir, H., & Segev, I. (1995). The morphoelectrotonic transform: A graphical approach to dendritic function. J. Neurosci., 15, 1669–1682.
Received September 26, 1996; accepted January 10, 1997.
Communicated by Jürgen Schmidhuber
On Bias Plus Variance

David H. Wolpert
IBM Almaden Research Center, San Jose, CA 95120, U.S.A.
This article presents several additive corrections to the conventional quadratic loss bias-plus-variance formula. One of these corrections is appropriate when both the target is not fixed (as in Bayesian analysis) and training sets are averaged over (as in the conventional bias-plus-variance formula). Another additive correction casts conventional fixed-training-set Bayesian analysis directly in terms of bias plus variance. Another correction is appropriate for measuring full generalization error over a test set rather than (as with conventional bias plus variance) error at a single point. Yet another correction can help explain the recent counterintuitive bias-variance decomposition of Friedman for zero-one loss. After presenting these corrections, this article discusses some other loss-function-specific aspects of supervised learning. In particular, there is a discussion of the fact that if the loss function is a metric (e.g., zero-one loss), then there is a bound on the change in generalization error accompanying changing the algorithm's guess from h1 to h2, a bound that depends only on h1 and h2 and not on the target. This article ends by presenting versions of the bias-plus-variance formula appropriate for logarithmic and quadratic scoring, and then all the additive corrections appropriate to those formulas. All the correction terms presented are a covariance, between the learning algorithm and the posterior distribution over targets. Accordingly, in the (very common) contexts in which those terms apply, there is not a "bias-variance trade-off" or a "bias-variance dilemma," as one often hears. Rather there is a bias-variance-covariance trade-off.
1 Introduction

The bias-plus-variance formula (Geman et al., 1992) is an extremely powerful tool for analyzing supervised learning scenarios that have quadratic loss functions, fixed targets (i.e., "truths"), and averages over the training sets. Indeed, there is little doubt that it is the most frequently cited formula in the statistical literature for analyzing such scenarios.

Despite this breadth of utility in such scenarios, the bias-plus-variance formula has never been extended to other learning scenarios. In this article an additive correction to the formula is presented, appropriate for learning scenarios where the target is not fixed. The associated
formula for expected loss constitutes a midway point between Bayesian analysis and conventional bias-plus-variance analysis, in that both targets and training sets are averaged over.

After presenting this correction, other correction terms are presented, appropriate for when other sets of random variables are averaged over. In particular, the correction term to the bias-plus-variance formula for when the test set point is not fixed (as it is not in almost all of computational learning theory, as well as most other investigations of "generalization error") is presented. (The conventional bias-plus-variance formula has the test point fixed.) In addition, it is shown how to cast expected loss in conventional Bayesian analysis (where the training set is fixed but targets are averaged over) directly in terms of bias plus variance. All of this emphasizes that the conventional bias-plus-variance decomposition is only a very specialized case of a much more general and important phenomenon.

Next is a brief discussion of some other loss-function-specific properties of supervised learning. In particular, it is shown how with quadratic loss there is a scheme that assuredly, independent of the target, improves the performance of any learning algorithm with a random component. On the other hand, using the same scheme for concave loss functions results in assured degradation of performance. It is also shown that, without any concern for the target, one can bound the change in zero-one loss generalization error associated with making some guess h1 rather than a different guess h2. (This is not possible for quadratic loss.)

All of the extensions mentioned above to the conventional version of the bias-plus-variance formula use the same quadratic loss function occurring in the conventional formula itself. That loss function is often appropriate when the output space is numeric. Kong and Dietterich (1995) recently proposed an extension of the conventional (fixed-target, training set-averaged) formula to the zero-one loss function. (See also Wolpert & Kohavi, 1997, and Kohavi & Wolpert, 1996.) That loss function is often appropriate when the output space is categorical rather than numeric.

For such categorical output spaces, sometimes one's algorithm produces a guessed probability distribution over the output space rather than a single guessed output value. In those scenarios, "scoring rules" are usually a more appropriate form of measuring generalization performance than are loss functions. This article ends by presenting extensions of the fixed-target version of the bias-plus-variance formula to the logarithmic and quadratic scoring rules, and then presents the associated additive corrections to those formulas. ("Scoring rules" as explored in this article are similar to what are called "score functions" in the statistics literature; Bernardo & Smith, 1994.)

All of the correction terms presented are a covariance, between the learning algorithm and the posterior distribution over targets. Accordingly, in the (very common) contexts in which they apply, there is not a "bias-variance trade-off" or a "bias-variance dilemma," as one often hears. Rather there is a bias-variance-covariance trade-off.
Section 2 presents the formalism that will be used in the rest of the article. Section 3 uses this formalism to recapitulate the traditional bias-plus-variance formula. Certain desiderata that the terms in the bias-plus-variance decomposition should meet are also presented there. Sections 4 and 5 present the corrections to this formula appropriate for quadratic loss. Section 6 discusses some other loss-function-specific aspects of supervised learning.

Recently Friedman (1996) drew attention to an important aspect of expected error for zero-one loss. His analysis appeared to indicate that under certain circumstances, when (what he identified as) the variance increased, it could result in decreased generalization error. In section 7, it is shown how to perform a bias-variance decomposition for Friedman's scenario where commonsense characteristics of bias and variance (like error being an increasing function of each) are preserved. This discussion serves as a useful illustration of the utility of covariance terms, since it is the presence of that term that explains Friedman's apparently counterintuitive results.

Section 8 begins by presenting the extensions of the bias-plus-variance formula for scoring-rule-based rather than loss-function-based error. That section then investigates logarithmic scoring. Section 9 investigates quadratic scoring. Finally, section 10 discusses future work.

2 Nomenclature

This article uses the extended Bayesian formalism (EBF) (Wolpert, 1994, 1995, 1996a, 1996b). In the current context, the EBF is just conventional probability theory, applied to the case where one has a different random variable for the hypothesis output by the learning algorithm and for the target relationship. It is this crucial extension that separates the EBF from conventional Bayesian analysis and allows the EBF (unlike conventional Bayesian analysis) to subsume all other major mathematical treatments of supervised learning, like computational learning theory and sampling theory statistics. (See Wolpert, 1994.)

This section presents a synopsis of the EBF. A quick reference for this synopsis can be found in Table 1. Readers unsure of any aspects of this synopsis, and in particular unsure of any of the formal basis of the EBF or justifications for any of its assumptions, are directed to the detailed exposition of the EBF in Appendix A of Wolpert (1996a).

2.1 Overview.

The input and output spaces are X and Y, respectively. For simplicity, they are taken to be finite. (This imposes no restrictions on the real-world utility of the results in this article, since in the real world data are always analyzed on a finite digital computer and are based on the output of instruments having a finite number of possible readings.) The two spaces contain n and r elements, respectively. A generic element of X is indicated by x, and a generic element of Y is indicated by y.
Table 1: Summary of the Terms in the EBF.

The sets X and Y, of sizes n and r            The input and output space, respectively
The set d, of m X-Y pairs                     The training set
The X-conditioned distribution over Y, f      The target, used to generate test sets
The X-conditioned distribution over Y, h      The hypothesis, used to guess for test sets
The real number c                             The cost
The X-value q                                 The test set point
The Y-value y_F                               The sample of the target f at point q
The Y-value y_H                               The sample of the hypothesis h at point q
P(h | d)                                      The learning algorithm
P(f | d)                                      The posterior
P(d | f)                                      The likelihood
P(f)                                          The prior

Note: If c = L(y_F, y_H), L(·, ·) is the "loss function." Otherwise c is given by a "scoring rule."
Sometimes (e.g., when requiring a Bayes-optimal algorithm to guess an expected Y value) it will implicitly be assumed that Y is a large set of real numbers that are very close to one another, so that there is no significant difference between the element in Y closest to some real number ψ and ψ itself. The lowercase Greek delta (δ) indicates either the Kronecker or Dirac delta function, as appropriate.

Random variables are indicated using capital letters. Associated instantiations of a random variable are indicated using lowercase letters. Note, though, that some quantities (e.g., the space X) are neither random variables nor instantiations of random variables, and therefore their written case carries no significance. Only rarely will it be necessary to refer to a random variable rather than an instantiation of it. In particular, whenever possible, the argument of a probability distribution will be taken to indicate the associated random variable (e.g., whenever possible, P(a) will be written rather than P_A(a)). In accord with standard statistics notation, E(A | b) will be used to mean the expectation value of A given B = b, that is, to mean ∫ da a P(a | b). (Sums replace integrals if appropriate.)

The primary random variables are the hypothesis X-Y relationship output by the learning algorithm (indicated by H), the target (i.e., "true") X-Y relationship (F), the training set (D), and the real-world cost (C). These variables are related to one another through other random variables representing the (test set) input space value (Q), and the associated target and hypothesis Y-values, Y_F and Y_H, respectively (with instantiations y_F and y_H, respectively).
This completes the list of random variables. Formal definitions of them appear below. As an example of the relationship between these random variables and supervised learning, f, a particular instantiation of a target, could refer to a "teacher" neural net together with superimposed noise. This noise-corrupted neural net generates the training set d. The hypothesis h, on the other hand, could be the neural net made by one's "student" algorithm after training on d. Then q would be an input element of the test set, y_F and y_H associated samples of the outputs of the two neural nets for that element (the sampling of y_F including the effects of the superimposed noise), and c the resultant "cost" (e.g., c could be (y_F − y_H)²).

2.2 Training Sets and Targets.

m is the number of elements in the (ordered) training set d. {d_X(i), d_Y(i)} is the set of m input and output values in d. m′ is the number of distinct values in d_X.

Targets f are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function f(x ∈ X, y ∈ Y) (i.e., P(y_F | f, q) = f(q, y_F)). Equivalently, where S_r is defined as the r-dimensional unit simplex, targets can be viewed as mappings f: X → S_r. Note that any such target is a finite set of real numbers indexed by an X value and a Y value.

Any restrictions on f are imposed by the full joint distribution P(f, h, d, c), and in particular by its marginalization, P(f). Note that any output noise process is automatically reflected in P(y_F | f, q). Note also that the equality P(y_F | f, q) = f(q, y_F) only directly refers to the generation of test set elements; in general, training set elements can be generated from targets in a different manner (for example, if training and testing have different noise levels).

The "likelihood" is P(d | f). It says how d was generated from f. As an example, the conventional independent identically distributed (IID) likelihood is

P(d | f) = Π_{i=1}^{m} π(d_X(i)) f(d_X(i), d_Y(i))

(where π(x) is the "sampling distribution"). In other words, under this likelihood, d is created by repeatedly and independently choosing an input value d_X(i) by sampling π(·), and then choosing an associated output value by sampling f(d_X(i), ·), the same distribution used to generate test set outputs. (None of the results in this article depend on the choice of the likelihood.)

The term "posterior" usually means P(f | d), and the term "prior" usually means P(f).
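As a toy illustration of the IID likelihood (all numbers here are invented): let X = {0, 1}, Y = {0, 1}, let the sampling distribution be uniform, π(x) = 1/2, and let d consist of the m = 2 pairs (0, 1) and (1, 1). Then

P(d | f) = π(0) f(0, 1) · π(1) f(1, 1) = (1/4) f(0, 1) f(1, 1),

which equals 1/4 for the noise-free target with f(x, 1) = 1 for all x, and 1/16 for the maximally noisy target with f(x, y) = 1/2 for all x and y.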
2.3 The Learning Algorithm.

Hypotheses h are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function h(x ∈ X, y ∈ Y) (i.e., P(y_H | h, q) = h(q, y_H)). Equivalently, where S_r is defined as the r-dimensional unit simplex, hypotheses can be viewed as mappings h: X → S_r. Note that any such hypothesis is a finite set of real numbers.
Any restrictions on h are imposed by P(f, h, d, c). Here and throughout, a "single-valued" distribution is one that, for a given x, is a delta function about some y. Such a distribution is a single-valued function from X to Y. As an example, if one is using a neural net as one's regression through the training set, usually the (neural net) h is single valued. On the other hand, when one is performing probabilistic classification (as in softmax), h is not single valued.

The generalization behavior of any learning algorithm (aka "generalizer") is completely specified by P(h | d), although writing down a learning algorithm's P(h | d) explicitly is often quite difficult. A learning algorithm is "deterministic" if the same d always gives the same h. Backpropagation with a random initial weight is not deterministic. Nearest neighbor is.

The learning algorithm sees only the training set d, and in particular does not directly see the target. So P(h | f, d) = P(h | d), which means that P(h, f | d) = P(h | d) × P(f | d), and therefore P(f | h, d) = P(h, f | d)/P(h | d) = P(f | d).

By definition of f, in supervised learning, Y_F and Y_H are conditionally independent given f and q: P(y_F, y_H | f, q) = P(y_F | y_H, f, q) P(y_H | f, q) = P(y_F | f, q) P(y_H | f, q). Similarly, Y_F and Y_H are conditionally independent given d and q.

Proof. P(y_F, y_H | d, q) = P(y_F | d, q) P(y_H | d, q, y_F)
= P(y_F | d, q) ∫ dh P(y_H | h, d, q, y_F) P(h | d, q, y_F)
= P(y_F | d, q) ∫ dh P(y_H | h, q) P(h | d)
= P(y_F | d, q) ∫ dh P(y_H | h, d, q) P(h | d, q)
= P(y_F | d, q) P(y_H | d, q).

2.4 The Cost and "Generalization Error."

Given values of F, H, and a test set point q ∈ X, the associated cost or error is indicated by the random variable C. Often C is a loss function and can be expressed in terms of a mapping L taking Y × Y to a real number. Formally, in these cases the probability that C takes on the value c, conditioned on given values h, f, and q, is

P(c | f, h, q) = Σ_{y_H,y_F} P(c | y_H, y_F) P(y_H, y_F | f, h, q) = Σ_{y_H,y_F} δ{c − L(y_H, y_F)} h(q, y_H) f(q, y_F).¹

As an example, quadratic loss has L(y_H, y_F) = (y_H − y_F)², so E(C | f, h, q) = Σ_{y_H,y_F} f(q, y_F) h(q, y_H)(y_H − y_F)².

¹ Note that if L(·, ·) is a symmetric function of its arguments, this expression for P(C | f, h, q) is a non-Euclidean inner product between the (Y-indexed) vectors f(q, ·) and h(q, ·). Such inner products generically arise in response to conditional independencies among the random variables. In the case here, it arises due to the fact that P(y_H | y_F, h, f, q) = P(y_H | h, q). Similarly, in Wolpert (1995) it is shown that because P(h | d, f) = P(h | d), E(C | d) is a non-Euclidean inner product between the (H-indexed) vector P(h | d) and the (F-indexed) vector P(f | d); your expected cost is set by how "aligned" your learning algorithm P(h | d) is with the actual posterior, P(f | d).

Generically, when the distribution of c given f, h, and q cannot be reduced in this way to a loss function from Y × Y to R, it will be referred to as a scoring rule. Scoring rules are often appropriate when you are trying to guess a
distribution over Y, and loss functions are usually appropriate when you are trying to guess a particular value in Y.

As an example, for "logarithmic scoring," P(c | f, h, q) = δ{c − Σ_y f(q, y) ln[h(q, y)]}. This cost is the logarithm of the probability one would assign to an infinite data set generated according to the target f, if one had assumed (erroneously) that it was actually generated according to the hypothesis h.

The "generalization error function" used in much of supervised learning is given by c′ ≡ E(C | f, h, d). It is the average over all q of the cost c, for a given target f, hypothesis h, and training set d.

2.5 Miscellaneous.

Note the implicit rule of probability theory that any random variable not conditioned on is marginalized over. For example (using the conditional independencies in conventional supervised learning), expected cost given the target, training set size, and test set point is given by

E(C | f, m, q) = ∫ dh Σ_d E(C | h, d, f, q) P(h | d, f, q, m) P(d | f, q, m)
= ∫ dh Σ_d E(C | f, h, q) P(h | d) P(d | f, q, m)
= ∫ dh E(C | f, h, q) × {Σ_d P(h | d) P(d_Y | f, d_X) P(d_X | f, m, q)}.

(I do not equate P(d_X | f, q, m) with P(d_X | m), as is conventionally (though implicitly) done in most theoretical supervised learning, because in general the test set point may be coupled to d_X and even f. See Wolpert, 1995.)

3 Bias Plus Variance for Quadratic Loss

3.1 The Bias-Plus-Variance Formula.

This section reviews the conventional bias-plus-variance formula for quadratic loss, with a fixed target and averages over training sets. Write

E(Y_F | f, q) = Σ_y y f(q, y)
E(Y_F² | f, q) = Σ_y y² f(q, y)
E(Y_H | f, q, m) = ∫ dh Σ_d P(d | f, q, m) P(h | d) Σ_y y h(q, y)
E(Y_H² | f, q, m) = ∫ dh Σ_d P(d | f, q, m) P(h | d) Σ_y y² h(q, y),

where for succinctness the m-conditioning in the expectation values is not indicated if the expression is independent of m. These are, in order, the
average Y and Y² values of the target (at q) and the average of the hypotheses made in response to training sets generated from the target (again, evaluated at q). Note that these averages need not exist in Y in general. For example, this is almost always the case if Y is binary.

Now write C = (Y_H − Y_F)². Then simple algebra (use the conditional independence of Y_H and Y_F) verifies the following formula:

E(C | f, m, q) = σ²_{f,m,q} + (bias_{f,m,q})² + variance_{f,m,q},    (3.1)

where

σ²_{f,m,q} ≡ E(Y_F² | f, q) − [E(Y_F | f, q)]²,
bias_{f,m,q} ≡ E(Y_F | f, q) − E(Y_H | f, q, m),
variance_{f,m,q} ≡ E(Y_H² | f, q, m) − [E(Y_H | f, q, m)]².
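To see the decomposition at work on invented numbers: suppose that at the point q the target has E(Y_F | f, q) = 1 and E(Y_F² | f, q) = 1.25, while over training sets the algorithm's guesses have E(Y_H | f, q, m) = 0.5 and E(Y_H² | f, q, m) = 0.5. Then σ²_{f,m,q} = 1.25 − 1 = 0.25, bias_{f,m,q} = 1 − 0.5 = 0.5, and variance_{f,m,q} = 0.5 − 0.25 = 0.25, so equation 3.1 gives E(C | f, m, q) = 0.25 + 0.25 + 0.25 = 0.75. The same number follows directly from E(C | f, m, q) = E(Y_H² | f, q, m) − 2 E(Y_H | f, q, m) E(Y_F | f, q) + E(Y_F² | f, q) = 0.5 − 1 + 1.25 = 0.75, where the cross term factors by the conditional independence of Y_H and Y_F.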
The subscript {f, m, q} indicates the conditioning event for the expected error and will become important below. When the conditioning event is clear or not important, bias_{f,m,q} may be referred to simply as the bias, and similarly for the variance and the noise terms.

The bias-variance formula in Geman et al. (1992) is a special case of equation 3.1, where the learning algorithm always guesses the same h given the same training set d (something that is not the case for backpropagation with a random initial weight, for example). In addition, in Geman et al. (1992) the hypothesis h that the learning algorithm guesses is always a single-valued mapping from X to Y.

Note that essentially no assumptions are made in deriving equation 3.1. Any likelihood is allowed, any learning algorithm, any relationship between q and f and/or d, and so on. This will be true for all of the analysis in this article.

In addition to such generality, the utility of the bias-plus-variance formula lies in the fact that often there is a "bias-variance trade-off." For example, it may be that a modification to a learning algorithm improves its bias for the target at hand. (This often happens when more free parameters are incorporated into the learning algorithm's model, for example.) But this is often at the expense of increased variance.

In addition, the terms in the bias-plus-variance formula all involve (functions of) expectation values of the fundamental random variables described in the previous section; no new random variables are involved. This means that the bias-plus-variance formula is particularly intuitive and easy to interpret, as the discussion in the next subsection illustrates.

3.2 Desiderata Obeyed by the Terms in the Quadratic Loss Bias-Plus-Variance Formula.

To facilitate the generalization of the bias-plus-variance
formula, define z ≡ {f, m, q}, the set of values of random variables we are conditioning on in equation 3.1. Then intuitively, in equation 3.1, for the point q:

1. σ²_z measures the intrinsic error due to the target f, independent of the learning algorithm. For the current choice of z it is given by E(C | h, z)/2 for h = f; that is, it equals half the expected loss of f at "guessing itself" at the point q.

2. The bias measures the difference between the average Y_H and the average Y_F (where Y_F is formed by sampling f and Y_H is formed by sampling h's created from d's that are in turn created from f).

3. Alternatively, σ²_{f,m,q} plus the squared bias measures the expected loss between Y_F and the average Y_H, E((Y_F − [E(Y_H | z)])² | z).

4. The variance reflects the variability of the guessed y_H about the average y_H as one varies over training sets (generated according to the given fixed value of z). If the learning algorithm always guesses the same h for the same d, and that h is always a single-valued function from X to Y, then the variance is given directly by the variability of the learning algorithm's guess as d is varied.

5. It is worth pointing out one special property for when z = {f, m, q}. For that case the variance does not depend on f directly, but only indirectly through the induced distribution over training sets, P(d | f, m, q). So, for example, consider having the target changed in such a way that the resultant distribution P(d | f, m, q) over training sets does not change. (For instance, this would be the case if there were negligibly small probability that the q at hand exists in the training set, perhaps because n ≫ m.) Then the variance term does not change. This is desirable because we wish the intrinsic noise term to reflect the target alone, the bias to reflect the target's relation with the learning algorithm, and the variance term to reflect the learning algorithm alone. In particular, for the usual intuitive characterization of variance to hold, we want it to reflect how sensitive the algorithm is to changes in the data set, regardless of the target.

Note that although σ²_z appears to be identical to variance_z if one simply replaces Y_F with Y_H, the two quantities have different kinds of relations with the other random variables. For example, for z = {f, m, q}, variance_z depends on the target as well as the learning algorithm, whereas σ²_z depends only on the target.

So expected quadratic loss reflects noise in the target, plus the difference between the target and the average guess, plus the variability in the guessing.
In particular, we have the following properties, which can be viewed as desiderata for our three terms:

a. If f is a delta function in Y for the q at hand (i.e., if at the point X = q, f is a single-valued function from X to Y), the intrinsic noise term equals 0. In addition, the intrinsic noise term is independent of the learning algorithm. Finally, the intrinsic noise term is a lower bound on the error; for no learning algorithm can E(C | z) be lower than the intrinsic noise term. (In fact, for the decomposition in equation 3.1, the intrinsic noise term is the greatest lower bound on that error.)

b. If the average hypothesis-determined guess equals the average target-determined "guess," then bias_z = 0. Bias_z is large if the difference between those averages is large.

c. Variance is nonnegative, equals 0 if the guessed h is always the same single-valued function (independent of d), and is large when the guessed h varies greatly in response to changes in d.

d. The variance does not depend on z directly, but only indirectly through the induced distribution P(h | z). For any z, the associated variance is set by the h dependence of P(h | z). Now P(h | z) = Σ_d P(h | d, z) P(d | z). Moreover, it is often the case that P(h | d, z) = P(h | d). (For example, this holds for z = {f, m, q}.) In such a case, presuming one knows the learning algorithm (i.e., one knows P(h | d)), the h dependence of P(h | z) is set by the d dependence of P(d | z). This means that changes to z that do not affect the induced distribution over d do not affect the variance. As mentioned above, this is needed if the variance term is only to reflect how sensitive the algorithm is to changes in the training set.

e. All the terms in the decomposition are continuous functions of the target. This is particularly desirable when one wishes to estimate those terms from limited information concerning the target (e.g., from a finite data set). If there were discontinuities and the target were near such a discontinuity, the resultant estimates would often be poor. More generally, it would be difficult to ascribe the usual intuitive meanings to the terms in the decomposition if they were discontinuous functions of the target.

Desiderata a through e are somewhat more general than conditions 1 through 4, in that (for example) they are meaningful even if Y is a nonnumeric space, so that expressions like "E(Y_F | z)" are not defined. Accordingly, I will rely on them more than on conditions 1 through 4 in the extensions of the bias-plus-variance formula presented below.

In the final analysis though, both the conditions 1 through 4 and the desiderata a through e are not God-given principles that any bias-plus-variance decomposition must obey. Rather, they are useful aspects of the
decomposition that facilitate that decomposition's "intuitive and easy" interpretation. There is nothing that precludes one's using slight variants of these conditions, or perhaps even replacing them altogether.

The next two sections show how to generalize the bias-plus-variance formula to other conditioning events besides z = {f, m, q} while still obeying (almost all of) conditions 1 through 4 and desiderata a through c. First, in section 4 the generalization to z = {m, q} is presented. Then in section 5, the generalization to arbitrary z is explored.

4 The Midway Point Between Bayesian Analysis and Bias Plus Variance

4.1 Bias Plus Variance for Averaging over Targets: The Covariance Correction.

It is important to realize that, illustrative as it is, the bias-plus-variance formula examines the wrong quantity. In the real world, it is almost never E(C | f, m) that is directly of interest, but rather E(C | d). (We know d and therefore can fix its value in the conditioning event. We do not know f.) Analyzing E(C | d) is the purview of Bayesian analysis (Buntine & Weigend, 1991; Bernardo & Smith, 1994). Generically, it says that for quadratic loss, one should guess the posterior average y (Wolpert, 1995).

As conventionally discussed, E(C | d) does not bear any connection to the bias-plus-variance formula. However, there is a midway point between Bayesian analysis and the kind of analysis that results in the bias-plus-variance formula. In this middle approach, rather than fix f as in bias plus variance, one averages over it, as in the Bayesian approach. In this way one circumvents the annoying fact that (common lore to the contrary) there need not always be a bias-variance trade-off, in that there exists an algorithm with both zero bias_{f,m,q} and zero variance_{f,m,q}: the algorithm that always "by luck" guesses y_H = E(Y_F | f, q), independent of d. (Indeed, Breiman's (1994) bagging scheme is usually justified as a way to try to estimate that "lucky" algorithm.) Since in this middle approach f is not fixed, such a lucky guess is undefined. In addition, in this middle approach, rather than fix d as in the Bayesian approach, one averages over d, as in bias plus variance. In this way one maintains the illustrative power of the bias-plus-variance formula.

The result of adopting this middle approach is the following Bayesian correction to the quadratic loss bias-plus-variance formula (Wolpert, 1995):

E(C | m, q) = σ²_{m,q} + (bias_{m,q})² + variance_{m,q} − 2 cov_{m,q},    (4.1)

where

σ²_{m,q} ≡ E(Y_F² | q) − [E(Y_F | q)]²,
bias_{m,q} ≡ E(Y_F | q) − E(Y_H | m, q),
variance_{m,q} ≡ E(Y_H² | m, q) − [E(Y_H | m, q)]², and
cov_{m,q} ≡ Σ_{y_F,y_H} P(y_H, y_F | m, q) × [y_H − E(Y_H | m, q)] × [y_F − E(Y_F | q)].

In this equation, the terms E(Y_F | q), E(Y_F² | q), E(Y_H | q, m), and E(Y_H² | q, m) are given by the formulas just before equation 3.1, provided one adds an outer integral ∫ df P(f) to average out f. To evaluate the covariance term, use P(y_H, y_F | m, q) = ∫ dh ∫ df Σ_d P(y_H, y_F, h, d, f | m, q). Then use the simple identity

P(y_H, y_F, h, d, f | m, q) = f(q, y_F) h(q, y_H) P(h | d) P(d | f, q, m) P(f).

Formally, the reason that the covariance term exists in equation 4.1 when there was none in equation 3.1 is that y_H and y_F are conditionally independent if one is given f and q (as in equation 3.1), but not if one is given only q (as in equation 4.1). To illustrate the latter point, note that knowing y_F, for example, tells you something about f you do not already know (assuming f is not fixed, as it is in equation 3.1). This in turn tells you something about d, and therefore something about h and y_H. In this way y_H and y_F are statistically coupled if f is not fixed.

Intuitively, the covariance term simply says that one would like the learning algorithm's guess to track the (posterior) most likely targets, as one varies training sets. Without such tracking, simply having low bias_{m,q} and low variance_{m,q} does not imply good generalization. This is intuitively reasonable. Indeed, the importance of such tracking between the learning algorithm P(h | d) and the posterior P(f | d) is to be expected, given that E(C | m, q) can be written as a non-Euclidean inner product between P(f | d) and P(h | d). (This is true for any loss function; see Wolpert, 1995.)
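A toy example (all numbers invented) may help. Let there be a single test point q and two equally likely noise-free targets, one with Y_F = 0 and one with Y_F = 1, and suppose the training set always reveals which target is present, with the learning algorithm guessing the revealed value. Then E(Y_F | q) = E(Y_H | m, q) = 1/2, so bias_{m,q} = 0, while σ²_{m,q} = variance_{m,q} = 1/4. The algorithm is never wrong, and indeed cov_{m,q} = 1/4, so equation 4.1 gives E(C | m, q) = 1/4 + 0 + 1/4 − 2(1/4) = 0. Low bias and low variance alone would not have predicted this; it is the tracking captured by cov_{m,q} that drives the error to zero.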
4.2 Discussion.

The terms bias_{m,q}, variance_{m,q}, and σ_{m,q} play the same roles as bias_{f,m,q}, variance_{f,m,q}, and σ_{f,m,q} do in equation 3.1. The major difference is that here they involve averages over f according to P(f), since the target f is not fixed. In particular, desiderata b and c are obeyed exactly by bias_{m,q} and variance_{m,q}. Similarly, the first part of desideratum a is obeyed exactly, if the reference to "f" there is taken to mean all f for which P(f) is nonzero, and if the delta functions referred to are implicitly restricted to be identical (at q) for all such f. In addition, σ²_{m,q} is independent of the learning algorithm, in agreement with the second part of desideratum a.

Now that we have the covariance term though, the third part of desideratum a is no longer obeyed. Indeed, by using P(d, y_F | m, q) = P(y_F | d, q) P(d | m, q), we can rewrite σ²_{m,q} as the sum of the expected cost of the best possible (Bayes-optimal) learning algorithm for quadratic loss (that is, the loss of the learning algorithm that obeys P(y_H | d, q) = δ(y_H, E(Y_F | d, q))), plus another term. Here and throughout, any expression of the form "δ(·, ·)" indicates the Kronecker delta function.
That second term is called the data worth of the problem, since it sets how much of an improvement in error can be had from paying attention to the data:

σ²_{m,q} = Σ_{d,y_F} P(d, y_F | m, q) [y_F − E(Y_F | d, q)]²    (the Bayes-optimal algorithm's cost)
    + Σ_d P(d | m, q) ([E(Y_F | d, q)]² − [E(Y_F | m, q)]²)    (the data worth).    (4.2)
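Continuing the invented two-target illustration from section 4.1 (a single test point, two equally likely noise-free targets with Y_F = 0 or 1, and data that always identify the target): the Bayes-optimal algorithm's cost is zero, so equation 4.2 says σ²_{m,q} consists entirely of data worth, Σ_d P(d | m, q) ([E(Y_F | d, q)]² − [E(Y_F | m, q)]²) = (1/2)(0 − 1/4) + (1/2)(1 − 1/4) = 1/4, matching σ²_{m,q} = 1/4.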
One might wonder why all of this does not also apply to the conventional bias-variance decomposition for z = {f, m, q}. The reason is that for f-conditioned probabilities, the best possible algorithm does not guess E(Y_F | d, q) but rather E(Y_F | f, q), so that the equivalent of the data worth term vanishes. This is why the decomposition in equation 4.2 does not also apply to σ²_{f,m,q}.

Note that for the Bayes-optimal learning algorithm, cov_{m,q} exactly equals the data worth. This is to be expected, since for that learning algorithm, bias_{m,q} equals 0 and variance_{m,q} equals cov_{m,q}.²

² The latter point follows from the following identities: For the optimal algorithm, whose guessing is governed by a delta function in Y, the variance is given by

Σ_d P(d | m, q) [E²(Y_H | d, q) − E²(Y_H | q, m)] = Σ_d P(d | m, q) [E²(Y_F | d, q) − E²(Y_F | q, m)].

In addition, the covariance is given by

Σ_{d,y_F,y_H} P(y_H | d, q) P(y_F | d, q) P(d | m, q) × [y_H − E(Y_H | q, m)] × [y_F − E(Y_F | q, m)]
= Σ_d P(d | m, q) × [E(Y_F | d, q) − E(Y_F | q)] × [E(Y_F | d, q) − E(Y_F | q)].

To see why the data worth measures how much paying attention to the data can help you to guess f, note that the expected cost of the best possible data-independent learning algorithm equals σ²_{m,q}. So the data worth is the difference between the expected cost of this algorithm and that of the Bayes-optimal algorithm. Note the nice property that when the variance (as one varies training sets d) of the Bayes-optimal algorithm is large, so is the data worth. So, reasonably enough, when the Bayes-optimal algorithm's variance is large, there is a large potential gain in paying attention to the data. Conversely, if the variance for the Bayes-optimal algorithm is small, then not much can be gained by using it rather than the optimal data-independent learning algorithm.

As it must, E(C | m, q) reduces to the expression in equation 3.1 for E(C | f = f*, m, q) for the prior P(f) = δ(f − f*). The special case of equation 3.1 where there is no noise, and the learning algorithm always
guesses the same single-valued input-output function for the same training set, is given in Wolpert (1995). One can argue that E(C | m, q) is usually of more direct interest than E(C | f, m, q), since one can rarely specify the target in the real world but must instead be content to characterize it with a probability distribution. Insofar as this is true, by equation 4.1 there is not a bias-variance tradeoff, as is conventionally stated. Rather there is a bias-variance-covariance trade-off. More generally, one can argue that one should always analyze the distribution P(C | z) where z is chosen to reflect directly the statistical scenario with which one is confronted. (See the discussion of the honesty principle at the end of Wolpert 1995.) In the real world, this usually entails having z = {d}. In toy experiments with a fixed target, it usually means having z = { f, m}. For other kinds of experiments, it means other kinds of z’s. The only justification for not setting z this way is calculational intractability of the resultant analysis, or difficulty in determining the distributions needed to perform that resultant analysis, or both. (This latter issue is why Bayesian analysis does not automatically meet our needs in the real world. With such analysis there is often the issue of how to determine P( f ), the prior.) However at the level of abstraction of this article, neither difficulty arises. So for current purposes, the applicability of the covariance terms introduced is determined solely by the statistical scenario with which one is confronted. 5 Other Corrections to Quadratic Loss Bias-Plus-Variance 5.1 The General Quadratic Loss Bias-Plus-Variance-Plus-Covariance Decomposition. Often in supervised learning, one is interested in generalization error, the average error between f and h over all q. For fixed f and m, the expectation of this error is E(C | f, m). This quantity is ubiquitous in computational learning theory (COLT), as well as several other popular theoretical approaches to supervised learning (Wolpert, 1995). Nor is interest in it restricted to the theoretical (as opposed to applied) communities; it seems fair to say that it is far more commonly investigated than is E(C | f, m, q) in both the machine learning and neural net literatures. To address generalization error, note that for any set of random variables Z, taking (set) values z, E(C | z) = σz2 + (biasz )2 + variancez −2 covz , where σz2 ≡ E(YF2 | z) − [E(YF | z)]2 , biasz ≡ E(YF | z) − E(YH | z),
(5.1)
On Bias Plus Variance
1225
2 | z) − [E(Y | z)]2 , variancez ≡ E(YH H
and
covz ≡ 6yF ,yH P(yH , yF | z) × [yH − E(YH | z)] × [yF − E(YF | z)].
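As a concrete check of equation 5.1, here is a minimal Monte Carlo sketch; the toy joint distribution over (y_F, y_H) is an illustrative assumption, not part of the analysis:

```python
import numpy as np

# Estimate the four terms of equation 5.1 from paired samples (y_F, y_H)
# drawn from an assumed P(y_H, y_F | z), and verify the decomposition.
rng = np.random.default_rng(0)
n = 100_000
y_f = rng.normal(1.0, 1.0, n)              # samples of Y_F given z
y_h = 0.5 * y_f + rng.normal(0.0, 0.5, n)  # a learner whose guess tracks Y_F

noise = np.var(y_f)                                      # sigma_z^2
bias2 = (np.mean(y_f) - np.mean(y_h)) ** 2               # (bias_z)^2
var = np.var(y_h)                                        # variance_z
cov = np.mean((y_h - y_h.mean()) * (y_f - y_f.mean()))   # cov_z

lhs = np.mean((y_f - y_h) ** 2)            # E(C | z) for quadratic loss
assert np.isclose(lhs, noise + bias2 + var - 2 * cov)    # equation 5.1
```

Because the learner's guess tracks y_F here, the covariance term is positive and lowers the expected cost below what bias and variance alone would suggest.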
The terms in this formula can usually be interpreted just as the terms in the conventional (z = {f, m, q}) decomposition can. So, for example, variance_z measures the variability of the learning algorithm's guess as one varies over those random variables not specified in z. Equation 4.1 is a special case of this formula where z = {m, q}, and equation 3.1 is a special case where z = {f, m, q} (and consequently the covariance term vanishes). However, both of these equations have z contain q, when, by equation 5.1, we could just as easily have z not contain q. By doing that, we get a correction to the bias-plus-variance formula for generalization error. (Note that this correction is in no sense a "Bayesian" correction.) To be more precise, the following is an immediate corollary of equation 5.1:

E(C | f, m) = σ²_{f,m} + (bias_{f,m})² + variance_{f,m} − 2 cov_{f,m}    (5.2)
(with definitions of the terms given in equation 5.1). Note that corrections similar to that of equation 5.2 hold for E(C | f, d) and E(C | m). In all three of these cases, σ²_z, bias_z, and variance_z play the same role as do σ²_{f,m,q}, bias_{f,m,q}, and variance_{f,m,q} in equation 3.1. The only difference is that different quantities are averaged over. (So, for example, in E(C | f, d), variance partially reflects variability in the learning algorithm's guess as one varies q.) In particular, most of our desiderata for these quantities are met. In addition, though, for all three of these cases, P(y_H, y_F | z) ≠ P(y_H | z)P(y_F | z) in general, and therefore the covariance correction term is nonzero in general.

So for these three cases, as well as for the one presented in the previous subsection, there is not a bias-variance dilemma. Rather, there is a bias-variance-covariance dilemma. It is only for a rather specialized kind of analysis, where both q and f are fixed, that one can ignore the covariance term. For almost all other analyses—in particular, the popular generalization error analyses—the huge body of lore explaining various aspects of supervised learning in terms of a bias-variance dilemma is less than the whole story.

5.2 Averaging over a Random Variable versus Having It Be in z. Before leaving the topic of equations 5.1 and 5.2, it should be pointed out that one could just as easily use equation 3.1 and write E(C | f, m) = Σ_q P(q | f, m)[σ²_{f,m,q} + bias²_{f,m,q} + variance_{f,m,q}] rather than use equation 5.2. Under commonly made assumptions (see Wolpert, 1994), P(q | f, m) just equals P(q), which is often called the sampling distribution. In such cases, this q average of equation 3.1 is often straightforward to evaluate.

It is instructive to compare such a q average to the expansion in equation 5.2. First, such a q average is often less informative than equation 5.2. As an example, consider the simple case where f is single valued (i.e., a single-valued function from X to Y), h is single valued, and the same h is guessed for all training sets sampled from f. Then Σ_q P(q | f, m) σ²_{f,m,q} = Σ_q P(q | f, m) variance_{f,m,q} = 0; the q-averaged intrinsic noise and variance terms provide no useful information, and the expected error is given solely by the bias term. On the other hand, σ²_{f,m} tells us the amount that f(q) varies about its average (over q) value, and variance_{f,m} is given by a similar term. In addition, bias_{f,m} tells us the difference between those q averages of f and h. And, finally, the covariance term tells us how much our h tracks f as one moves across X. So all the terms in equation 5.2 provide helpful information, whereas the q average of equation 3.1 reduces to the tautology "Σ_q P(q | f, m)E(C | f, m, q) = Σ_q P(q | f, m)E(C | f, m, q)."

In addition to this difficulty, in certain respects the individual terms in the q average of equation 3.1 do not meet all of our desiderata. In particular, desideratum b is not obeyed in general by that decomposition, if one tries to identify "bias²" as Σ_q P(q | f, m) bias²_{f,m,q} (note that here, the z referred to in desideratum b is {f, m}). On the other hand, the terms in equation 5.2 do meet all of our desiderata, including the third part of a. (The best possible algorithm if one is given f and m is the algorithm that always guesses h(q, y) = δ(y, E(Y_F | f)), regardless of the data.)

There are also aesthetic advantages to using equation 5.1 and its corollaries (like equation 5.2) rather than averages of equation 3.1. For example, equation 5.1 does not treat z = {f, m, q} as special in any sense; all z's are treated equally. This contrasts with formulas based on equation 3.1, in which one writes E(C | f, m) as a q average of E(C | f, m, q), E(C | m, q) as an f average of E(C | f, m, q), and so forth. Moreover, equation 5.1 holds even when z is not a subset of {f, m, q}. So, for example, E(C | h, q) can be expressed directly in terms of bias plus variance by using equation 5.1. However, there is no simple way to express the same quantity in terms of equation 3.1. Indeed, equation 5.1 even allows us to write the purely Bayesian quantity E(C | d) in bias-plus-variance terms, something we cannot do using equation 3.1:

E(C | d) = σ²_d + (bias_d)² + variance_d − 2 cov_d,    (5.3)
where

σ²_d ≡ E(Y²_F | d) − [E(Y_F | d)]²,
bias_d ≡ E(Y_F | d) − E(Y_H | d),
variance_d ≡ E(Y²_H | d) − [E(Y_H | d)]²,

and

cov_d ≡ Σ_{y_F,y_H} P(y_H, y_F | d) × [y_H − E(Y_H | d)] × [y_F − E(Y_F | d)].

By using this formula one does not even have to go to a midway point
between Bayesian analysis and conventional bias-plus-variance analysis to relate the two. Rather, equation 5.3 directly provides a fully Bayesian bias-plus-variance decomposition. So, for example, as long as one is aware that different variables are being averaged over than is the case for the conventional bias-variance decomposition (namely, q and f rather than d), equation 5.3 allows us to say directly that for quadratic loss, in the statistical scenario of interest in Bayesian analysis, the Bayes-optimal learning algorithm has zero bias. Traditionally, the very notion of "bias" has had no meaning in Bayesian analysis.

None of this means that one should not use the appropriate average of equation 3.1 (in those cases where there is such an average) rather than equation 5.1 and its corollaries. There are scenarios in which that average provides a helpful perspective on the learning problem that equation 5.1 does not. For example, if Y contains very few values (e.g., two), then in many scenarios bias_{f,m}, which equals Σ_q P(q | f, m)[E(Y_H | f, m, q) − E(Y_F | f, q)], is close to zero, regardless of the learning algorithm. In the same scenarios though, the associated q average of the equation 3.1 term, Σ_q P(q | f, m) bias²_{f,m,q} = Σ_q P(q | f, m)[E(Y_H | f, m, q) − E(Y_F | f, q)]², is often far from zero. In such cases the bias term in equation 5.2 is not very informative, whereas the associated q-average term is. In the end, which kind of decomposition one uses, just like the choice of what variables z represents, depends on what one is trying to understand about one's learning problem.

6 Other Characteristics Associated with the Loss

6.1 General Properties of Convex and/or Concave Loss Functions. There are a number of other special properties of quadratic loss besides the equations presented thus far. For example, for quadratic loss, for any f, E(C | f, m, q, algorithm A) ≤ E(C | f, m, q, algorithm B) so long as A's guess is the average of B's (formally, so long as we have P(y_H | d, q, A) = δ(y_H, E(Y_H | d, q, B)) = δ(y_H, ∫dh Σ_y y h(q, y) P(h | d, B))). So without any concerns for priors, one can always construct an algorithm that is assuredly superior to an algorithm with a stochastic nature: simply guess the stochastic algorithm's average. (This is a result of Jensen's inequality; see Wolpert, 1995, 1996a, 1996b; and Perrone, 1993.) This is true whether the stochasticity is due to non-single-valued h or (as with backpropagation with a random initial weight) due to the learning algorithm's being nondeterministic.

Now the EBF is symmetric under h ↔ f. Accordingly, this kind of result can immediately be "turned around." In such a form, it says, loosely, that a prior that is the average of another prior assuredly results in lower expected cost, regardless of the learning algorithm. In this particular sense, for quadratic loss, one can place an algorithm-independent ordering over some priors. (Of course, one can also order them in an algorithm-dependent manner if one wishes, for example, by looking at the expected generalization error of the Bayes-optimal learning algorithm for the prior in question.)
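To make the averaging result concrete, here is a minimal sketch under illustrative assumptions (the target value and the stochastic learner below are invented for the example):

```python
import numpy as np

# Jensen's-inequality point for quadratic loss: replacing a stochastic
# learner's guess by its average never increases expected loss, whatever
# the fixed target value y_f.
rng = np.random.default_rng(1)
y_f = 0.3                                  # arbitrary fixed target value
guesses = rng.normal(2.0, 1.5, 100_000)    # algorithm B: stochastic guesses

loss_b = np.mean((guesses - y_f) ** 2)     # expected loss of B
loss_a = (np.mean(guesses) - y_f) ** 2     # loss of A, which guesses B's mean

# loss_a equals loss_b minus the variance of B's guesses, so A never loses.
assert loss_a <= loss_b
```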
The exact opposite behavior holds for loss functions that are concave rather than convex. For such functions, guessing randomly is assuredly superior to guessing the average, regardless of the target. (There is a caveat to this: one cannot have a loss function that is both concave everywhere across an infinite Y and nowhere negative, so formally, this statement holds only if we know that y_F and y_H are both in a region of concave loss.)

6.2 General Properties of Metric Loss Functions. Finally, there are other special properties that some loss functions possess but that quadratic loss does not. For example, if the loss can be written as a function L(·, ·) that is a metric (e.g., absolute value loss, zero-one loss), then for any f,

|E(C | f, h₁, m, q) − E(C | f, h₂, m, q)| ≤ Σ_{y,y′} L(y, y′) h₁(q, y) h₂(q, y′).    (6.1)

For such loss functions, you can bound how much replacing h₁ by h₂ can improve or hurt generalization by looking only at h₁ and h₂. That bound holds without any concern for the prior over f. It is simply the expected loss between h₁ and h₂. Unfortunately, quadratic loss is not a metric, and therefore one cannot employ this bound for quadratic loss.

7 An Alternative Analysis of the Friedman Effect

Kong and Dietterich (1995) have raised the issue of what the appropriate bias-plus-variance decomposition is for zero-one (misclassification) loss, L(y_H, y_F) = 1 − δ(y_H, y_F). They raised the issue in the context of the classical conditioning event: the target, the training set size, and the test set question. The decomposition they suggested (i.e., their suggested definitions of bias and variance for zero-one loss) has several shortcomings, not least of which is that it allows negative variance. Several subsequent papers (Kohavi & Wolpert, 1996; Wolpert & Kohavi, 1997; Tibshirani, 1996; Breiman, 1996) have offered alternative decompositions, with different strengths and weaknesses, as discussed in Kohavi and Wolpert (1996) and Wolpert and Kohavi (1997).

Recently, Friedman (1996) contributed another zero-one loss decomposition to the discussion. Friedman's decomposition applies only to learning algorithms that perform their classification by first predicting the probabilities η_y of all the possible output classes and then picking the class argmax_i[η_i]. In other words, he considers cases where h is single valued, but the value h(q) ∈ Y is determined by finding the maximum over the components of a Euclidean vector random variable η, dependent on q and d, whose components always sum to 1 and are all nonnegative. Intuitively, the η_y are the probabilities of the
various possible Y_F values for q, as guessed by the learning algorithm in response to the training set.

The restriction of Friedman's analysis to such algorithms is not lacking in consequence. For example, it rules out perhaps the simplest possible learning algorithm, one that is of great interest in the computational learning community: from a set of candidate hypothesis input-output functions, pick the one that best fits the training set. This restriction makes Friedman's analysis less general than the other zero-one loss decompositions that have been suggested.

There are several other peculiar aspects to Friedman's decomposition. Oddly, it is multiplicative rather than additive in its suggested definitions of bias and variance. More important, it assumes that as one varies training sets d while keeping f, m, and q constant, the induced distribution over η is gaussian. It then further assumes that the truncation on integrals over such a gaussian distribution imposed by the limits on the range of each of the η_y (each η_y ∈ [0, 1]) is irrelevant, so that erf functions can be replaced by integrals from −∞ to +∞. (Note that even if the gaussian in question is fairly peaked about a value well within [0, 1], that part of the dependence of the integral on certain quantities occurring in the limits on the integration may be significant, in comparison to the other ways that that integral depends on those quantities.) In addition, as Friedman defines it, bias can be negative. Furthermore, his bias does not reduce to zero for the Bayes classifier (and in fact the approximations invoked in his analysis are singular for that classifier). And perhaps most important, his definition of variance depends directly on the underlying target distribution f, rather than indirectly through the f-induced distribution over training sets P(d | f, m, q). This last means that the variance term does not simply reflect how sensitive the learning algorithm is to (target-governed) variability in the training set; the variance also changes even if one makes a change to the target that has no effect on the induced probability distribution over training sets.

These oddities should not obscure the fact that Friedman's analysis is a major contribution. Perhaps the most important aspect of Friedman's analysis is its drawing attention to the following phenomenon. Consider the case where we are interested in expected zero-one loss conditioned on a fixed target, test set question, and training set size. Presume further that we are doing binary classification (r = 2), so the class label probabilities guessed by the learning algorithm reduce to a single real number η₁ giving the guessed probability of class 1 (i.e., η = (η₁, 1 − η₁)). Examine the case where, for the test set point at hand, the average (over training sets generated from f) value of η₁ is greater than one-half. So the guess corresponding to that average η₁ is class 1. However, let us say that the truly optimal prediction (as determined by the target) is class 2. Now modify the scenario so that the variability (over training sets) of the guess η₁ grows, while its average stays the same (i.e., the width of the distribution over η₁ grows). Assume that this extra variability results in having η₁ < 1/2 for more training sets.
So for more training sets, the η₁ produced by the learning algorithm corresponds to the (correct) guess of class 2. Therefore increasing the variability of η₁ while keeping the average the same has reduced overall generalization error. Note that this effect arises only when the average of η₁ results in a nonoptimal prediction, i.e., only when the average is "wrong." (This point will be returned to shortly.)

Now view this variability as, intuitively, akin to a variance. (This definition differs only a little from the formal definition of variance Friedman advocates.) Similarly, view the average of η₁ as giving a bias. Then we have the peculiar result that increasing variance while keeping bias fixed can reduce overall expected generalization error. I will refer to such behavior as the Friedman effect. (See also Breiman, 1996—in particular, the discussion in the first appendix.)

Variability can be identified with the width of the distribution over η₁ and in that sense can indeed be taken to be a variance. The question is whether it makes sense to view it as a variance in the restricted desiderata-biased sense appropriate to bias-variance decompositions. In particular, note that whether the Friedman effect obtains—whether increasing the variability increases or decreases expected error—depends directly not only on the learning algorithm and the distribution over training sets, but also on the target (the average η₁ must result in a guessed class label that differs from the optimal prediction as determined by the target for the Friedman effect to hold).

This direct dependence of the Friedman effect on the target is an important clue. Recall in particular that our desiderata preclude having a variance with such a direct dependence. However, such a dependence of the "variability" term could be allowed if the variability that is being increased is identified with a variance combined with another quantity. Given the general form of the bias-variance decomposition, the obvious choice for that other quantity is a term reflecting a covariance. By having that covariance involve varying over y_F values as well as training sets, we get the direct dependence on the target arising in the Friedman effect. In addition, viewing the variability as involving a covariance term along with a variance term could potentially explain away the peculiarity of the Friedman effect's having error shrink as variability increases. This would be the case, for example, if holding the covariance part of the variability fixed while increasing the variance part always did increase expected generalization error. The idea would be that the way that variability is increased in the Friedman effect involves changes to the covariance term as well as the variance term, and it is the changes to the covariance term that are ultimately responsible for the reduction in expected generalization error. Increasing the variance term, by itself, can only increase expected error, exactly as in the conventional bias-plus-variance decomposition.

As it turns out, this hypothesized behavior is exactly what lies behind the Friedman effect. This can best be seen by using a different decomposition from Friedman's. In exploring that alternative decomposition, we will see
that there is nothing inherently unusual about how variance is related to generalization error for zero-one loss; in this alternative decomposition, the decrease in generalization error associated with increasing variability is due to the covariance term rather than the variance term. Common intuition is salvaged. This alternative decomposition has the additional advantage that it is valid even if the assumptions that Friedman's analysis requires do not hold. Moreover, this alternative decomposition is additive rather than multiplicative, holds for all algorithms that employ η's, and in general avoids the other oddities of Friedman's analysis.

Unfortunately, to clarify all this, it is easiest to work with somewhat generalized notation. (The reader's forbearance is requested.) The reason for this is that since it can take on only two values, the zero-one loss can hide a lot of phenomena via "accidental" cancellation of terms and the like. To have all such phenomena manifest, we need the generalized notation.

First write the cost in terms of an abstraction of the expected zero-one loss:

C = C(f, η, q) = Σ_y [1 − R(y, η)] f(q, y) = 1 − Σ_y R(y, η) f(q, y).

It is required that R(y, η) be unchanged if one relabels both y and the components of η simultaneously and in the same manner. As an example of an R, for the precise case Friedman considers, there are two possible values of y, and R(y, η) = 1 for y = argmax_i(η_i), 0 for all other y.

Note that when expressed this way, for fixed q the cost is given by a dot product between two vectors indexed by Y values—namely, R and f. Moreover, for many R (e.g., the zero-one R), that dot product is between two probability distributions over y. Note also that whereas h and f are on an equal footing in the EBF (cf. section 2, especially its end), the same is not true for η and f. Indeed, whereas h and f arise in a symmetric manner in the zero-one loss cost (C(f, h, q) = Σ_{y_H,y_F} [1 − δ(y_H, y_F)] h(q, y_H) f(q, y_F)), the same is manifestly not true for the variables η and f.

For fixed η and q, respectively, for both the (Y-indexed) vector R(y, η) and the vector f(q, y), there are only r − 1 free components, since in both cases the components must sum to 1. Accordingly, as in Friedman's analysis, here it makes sense to reduce the number of free variables to 2r − 2 by expressing one of the R components in terms of the others and similarly for one of the f components. A priori, it is not clear which such components should be reexpressed this way. Here I will choose to reexpress the component y*(z) ≡ argmax_y R(y, E[η | z]) this way for both R and f. For Friedman's R(·, ·), this y*(z) is equivalent to the y maximizing E[η_y | z]. (So in the example above of the Friedman
effect, y*(z) = class 1.) Accordingly, from now on I will replace R(y*(z), η) with 1 − Σ_{y≠y*(z)} R(y, η), and I will replace f(q, y*(z)) with 1 − Σ_{y≠y*(z)} f(q, y) wherever those terms appear. (From now on, when the context makes z clear, I will write y* rather than y*(z).) Having made these replacements, we can write

C = Σ_{y≠y*} R(y, η) + Σ_{y≠y*} f(q, y) − [Σ_{y≠y*} R(y, η)] × [Σ_{y≠y*} f(q, y)] − Σ_{y≠y*} R(y, η) f(q, y),

which we can rewrite as

C = Σ_{y≠y*} R(y, η) + Σ_{y≠y*} f(q, y) − S_{y≠y*}(R(y, η), f(q, y)),

if we use the shorthand S_{i}(g(i), h(i)) ≡ [Σ_i g(i)] × [Σ_i h(i)] + Σ_i g(i)h(i).

Now we must determine what noise plus bias is. Since η and f do not live in the same space, not all of the desiderata listed previously are well defined. Nonetheless, we can satisfy their spirit by taking noise plus bias to be E[C(·, E[η | z], q) | z], that is, by taking it to be the expected value of C when one guesses using the expected output of the learning algorithm. In particular, this definition makes sense in light of desideratum 3. Writing it out, by using the multilinearity of S(·, ·), we see that for this definition noise plus bias is given by

Σ_{y≠y*} R(y, E[η | z]) + Σ_{y≠y*} E[F(Q, y) | z] − S_{y≠y*}(R(y, E[η | z]), E[F(Q, y) | z]).

(As a point of notation, E[F(Q, y) | z] means ∫df Σ_q f(q, y) P(f, q | z); it is the y component of the z-conditioned average target for average q.)

As an example, for the zero-one loss R(·, ·) Friedman considers, R(y, E[η | z]) = 0 for all y ≠ y*. Therefore noise plus bias reduces to Σ_{y≠y*} E[F(Q, y) | z]. For his z, {f, m, q}, this is just 1 − f(q, y*). So for Friedman's r = 2 scenario, if the class corresponding to the learning algorithm's average η₁ is the same as the target's average class y*, noise plus bias is min_y f(q, y). Otherwise it is max_y f(q, y). If we identify min_y f(q, y) with the noise term, this means that the bias equals either zero or |f(q, y = 1) − f(q, y = 2)|, depending on whether the class corresponding to the learning algorithm's average η₁ is the same as the target's average class.

Continuing with our more general analysis, the difference between E(C | z) and noise plus bias is

Σ_{y≠y*} {E[R(y, η) | z] − R(y, E[η | z])}
  − {E[S_{y≠y*}(R(y, η), F(Q, y)) | z] − S_{y≠y*}(R(y, E[η | z]), E[F(Q, y) | z])}.    (7.1)
For many R, the expression on the first line of equation 7.1 is nonnegative. For example, due to the definition of y*, this is the case for the zero-one R. In addition, that expression does not involve f directly, equals zero when the learning algorithm's guess never changes, and so forth. Accordingly, that expression (often) meets the usual desiderata for a variance and will here be identified as a variance. If everything else is kept constant while this variance is increased, then expected error also increases; there is none of the peculiar behavior Friedman found if one identifies variance with this expression from equation 7.1. For the zero-one R, this variance term is the probability that the y maximizing η_y is not the one that maximizes the expected η_y.

The remaining terms in equation 7.1 collectively act like the negative of a covariance. Indeed, for the case Friedman considers where r = 2, if we define ∼y* as the element of Y that differs from y*, we can write (the negative of) those terms as

2{E[R(∼y*, η) × F(Q, ∼y*) | z] − R(∼y*, E[η | z]) × E[F(Q, ∼y*) | z]}.

Note the formal parallel between this expression for the remaining terms in equation 7.1 and the functional forms of the covariance terms in the bias-variance decompositions for other costs that were presented above. Of course, the parallel is not exact. In particular, this expression is not exactly a covariance; a covariance would have an E[R | z] × E[f | z] type term rather than an R(·, E[η | z]) × E[F | z] term. The presence of the R(·, ·) functional is also somewhat peculiar. Indeed, the simple fact that the covariance term is nonzero for z = {f, m, q} is unusual (see the quadratic loss decomposition, for example). Ultimately, all these effects follow from the fact that the decomposition considered here does not treat targets and hypotheses the same way; targets are represented by f's, whereas hypotheses are represented by η's rather than h's. (Recall that for Friedman's scenario's R, h is determined by the maximal component of η.) If instead hypotheses were represented by h's, then the zero-one loss decomposition would involve a proper covariance, without any R(·, ·) functional (Kohavi & Wolpert, 1996).

It is the covariance term, not the variance term, that results in the possibility that an increase in variability reduces expected generalization error. To see this, along with Friedman, take z to equal {f, m, q}. Then our covariance term becomes

2{E[R(∼y*, η) | f, m, q] × f(q, ∼y*) − R(∼y*, E[η | f, m, q]) × f(q, ∼y*)}.

(Note that for the zero-one R(·, ·), R(∼y*, E[η | f, m, q]) = 0, since by definition ∼y* is not the Y value that maximizes E[η | f, m, q].)
Again following along with Friedman, have E[η | f, m, q] fixed while increasing variability. Next presume that that increase in variability increases E[R(∼y*, η) | z], as Friedman does. (E[R(∼y*, η) | z] gives the amount of "spill" of the learning algorithm's guess η₁ into ∼y*, the output label other than the one corresponding to the algorithm's average η₁.) Doing all this will always decrease the contribution to the expected error arising from the covariance term. It will also always increase the variance term's contribution to the expected error. So long as f(q, ∼y*) > 1/2, the former phenomenon will dominate the latter, and overall, expected error will decrease. This condition on f(q, ∼y*) is exactly the one that corresponds to Friedman's "peculiar behavior"; it means that the target is weighted toward the opposite Y value from the one given by the expected guess of the learning algorithm. So whether the peculiar behavior holds when one increases variability depends on whether the associated increase in variance manages to offset the associated increase in covariance (this latter increase resulting in a decrease in contribution to expected error). As in so much else having to do with the bias-variance decomposition, we have a classical trade-off between two competing phenomena, both arising in response to the same underlying modification to the learning problem. There is nothing peculiar in this, but in fact only classical bias-variance phenomenology.

Nonetheless, it should be noted that for zero-one R, the covariance just equals 2 f(q, ∼y*) times the variance. Moreover, for zero-one R and z = {f, m, q}, noise plus bias = 1 − f(q, y*). So if noise and bias are held constant, it is impossible to change the variance while everything else is held constant; the covariance will necessarily change as well, assuming the bias and noise do not. This behavior should not be too surprising. Since the zero-one loss can take on only two values, one might expect that it is not possible for all four of noise, bias, variance, and covariance to be independent. The reason for the precise dependency among those terms encountered here can be traced to two factors: choosing z to equal {f, m, q}, and performing the analysis in terms of η rather than h. For decompositions involving h (rather than η), the covariance term relates variability in h to variability in f, and therefore must vanish for the fixed f induced by having z equal {f, m, q}. However, in this section we are doing the analysis in terms of η, and η does not have the formal equal footing with f that h does. So rather than a proper covariance between η and f, we instead have here an unconventional "covariance-like" term. And this term does not disappear even for fixed f. As a sort of substitute, the covariance-like term instead is uniquely determined by the values of the noise, bias, and variance, if f is fixed.
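The interplay just described can be watched in a small simulation. This is an illustrative sketch, not Friedman's analysis: the target probabilities and the gaussian spread of η₁ are assumptions made for the example.

```python
import numpy as np

# Friedman effect for zero-one loss: binary classification at a fixed test
# point q, target f(q, class1) = 0.4 (so the optimal prediction is class 2),
# and a learner whose guessed probability eta1 of class 1 varies over
# training sets as a gaussian with fixed mean 0.6 (the "wrong" side of 1/2).
rng = np.random.default_rng(2)
f_class1 = 0.4

for spread in (0.01, 0.1, 0.3, 1.0):
    eta1 = rng.normal(0.6, spread, 1_000_000)  # guessed P(class 1), one per d
    guess_class1 = eta1 > 0.5
    # expected zero-one loss, averaged over training sets and over Y_F
    err = np.mean(np.where(guess_class1, 1 - f_class1, f_class1))
    print(f"spread {spread:4.2f}: expected error {err:.3f}")

# The printed error falls toward 0.5 as the spread grows: more variability
# in eta1 means the (correct) class 2 is guessed more often.
```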
8 Bias Plus Variance for Logarithmic Scoring

8.1 The {f, m, q}-Conditioned Decomposition. To begin the analysis of bias plus variance for logarithmic scoring, consider the case where z = {f, m, q}. The logarithmic scoring rule is related to Kullback-Leibler distance and is given by E(C | f, h, q) = −Σ_y f(q, y) ln[h(q, y)], so

E(C | f, m, q) = −Σ_y f(q, y) Σ_d P(d | f, m) ∫dh P(h | d) ln[h(q, y)].
Unlike quadratic loss, logarithmic scoring is not symmetric under f ↔ h. This scoring rule (sometimes instead referred to as the "log loss function," and proportional to the value of the "logarithmic score function" for an infinite test set) can be appropriate when the output of the learning algorithm h is meant to be a guess for the entire target distribution f (Bernardo & Smith, 1994). This is especially true when Y is a categorical rather than a numeric space. To understand that, consider the case where you guess h and have some test set T generated from f that you wish to use to score h. How to do that? One obvious way is to score h as the log likelihood of T given h. If we now have T be infinite, take the geometric mean of the likelihood to get it to be nonzero, and average over all T generated from f, we get logarithmic scoring. Note that logarithmic scoring can be used even when there is no metric structure on Y (as there must be for quadratic loss to be used).

In creating an analogy of the bias-plus-variance formula for cases where C is not given by quadratic loss, one would like to meet conditions 1 through 4 and a through c presented above. However, often there is no such thing as E(Y_F | f, q) when logarithmic scoring is used (i.e., often Y is not a metric space when logarithmic scoring is used). So we cannot measure intrinsic noise relative to E(Y_F | f, q), as in the quadratic loss bias-plus-variance formula. This means that meeting conditions 1 through 4 will be impossible; the best we can do is come up with a formula whose terms meet desiderata a through c and that have behavior analogous in some sense to conditions 1 through 4.

One natural way to measure intrinsic noise for logarithmic scoring is as the Shannon entropy of f,

σ_{ls;f,m,q} ≡ −Σ_y f(q, y) ln[f(q, y)].    (8.1)
Note that this definition meets all three parts of desideratum a. It is also the expected error of f at guessing itself, in close analogy to the intrinsic noise term for quadratic loss (see condition 1). To form an expression for logarithmic scoring that is analogous to the bias² term for quadratic loss, we cannot directly start with the terms involved in quadratic loss because they need not be properly defined (e.g., E(Y_F | z) is not defined for categorical spaces Y, even though logarithmic scoring is perfectly well defined for that case). To circumvent this difficulty,
first note that for quadratic loss, expected Y values are the modes of optimal hypotheses (recall that hypotheses are distributions, and note that for quadratic loss, minimizing expected loss means taking a mean Y value in general). In addition, squares of differences between Y values are expected costs. Keeping this in mind, the bias² term for quadratic loss can be rewritten as the following sum:

Σ_{y,y′} L(y, y′) δ(y, E(Y_F | f, q, m)) δ(y′, E(Y_H | f, q, m)).

This is the expected quadratic loss between two X-conditioned distributions over Y given by the two delta functions. The first of those distributions is what the optimal hypothesis would be if the target were given by E(F | z) (for z = {f, m, q}, E(F | z) = f). More formally, the first of the two distributions is δ(y, E(Y_F | z)) = argmin_h E(C | f = E(F | z), m, q, h). I will indicate this by defining Opt(u) ≡ argmin_h E(C | f = u, m, q, h), where u is any distribution over Y. So this first of our two distributions is Opt(E(F | z)).

For z = {f, m, q}, P(y_F | f = E(H | z), q, m) = f(q, y_F) evaluated for f = ∫dh P(h | f, m, q) h; that is, it equals the vector ∫dh h P(h | f, m, q) evaluated for indices q and y_F. By the usual properties of vector spaces, this can be rewritten as ∫dh h(q, y_F) P(h | f, m, q). However, this is just P(y_H | f, q, m) evaluated for y_H = y_F. Accordingly, E(Y_H | f, q, m) = E(Y_F | f = E(H | z), q, m). So the second of the two distributions in our expression for bias² is what the optimal hypothesis would be if the target were given by E(H | z): argmin_h E(C | f = E(H | z), m, q, h). We can indicate this second optimal hypothesis by Opt(E(H | z)). Combining, we can now rewrite the bias² for quadratic loss as

E(C | f = Opt(E(F | z)), h = Opt(E(H | z)), q, m).
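The next paragraph uses the fact that for logarithmic scoring Opt(u) = u (logarithmic scoring is a proper scoring rule). Here is a quick numerical illustration of that fact, with an assumed three-class distribution:

```python
import numpy as np

# For logarithmic scoring the expected score -sum_y u(y) ln h(y) is
# minimized at h = u (Gibbs' inequality), i.e., Opt(u) = u.
rng = np.random.default_rng(7)
u = np.array([0.6, 0.3, 0.1])

score_at_u = -np.sum(u * np.log(u))
for h in rng.dirichlet(np.ones(3), size=1000):
    assert -np.sum(u * np.log(h)) >= score_at_u
```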
Unlike the original expression for bias² for quadratic loss, this new expression can be evaluated even for logarithmic scoring. The Opt(·) function is different for logarithmic scoring and quadratic loss. For logarithmic scoring, by Jensen's inequality, Opt(u) = u. (Scoring rules obeying this property are sometimes said to be proper scoring rules.) Accordingly, our bias term can be written as E(C | f = E(F | z), h = E(H | z), q). For z = {f, m, q}, this reduces to E(C | f, h = E(H | f, m, q), q). As an aside, note that for quadratic loss, this same expression would instead be identified with noise plus bias. This different way of interpreting the same expression reflects the difference between measuring cost by comparing y_H and y_F versus doing it by comparing h and f. Writing it out, the bias term for logarithmic scoring is

−Σ_y f(q, y) ln{Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)}.³

³ Note that E(C | f, h = E(H | f, m, q), q) for quadratic loss is Σ_{y_H,y_F} L(y_H, y_F) f(q, y_F) Σ_d ∫dh P(h | d) P(d | f, m) h(q, y_H). However, this just equals E(C | f, m, q), rather than (as for logarithmic scoring) E(C | f, m, q) minus a variance term. This difference between the two cases reflects the fact that whereas expected error for loss functions is linear in h in general, expected error for scoring rules is not.
One difficulty with this expression is that its minimal value over all learning algorithms is greater than zero. To take care of that, I will subtract from this expression the additive constant of its minimal value. That minimal value is given by the learning algorithm that always guesses h = f, independent of the training set. Note that this minimal value is just our noise term. So another way to arrive at our choice for the bias term is to simply identify E(C | f, h = E(H | f, m, q), q) with noise plus bias, its value for quadratic loss. Accordingly, our final measure of the "bias" for logarithmic scoring is the Kullback-Leibler distance between the distribution f(q, ·) and the average h(q, ·). With some abuse of notation, this can be written as follows:

bias_{ls;f,m,q} ≡ −Σ_y f(q, y) ln[E(H(q, y) | f, m) / f(q, y)]
  = −Σ_y f(q, y) ln[(Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)) / f(q, y)].    (8.2)
This definition of bias_{ls;f,m,q} meets desideratum b. Given these definitions, the variance for logarithmic scoring and conditioning on f, m, and q (so there should not be a covariance term), variance_{ls;f,m,q}, is fixed and given by

variance_{ls;f,m,q} ≡ −Σ_y f(q, y){E(ln[H(q, y)] | f, m) − ln[E(H(q, y) | f, m)]}
  = −Σ_y f(q, y){Σ_d P(d | f, m) ∫dh P(h | d) ln[h(q, y)] − ln[Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)]}.    (8.3)

Combining, for logarithmic scoring,

E(C | f, m, q) = σ_{ls;f,m,q} + bias_{ls;f,m,q} + variance_{ls;f,m,q}.    (8.4)
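Equation 8.4 can be checked numerically. The following sketch uses an assumed three-class target and a handful of equally likely training sets, each producing one guessed distribution; all names are illustrative:

```python
import numpy as np

# Verify E(C | f, m, q) = noise + bias + variance (equations 8.1-8.4)
# at a single q for logarithmic scoring.
rng = np.random.default_rng(3)
f = np.array([0.5, 0.3, 0.2])              # f(q, y)
h_d = rng.dirichlet(np.ones(3), size=5)    # one guess h_d(q, y) per d
p_d = np.full(5, 0.2)                      # P(d | f, m), uniform here

h_bar = p_d @ h_d                          # E(H(q, y) | f, m)
cost = -np.sum(f * (p_d @ np.log(h_d)))                       # E(C | f, m, q)
noise = -np.sum(f * np.log(f))                                # eq. 8.1
bias = -np.sum(f * np.log(h_bar / f))                         # eq. 8.2
variance = -np.sum(f * (p_d @ np.log(h_d) - np.log(h_bar)))   # eq. 8.3

assert np.isclose(cost, noise + bias + variance)              # eq. 8.4
```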
It is straightforward to establish that variance_{ls;f,m,q} meets the requirements in desideratum c. First, consider the case where P(h | d) = δ(h − h₀) for some h₀ (this delta function is the multidimensional Dirac delta function). In this case the term inside the curly brackets in equation 8.3 just equals
ln[h₀(q, y)] − ln[h₀(q, y)] = 0. So variance_ls does equal zero when the guess h is independent of the training set d. (In fact, it equals zero even if h is not single valued, such single-valued h being the precise case desideratum c refers to.) Next, since the log is a concave function, we know that the term inside the curly brackets is never greater than zero. Since f(q, y) ≥ 0 for all q and y, this means that variance_{ls;f,m,q} ≥ 0 always.

Finally, we can examine the P(h | d) that make variance_ls large. Any h is an |X|-fold Cartesian product of vectors living on |Y|-dimensional unit simplices. Accordingly, for any d, P(h | d) is a probability density function in a Euclidean space. To simplify matters further, assume that P(h | d) is deterministic, so it specifies a single unique distribution h for each d, indicated by h_d. Then the term inside the curly brackets in equation 8.3 equals Σ_d P(d | f, m) ln[h_d(q, y)] − ln[Σ_d P(d | f, m) h_d(q, y)]. This is the difference between an average of a function and the function evaluated at the average. Since the function in question is concave, this difference grows if the points going into the average are far apart. That is, to have large variance_{ls;f,m,q}, the h_d(q, y) should differ markedly as d varies. This establishes the final part of desideratum c.

8.2 Alternative {f, m, q}-Conditioned Decompositions. The approach taken here to deriving a bias-plus-variance formula for logarithmic scoring is not perfect. For example, the formula for variance_{ls;f,m,q} is not identical to the formula for σ_{ls;f,m,q} under the interchange of F with H (as is the case for the variance and intrinsic noise terms for quadratic loss). In addition, variance_{ls;f,m,q} can be made infinite by having h_d(q, y) = 0 for one d and one y, assuming both f(q, y) and P(d | f) are nowhere zero. Although not surprising given that we are interested in logarithmic scoring, this is not necessarily desirable behavior in a variance-like quantity.

Other approaches tend to have even more major problems, however. For example, as an alternative to the approach taken here, one could imagine trying to define a variance first, and then define bias by requiring that the bias plus the variance plus the noise gives the expected error. It is not clear how to follow this approach, however. In particular, one natural definition of variance would be the average cost between h and the average h,

−∫dh P(h | f, m, q) Σ_y E(H(q, y) | f, m, q) ln[h(q, y)].

(Cf. the formula at the beginning of this section giving E(C | f, m, q) for logarithmic scoring.) This can be rewritten as

−Σ_y {Σ_{d′} P(d′ | f, m) ∫dh P(h | d′) h(q, y)} Σ_d P(d | f, m) ∫dh′ P(h′ | d) ln[h′(q, y)].
However, consider the case where P(h | d) = δ(h − f) for all d for which P(d | f, m) ≠ 0. With this alternative definition of variance, in such a situation we would have the variance equaling −Σ_y f(q, y) ln[f(q, y)] = σ_{ls;f,m,q}, not zero. (Indeed, just having P(h | d) = δ(h − h₀) for some fixed h₀ for all d for which P(d | f, m) ≠ 0 suffices to violate our desiderata, since this in general will not result in zero variance.) Moreover, in this scenario, the variance would also equal E(C | f, m, q). So for this scenario, bias = E(C | f, m, q) − variance − σ_{ls;f,m,q} would equal −σ_{ls;f,m,q}, not zero. This violates our desideratum concerning bias.

Yet another possible formulation of the variance would be (in analogy to the formula presented above for logarithmic scoring intrinsic noise) the Shannon entropy of the average H, E(H | z). But again, this formulation of variance would violate our desiderata. In particular, for this definition of variance, having h be independent of d need not result in zero variance.

8.3 Corrections to the Decomposition When z ≠ {f, m, q}. Finally, just as there is an additive Bayesian correction to the {f, m, q}-conditioned quadratic loss bias-plus-variance formula, there is also one for the logarithmic scoring formula. As useful shorthand, write

E(F(·, y) | z) ≡ Σ_q ∫df P(f, q | z) × f(q, y)
  = Σ_q ∫df P(q | z) P(f | q, z) f(q, y),

E(H(·, y) | z) ≡ Σ_q ∫dh P(h, q | z) × h(q, y)
  = Σ_q ∫dh Σ_d ∫df P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) h(q, y),

E(ln[H(·, y)] | z) ≡ Σ_q ∫dh P(h, q | z) × ln[h(q, y)]
  = Σ_q ∫dh Σ_d ∫df P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) ln[h(q, y)],

and

E(F(·, y) ln[H(·, y)] | z) ≡ Σ_q ∫df dh P(h, f, q | z) × f(q, y) ln[h(q, y)]
  = Σ_q ∫df dh Σ_d P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) f(q, y) ln[h(q, y)].
Then simple algebra verifies the following:

E(C | z) = σ_{ls;z} + bias_{ls;z} + variance_{ls;z} + cov_{ls;z},    (8.5)

where

σ_{ls;z} ≡ −Σ_y E(F(·, y) | z) ln[E(F(·, y) | z)],
bias_{ls;z} ≡ −Σ_y E(F(·, y) | z) ln[E(H(·, y) | z) / E(F(·, y) | z)],
variance_{ls;z} ≡ −Σ_y E(F(·, y) | z){E(ln[H(·, y)] | z) − ln[E(H(·, y) | z)]},

and

cov_{ls;z} ≡ −Σ_y {E(F(·, y) ln[H(·, y)] | z) − E(F(·, y) | z) × E(ln[H(·, y)] | z)}.

Note that in equation 8.5 we add the covariance term rather than subtract it (as in equation 3.1). Intuitively, this reflects the fact that −ln(·) is a monotonically decreasing function of its argument, in contrast to (·)². However, even with the sign backward, the covariance term as it occurs in equation 8.5 still means that if the learning algorithm tracks the posterior—if when f(q, y) rises, so does h(q, y)—then the expected cost is smaller than it would be otherwise.

9 Bias Plus Variance for Quadratic Scoring

9.1 The {f, m, q}-Conditioned Decomposition. In quadratic scoring, C(f, h, q) = Σ_y [f(q, y) − h(q, y)]². This is not to be confused with the quadratic score function, which has C(y_F, h, q) = 1 − Σ_y [h(q, y) − δ(y, y_F)]². This score function can be used only when one has a finite test set, which is not the case for quadratic scoring. (See Bernardo & Smith, 1994.) Analysis of bias-plus-variance decompositions for the quadratic score function is the subject of future work.

For quadratic scoring, for z = {f, m, q}, the lower bound on an algorithm's error is zero: guess h to equal f always, regardless of d. Accordingly (see desideratum a), the intrinsic noise term must equal zero. To determine the bias term for quadratic scoring, employ the same trick used for logarithmic scoring to write that term as E(C | f = Opt(E(F | z)), h = Opt(E(H | z)), q). Again as with logarithmic scoring, for any X-conditioned distribution over Y, u, Opt(u) = u. Accordingly, for quadratic scoring, our bias term is E(C | f = E(F | z), h = E(H | z), q). For z = {f, m, q}, this reduces to E(C | f, h = E(H | f, m, q), q) = Σ_y [f(q, y) − E(H(q, y) | f, m, q)]²:

bias_{qs;f,m,q} = Σ_y [P(Y_F = y | f, m, q) − P(Y_H = y | f, m, q)]².

As with logarithmic scoring, we then set variance to be the difference between E(C | f, m, q) and the sum of the intrinsic noise and bias terms. So
for quadratic scoring,

E(C | f, m, q) = σ_{qs;f,m,q} + bias_{qs;f,m,q} + variance_{qs;f,m,q},

where

σ_{qs;f,m,q} ≡ 0,
bias_{qs;f,m,q} ≡ Σ_y [P(Y_H = y | f, m, q) − P(Y_F = y | f, m, q)]²,

and

variance_{qs;f,m,q} ≡ Σ_y ∫dh P(h | f, m, q)[h(q, y)]² − Σ_y [∫dh P(h | f, m, q) h(q, y)]².
As usual, these are in agreement with desiderata a through c. Interestingly, bias_{qs;f,m,q} is the same as the bias² for zero-one loss for z = {f, m, q} (see Kohavi & Wolpert, 1996, and Wolpert & Kohavi, 1997).

9.2 The Arbitrary z Decomposition. To present the general-z case, recall the definitions of E(F(·, y) | z) and E(H(·, y) | z) made just before equation 8.5. In addition, define

E(F²(·, y) | z) ≡ Σ_q ∫df P(f, q | z) × f²(q, y),
E(H²(·, y) | z) ≡ Σ_q ∫dh P(h, q | z) × h²(q, y),

and

E(F(·, y)H(·, y) | z) ≡ Σ_q ∫df dh P(h, f, q | z) × f(q, y) h(q, y).

Then we can write

E(C | z) = σ_{qs;z} + bias_{qs;z} + variance_{qs;z} − 2 cov_{qs;z},    (9.1)

where

σ_{qs;z} ≡ Σ_y {E(F²(·, y) | z) − [E(F(·, y) | z)]²},
bias_{qs;z} ≡ Σ_y [P(Y_H = y | z) − P(Y_F = y | z)]²,
variance_{qs;z} ≡ Σ_y {E(H²(·, y) | z) − [E(H(·, y) | z)]²},

and

cov_{qs;z} ≡ Σ_y {E(F(·, y)H(·, y) | z) − E(F(·, y) | z) × E(H(·, y) | z)}.

As usual, these decompositions meet essentially all of desiderata a through c.
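Equation 9.1 can likewise be verified by simulation. The joint distribution over target and hypothesis below is an illustrative assumption:

```python
import numpy as np

# Check of equation 9.1 for quadratic scoring at a fixed q, for a generic
# conditioning event z under which both f and h vary.
rng = np.random.default_rng(4)
n, r = 10_000, 3
f = rng.dirichlet(np.ones(r), size=n)                  # samples of F(q, .)
h = 0.5 * f + 0.5 * rng.dirichlet(np.ones(r), size=n)  # hypotheses tracking f

cost = np.mean(np.sum((f - h) ** 2, axis=1))           # E(C | z)
noise = np.sum(np.var(f, axis=0))                      # sigma_qs;z
bias = np.sum((f.mean(0) - h.mean(0)) ** 2)            # bias_qs;z
var = np.sum(np.var(h, axis=0))                        # variance_qs;z
cov = np.sum(np.mean(f * h, axis=0) - f.mean(0) * h.mean(0))  # cov_qs;z

assert np.isclose(cost, noise + bias + var - 2 * cov)  # equation 9.1
```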
10 Future Work

Future work consists of the following projects:

1. Investigate the real-world manifestations of the Bayesian correction to bias plus variance for quadratic loss. (For example, it seems plausible that whereas the bias-variance trade-off involves things like the number of parameters involved in the learning algorithm, the covariance term may involve things like model misspecification in the learning algorithm.)

2. Investigate the real-world manifestations of the bias-variance trade-off for the logarithmic and quadratic scoring definitions of bias and variance used here.

3. See if there are alternative definitions of bias and variance for logarithmic and quadratic scoring that meet our desiderata. More generally, investigate how unique bias-plus-variance decompositions are.

4. Investigate what aspects of the relationship between C and the other random variables (like Y_H and Y_F) are necessary for there to be a bias-plus-variance decomposition for E(C | f, m, q) that meets conditions 1 through 4 and a through c. Investigate how those aspects change as one modifies z and/or the conditions one wishes the decomposition to meet.

5. The EBF provides several natural z-dependent ways to determine how close one learning algorithm is to another. For example, one could define the distance between learning algorithms A and B, Δ(A, B), as the mutual information between the distributions P(c | z, A) and P(c | z, B). One could then explore the relationship between Δ(A, B) and how similar the terms in the associated bias-plus-variance decompositions are.

6. Investigate the "data worth" for other z's besides {m, q} and/or other C's besides the quadratic loss function. In particular, for what C's will all our desiderata be met and yet the intrinsic noise be given exactly by the error associated with the Bayes-optimal learning algorithm?

7. Instead of going from a C(F, H, Q) to definitions for the associated bias, variance, and so on, do things in reverse; that is, investigate the conditions under which one can back out to find an associated C(F, H, Q), given arbitrary definitions of bias, intrinsic noise, variance, and covariance (arbitrary within some class of "reasonable" such definitions).

Acknowledgments

I thank Ronny Kohavi, Tom Dietterich, and Rob Schapire for getting me interested in the problem of bias plus variance for nonquadratic loss functions,
and Ronny Kohavi and David Rosen for helpful comments on the manuscript. This work was supported in part by the Santa Fe Institute and in part by TXN Inc.

References

Bernardo, J., & Smith, A. (1994). Bayesian theory. New York: Wiley and Sons.
Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 421). Berkeley: Department of Statistics, University of California.
Breiman, L. (1996). Bias, variance, and arcing classifiers. Unpublished manuscript.
Buntine, W., & Weigend, A. (1991). Bayesian back-propagation. Complex Systems 5: 603–643.
Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse of dimensionality. Unpublished manuscript.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4: 1–58.
Kohavi, R., & Wolpert, D. (1996). Bias plus variance for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning (pp. 314–321). San Mateo, CA: Morgan Kaufmann.
Perrone, M. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Unpublished doctoral dissertation, Brown University, Providence, RI.
Tibshirani, R. (1996). Bias, variance, and prediction error for classification rules. Unpublished manuscript.
Wolpert, D. (1994). Filter likelihoods and exhaustive learning. In S. J. Hanson et al. (Eds.), Computational Learning Theory and Natural Learning Systems II (pp. 29–50). Cambridge, MA: MIT Press.
Wolpert, D. (1995). The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In D. Wolpert (Ed.), The Mathematics of Generalization. Reading, MA: Addison-Wesley.
Wolpert, D. (1996a). The lack of a priori distinctions between learning algorithms. Neural Computation 8: 1341.
Wolpert, D. (1996b). The existence of a priori distinctions between learning algorithms. Neural Computation 8: 1391.
Wolpert, D., & Kohavi, R. (1997). The mathematics of the bias-plus-variance decomposition for zero-one loss functions. In preparation.
Received September 1, 1995; accepted September 5, 1996.
NOTE
Communicated by David Wolpert
Note on Free Lunches and Cross-Validation Cyril Goutte Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark
The "no-free-lunch" theorems (Wolpert & Macready, 1995) have sparked heated debate in the computational learning community. A recent communication (Zhu & Rohwer, 1996) attempts to demonstrate the inefficiency of cross-validation on a simple problem. We elaborate on this result by considering a broader class of cross-validation. When used more strictly, cross-validation can yield the expected results on simple examples.

1 Introduction

A recent contribution to computational learning, and to neural networks in particular, the no-free-lunch (NFL) theorems (Wolpert & Macready, 1995) give surprising insight into computational learning schemes. One implication of the NFL is that no learning algorithm performs better than random guessing over all possible learning situations. This means in particular that the widely used cross-validation (CV) methods, if successful in some cases, should fail on others. Considering the popularity of these schemes, it is of considerable interest to exhibit such a problem, where the use of CV leads to a decrease in performance. Zhu and Rohwer (1996) propose a simple setting in which a "cross-validation" method yields worse results than the maximum likelihood estimator it is based on. In the following, we extend these results to a stricter definition of cross-validation and provide an analysis and discussion of the results.

2 Experiments

The experimental setting is the following: a gaussian variable x has mean µ and unit variance. The mean should be estimated from a sample of n realizations of x. Three estimators are compared:

A(n): The mean of the n-point sample. It is the maximum likelihood and least mean squares estimator, as well as the optimal unbiased estimator.

B(n): The maximum of the n-point sample.
C(n): The estimator obtained by a cross-validation choice between A(n) and B(n).

The original "cross-validation" setting (Zhu & Rohwer, 1996) will be noted Z(n + 1); it samples one additional point and chooses the estimator A or B that is closer to this point. However, the widely used concept of cross-validation (Ripley, 1996; FAQ, 1996) corresponds to resampling and averaging estimators rather than sampling additional points. In this context, the estimator proposed in Zhu and Rohwer (1996) is closer to the split-sample, or hold-out, method. This method is known to be noisy, especially when the validation set is small, which is the case here. On the other hand, a thorough cross-validation scheme would use several validation sets resampled from the data and average over them before choosing estimator A or B. In the leave-one-out (LOO) flavor, the CV score is calculated as the average distance between each point and the estimator obtained on the rest. Note that in this setting, estimator Z operates with more information (one supplementary data point) than the LOO estimator.

The result of an experiment done with n = 16 and 10⁶ samples gives mean squared errors:

A(16): 0.0624    B(16): 3.4141    C_LOO(16): 0.0624    Z(16+1): 0.5755
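This experiment can be reproduced along the following lines; the sketch assumes the gaussian setting above, with a reduced number of samples for brevity:

```python
import numpy as np

# Compare the mean (A), the max (B), and an LOO cross-validation choice
# between them (C_LOO) on n = 16 unit-variance gaussian points.
rng = np.random.default_rng(5)
mu, n, trials = 0.0, 16, 20_000

def loo_choice(x):
    """Return whichever of mean or max has the lower LOO score."""
    idx = np.arange(len(x))
    score_a = np.mean([(x[idx != k].mean() - x[k]) ** 2 for k in range(len(x))])
    score_b = np.mean([(x[idx != k].max() - x[k]) ** 2 for k in range(len(x))])
    return x.mean() if score_a <= score_b else x.max()

errs = {"A": [], "B": [], "C_LOO": []}
for _ in range(trials):
    x = rng.normal(mu, 1.0, n)
    errs["A"].append((x.mean() - mu) ** 2)
    errs["B"].append((x.max() - mu) ** 2)
    errs["C_LOO"].append((loo_choice(x) - mu) ** 2)

for name, e in errs.items():
    print(f"{name}(16): {np.mean(e):.4f}")  # A and C_LOO near 1/16 = 0.0625
```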
In this case, it seems that the proper cross-validation procedure always picks estimator A, whose theoretical mean squared error is 1/16 = 0.0625.

3 Short Analysis

The simple setting used in these experiments allows for a full analysis of the behavior of C_LOO. Consider n data points x_i. The LOO cross-validation estimate is computed by averaging the squared error between each point and the average of the rest. Let us denote by x̄ the average of all x_i and by x̄_k the average of all x_i excluding example x_k:

x̄_k = (n x̄ − x_k) / (n − 1).

Accordingly,

(x̄_k − x_k) = [n / (n − 1)] (x̄ − x_k),

hence the final cross-validation score for estimator A:

CV = [n / (n − 1)]² S(x̄),    (3.1)

where S(w) = (1/n) Σᵢ₌₁ⁿ (w − x_i)² is the mean squared error between estimator w and the data.
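Equation 3.1 is easy to confirm numerically; a quick check assuming a small gaussian sample:

```python
import numpy as np

# The LOO score of the sample mean equals (n/(n-1))^2 * S(xbar).
rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, 16)
n = len(x)

loo = np.mean([(np.delete(x, k).mean() - x[k]) ** 2 for k in range(n)])
s_xbar = np.mean((x.mean() - x) ** 2)
assert np.isclose(loo, (n / (n - 1)) ** 2 * s_xbar)   # equation 3.1
```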
Let us now note x* = max{x_i} the maximum of the (augmented) sample, and x** = max({x_i} \ x*) the second largest element. The cross-validation score for estimator B is:

CV* = S(x*) + (1/n)(x* − x**)².    (3.2)
In order to realize how the cross-validation estimator behaves, let us first recall that x̄ is the least mean squared error estimator. As such, it minimizes S(w). Furthermore, from Huygens's formula, S(x*) = S(x̄) + (x̄ − x*)². Accordingly, we can rewrite

CV* = S(x̄) + (x̄ − x*)² + [1/(n + 1)](x* − x**)².

These observations show that for equation 3.2 to be lower than equation 3.1, an extremely unlikely situation is required: x* should be quite close to both x̄ and x**. One such situation could arise in the presence of a negative outlier. Because the mean is not robust, the outlier would produce a severe downward bias in estimator A. Thus, the maximum could very well be a better estimator than the mean in such a case. However, no such happenstance was observed in 5 × 10⁶ experiments.

4 Discussion

1. In the experiment of section 2, the LOO estimator does not use the additional data point allotted to C. When using this data point, the performance is identical. CV does not extract any information from the extra sample, but at least manages to keep the information in the original sample.

2. The cross-validation estimator does not perform better than A(16) and yields worse performance than A(17), for which the theoretical MSE is 1/17 ≈ 0.0588. However, the setting imposes a choice between A and B. The optimal choice over 10⁵ samples leads to an MSE of just 0.0617. This is a lower bound for estimator C and is significantly above that of the optimal seventeen-point estimator. On the other hand, a random choice between A and B leads to an MSE of 1.7364 on the same sample.

3. It could be objected that LOO is just one more flavor of cross-validation, so results featuring this particular estimator do not necessarily have any relevance to CV methods as a whole. Let us then compare the performance of m-fold CV on the original sixteen-point sample. It consists in dividing the sample into m validation sets and averaging the validation performance of the m estimators obtained on the remaining data. For 10⁵ samples, we get (CV_m is the m-fold CV estimator):
A
B
CV16 = CLOO
CV8
CV4
CV2
0.0624
3.4141
0.0624
0.0624
0.0624
0.0628
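The m-fold scores in the table can be computed along these lines (a sketch under the same gaussian setting; the fold-assignment details are assumptions of the example):

```python
import numpy as np

# m-fold CV scores for the mean (A) and max (B) estimators: split the
# sample into m folds, fit on the held-in data, average held-out errors.
def mfold_scores(x, m, rng):
    folds = np.array_split(rng.permutation(x), m)
    score_a = score_b = 0.0
    for i in range(m):
        held_out = folds[i]
        held_in = np.concatenate(folds[:i] + folds[i + 1:])
        score_a += np.mean((held_in.mean() - held_out) ** 2)
        score_b += np.mean((held_in.max() - held_out) ** 2)
    return score_a / m, score_b / m

x = np.random.default_rng(8).normal(0.0, 1.0, 16)
print(mfold_scores(x, 4, np.random.default_rng(9)))   # (score_A, score_B)
```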
The slight decrease in performance in CV_2 is due to the fact that we average over only two validation sets. If we resample two additional sets, the performance is identical to all other CV estimators.

4. While requiring additional computation, none of the CV estimators above gains anything on A (even with the help of one additional point). Better performance can be observed, however, with a different choice of estimators. Consider, for example, D(n), the median of the sample, and E(n), the average between the minimum and the maximum. Using 10⁵ samples, we compare the CV estimator to D and E calculated on sixteen-point samples:

D(16): 0.0949    E(16): 0.1529    C_LOO(16): 0.0931
Here cross-validation outperforms both estimators it is based on.

5 Conclusion

The NFL results imply that for every situation where a learning method performs better than random guessing, another situation exists where it performs correspondingly worse. Numerous reports of successful applications of usual learning procedures suggest that the "unspecified prior" under which they outperform random guessing holds in a number of practical cases. Exhibiting these assumptions is of importance in order to check whether the conditions for success hold when tackling a new problem. However, it is unlikely that such a simple setting could challenge these yet unknown assumptions.

Cross-validation has many drawbacks, and it is far from being the most efficient learning method (even among non-Bayesian frameworks). In that simple case, though, it provides perfectly decent results. We now know that there is no free lunch for cross-validation. However, the task of exhibiting an easily understandable, nondegenerate case where it fails has yet to be completed. Furthermore, the task of exhibiting the hidden prior under which cross-validation is beneficial provides challenging prospects for the future.

Acknowledgments

This work was partially supported by a research fellowship from the Technical University of Denmark, and partially performed at LAFORIA (URA 1095), Université Pierre et Marie Curie, Paris. I am grateful to the referees for constructive criticism and to Huaiyu Zhu and Lars Kai Hansen for challenging remarks on drafts of the article.
References

FAQ. (1996). Neural networks frequently asked questions (part 3 of the FAQ in comp.ai.neural-nets). ftp://ftp.sas.com/pub/neural/FAQ3.html.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Wolpert, D. H., & Macready, W. G. (1995). The mathematics of search (Tech. Rep. No. SFI-TR-95-02-010). Santa Fe: Santa Fe Institute.
Zhu, H., & Rohwer, R. (1996). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.
Received October 9, 1996; accepted February 4, 1997.
LETTERS
Communicated by Paul Bush
Gamma Oscillation Model Predicts Intensity Coding by Phase Rather than Frequency

Roger D. Traub
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, and Department of Neurology, Columbia University, New York, NY 10032, U.S.A.
Miles A. Whittington Department of Physiology, Imperial College School of Medicine at St. Mary’s, London W2 1PG, U.K.
John G. R. Jefferys Department of Physiology, University of Birmingham School of Medicine, Birmingham B15 2TT, U.K.
Gamma-frequency electroencephalogram oscillations may be important for cognitive processes such as feature binding. Gamma oscillations occur in hippocampus in vivo during the theta state, following physiological sharp waves, and after seizures, and they can be evoked in vitro by tetanic stimulation. In neocortex, gamma oscillations occur under conditions of sensory stimulation as well as during sleep. After tetanic or sensory stimulation, oscillations in regions separated by several millimeters or more occur at the same frequency, but with phase lags ranging from less than 1 ms to 10 ms, depending on the conditions of stimulation. We have constructed a distributed network model of pyramidal cells and interneurons, based on a variety of experiments, that accounts for near-zero phase lag synchrony of oscillations over long distances (with axon conduction delays totaling 16 ms or more). Here we show that this same model can also account for fixed positive phase lags between nearby cell groups coexisting with near-zero phase lags between separated cell groups, a phenomenon known to occur in visual cortex. The model achieves this because interneurons fire spike doublets and triplets that have average zero phase difference throughout the network; this provides a temporal framework on which pyramidal cell phase lags can be superimposed, the lag depending on how strongly the pyramidal cells are excited.

Neural Computation 9, 1251–1264 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

Gamma-frequency electroencephalogram (EEG) oscillations occur in cortical (and diencephalic) structures during a variety of behavioral states and stimulus paradigms (reviewed by Eckhorn, 1994; Gray, 1994; Singer & Gray, 1995).
Examples for the hippocampus include the theta state (Soltesz & Deschênes, 1993; Bragin, Jandó, Nádasdy, Hetke, & Wise, 1995; Sik, Penttonen, Ylinen, & Buzsáki, 1995), during which gamma activity is continuous; following limbic seizures (Leung, 1987); transient (few hundred ms) gamma following synchronized pyramidal cell firing, such as physiological sharp waves in vivo (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996); epileptiform bursts, either in vivo or in vitro (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996); and following brief tetanic stimulation in vitro (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Stanford, & Jefferys, 1996). Situations where gamma-frequency oscillations occur in the neocortex include sleep—both slow-wave sleep, in which gamma is localized to a spatial scale of 1–2 mm (Steriade, Amzica, & Contreras, 1995; Steriade, Contreras, Amzica, & Timofeev, 1996), and rapid eye movement sleep, in which gamma is widespread (Llinás & Ribary, 1993); during complex motor performance in behaving monkeys (Murthy & Fetz, 1992); following auditory stimulation (Joliot, Ribary, & Llinás, 1994); and following visual stimulation (Gray & Singer, 1989; Frien, Eckhorn, Bauer, Woelbern, & Kehr, 1994). Sensory-evoked gamma oscillations have also been observed in the olfactory bulb (Adrian, 1950) and olfactory cortex (Freeman, 1959). The variety of states associated with gamma oscillations makes it unlikely that gamma serves a simple unified function. Somewhat along the lines of other authors (Singer & Gray, 1995), we shall adopt the following working hypothesis: gamma-frequency oscillations that occur sufficiently locally may perhaps be epiphenomenal, but gamma-frequency oscillations that are synchronized (or at least phase locked) over longer distances (say, several millimeters) act to coordinate pools of neurons devoted to a common task. This hypothesis is especially attractive in the visual system, because stimulation with a long bar elicits oscillations that are tightly synchronized (phase lags can be less than 1 ms) across distances of several millimeters in cortical sites specific for receptive field and orientation preference (Gray, König, Engel, & Singer, 1989). If the hypothesis is correct, it becomes important to work out the cellular mechanisms of gamma oscillations. This could provide experimental tools for manipulating the oscillation and insight into the usefulness for coding of phase versus frequency of oscillations in neuronal populations. With respect to this latter issue, König, Engel, Roelfsema, and Singer (1995) have shown that visual stimulus properties are not encoded solely by precisely synchronized oscillation of selected neuronal populations. Rather, certain optimally driven populations oscillate with near-zero phase difference, even though these populations are spatially separated, whereas other neuronal populations—near to the index population but less optimally driven (by virtue of off-angle orientation tuning)—oscillate with phase delays of up to 4.5 ms. This observation suggests that oscillation phase, but not frequency, could be used to encode stimulus features. In this article, we present a testable cellular mechanism for the
observation of König et al. (1995), based on an experimental and computer model of gamma oscillations. This model derives from hippocampal and neocortical slice studies and has the following basic features:

1. Populations of synaptically connected GABAergic interneurons can generate synchronized gamma-frequency oscillations, without a requirement for ionotropic glutamate input from pyramidal cells, provided the interneurons are tonically excited (Whittington et al., 1995). This can be shown experimentally in hippocampal or neocortical slices, during pharmacological blockade of AMPA (α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid) and NMDA (N-methyl-D-aspartate) receptors, provided there is stimulation of the interneurons with glutamate itself or with the metabotropic glutamate agonist (1S,3R)-ACPD ((1S,3R)-1-aminocyclopentane-1,3-dicarboxylic acid). The physical mechanism by which this synchrony obtains is similar to that described by Wang and Rinzel (1993) for synchronization of GABAergic nucleus reticularis thalami neurons, at near-zero phase lag, by mutual inhibition; it is possible when the inhibitory postsynaptic conductance (IPSC) time course is slow compared to the intrinsic frequency of the uncoupled cells ("intrinsic frequency" as used here means the frequency at which the metabotropically excited cells would fire when synaptically uncoupled). (See also van Vreeswijk, Abbott, & Ermentrout, 1994.)

2. The synchronized firing of a population of pyramidal cells can elicit gamma-frequency oscillations in networks of interneurons, probably in part via a slow, metabotropic excitatory postsynaptic potential (EPSP) (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).

3. Gamma-frequency network oscillations can occur simultaneously with the firing of pyramidal cells, for example, after tetanic stimulation in vitro (Traub, Whittington, Stanford, & Jefferys, 1996).

4. Long-range (at least 4 mm) synchrony of gamma-frequency network oscillations can occur with near-zero phase lags, despite axon conduction delays. This was predicted to occur, based on simulations, when interneurons fire spike doublets, the second spike being elicited by recurrent excitation from both nearby and more distant pyramidal cells. Experiments in the hippocampal slice then showed that interneurons do indeed fire doublets when oscillations occur at two distant sites with near-zero phase lag but fire singlets when oscillations only occur locally (Traub, Whittington, Stanford, & Jefferys, 1996). Also as predicted by this model, the first action potential in the interneuron doublet occurs simultaneously with the pyramidal cell action potential (Traub, Whittington, Stanford, & Jefferys, 1996; Whittington, Stanford, Colling, Jefferys, & Traub, in press). (The model explicated by Bush and Sejnowski (1996) produces tight synchrony between two
columns—also separated by conduction delays—without interneuron doublets but with interneurons firing short bursts. It is not known, however, whether longer-range synchrony can be produced by this model. It is possible that bursts of action potentials, fired by interneurons in the model of Bush and Sejnowski, have a physically similar effect to the doublets and triplets in our model and to the doublets in hippocampal experiments.)

5. In local networks (those without axon conduction delays) that generate gamma-frequency oscillations, mean excitation of pyramidal cells is encoded by firing phase relative to the population oscillation. This is a model prediction, not yet verified experimentally (Traub, Jefferys, & Whittington, in press).

We shall demonstrate how, in a network of appropriate topology, the above properties of gamma oscillations can combine so that distant groups of cells oscillate in phase (when both groups are driven strongly), while at the same time, weakly driven cell groups oscillate at the same frequency but with a phase lag, thereby replicating the result of König et al. (1995).

2 Methods

The approach to the network modeling is similar to that described elsewhere (Traub et al., in press; Traub, Whittington, Stanford, & Jefferys, 1996). Briefly, pyramidal cells (e-cells) are simulated as in Traub et al. (1994), and GABAergic interneurons (i-cells) as in Traub and Miles (1995). Pyramidal cells, while capable of intrinsic bursting, are driven by injection of sufficiently large currents to force them into a repetitive firing mode (Wong & Prince, 1981), consistent with observations on actual pyramidal cell firing patterns after tetanic stimulation in vitro (Whittington, Stanford, Colling, Jefferys, & Traub, in press). Synaptic interactions in the model are:

1. e → i, onto interneuron dendrites, with unitary AMPA receptor EPSP large enough to cause the i-cell to fire (Gulyás et al., 1993). The unitary AMPA receptor conductance is 8·t·e^(−t) nS, with t in ms. In addition, pyramidal cell firing elicits NMDA receptor-mediated EPSPs in interneurons, with kinetics and saturation as described previously (Traub et al., in press). Most likely, the dominant relevant EPSPs for gamma oscillations are metabotropic glutamate, but as long as the EPSPs are slow enough (and a relaxation time constant of 60 ms works), there is little functional significance of this distinction for the model.

2. i → e, onto pyramidal cell somata and proximal dendrites, as would be made by basket cells. The unitary IPSCs are GABA_A receptor-mediated, with abrupt onset to 2 nS, then exponential decay with time constant 10 ms (Whittington, Jefferys, & Traub, 1996).
3. i → i, onto interneuron dendrites. These also are GABA_A receptor-mediated, with abrupt onset to 0.75 nS, then exponential decay with time constant 10 ms (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).

4. e → e synaptic interactions are omitted for the sake of simplicity. Simulation of both local and distributed networks suggests that gamma oscillation properties persist when e → e connections are present, provided the conductances are not too large. In particular, recurrent excitation must be weak enough not to evoke dendritic calcium spikes and cellular bursts (Traub et al., in press). Metabotropic glutamate receptor activation, of likely importance for gamma oscillations, is expected to reduce excitatory synaptic conductances (Baskys & Malenka, 1991; Gereau & Conn, 1995), which might favor the stability of the oscillation. If recurrent excitation occurs in dendritic regions that do not develop full calcium spikes, stability of the oscillation is also favored (Traub, unpublished data).

All three types of synaptic interaction (e → i, i → e, i → i) take place within columns and between columns. As before, pyramidal cells and interneurons are organized into groups (columns), in the present case each having eight pyramidal cells and eight interneurons. The global organization is shown in Figure 1. Such an organization allows stimulation parameters and axon conduction delays to be specified readily. In addition, the network has a form that assumes functional significance. For example, each column might correspond to a visual stimulus orientation/receptive field. If the global visual stimulus is a long horizontal bar, it would excite columns A1 and B1 maximally, and the other columns less; we simulate this by using maximal driving currents to the cells of A1 and B1, with smaller driving currents to the other columns. Hence, current to A1,B1 > A2,B2 > A3,B3 > A4,B4. Specific details are as follows. Pyramidal cells and interneurons receive synaptic input from all pyramidal cells and all interneurons in the same column and in connected columns (see Figure 1). In the example to be illustrated, the average driving current to e-cells in columns A1 and B1 was 2.92 and 3.12 nA, respectively; to columns A2 and B2, 2.36 and 2.52 nA; to A3 and B3, 1.8 and 1.92 nA; to A4 and B4, 1.18 and 1.26 nA. (Thus, the current ratios for columns A1:A2:A3:A4, respectively B1:B2:B3:B4, are 1:0.8:0.6:0.4.) For the example simulation illustrated below, axon conduction delays were less than 0.5 ms within a column, 2 ms for distinct columns within a hypercolumn, and 4 ms between hypercolumns. Qualitatively similar results were obtained with different delays, including 3 ms between distinct columns in a hypercolumn and 6 ms between hypercolumns; and, respectively, 2.5 and 7.5 ms. Also, similar results were obtained with different distributions of driving currents. For example, the following current ratios were also used: 1:x:x:x (where x could be 0.2, 0.5, 0.6, or 0.7); 1:0.8:0.7:0.6; 1:0.7:0.6:0.5; 1:0.6:0.5:0.4; 1:0.6:0.4:0.2. Hence, no particular parameter tuning was necessary to produce the qualitative behavior described below. A rough sketch of this model setup follows.
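A minimal sketch of the setup just described (our illustrative Python, not the authors' FORTRAN code; the between-column connection arrows of Figure 1 are not reproduced, and the kernels below are simplified stand-ins for the conductance time courses given above):

import numpy as np

# Columns A1..A4 and B1..B4, each with 8 pyramidal cells (e) and 8 interneurons (i).
columns = [h + str(n) for h in "AB" for n in range(1, 5)]

def conduction_delay_ms(c1, c2):
    """Axon conduction delays used in the example simulation."""
    if c1 == c2:
        return 0.5               # "less than 0.5 ms" within a column
    if c1[0] == c2[0]:
        return 2.0               # distinct columns within a hypercolumn
    return 4.0                   # between hypercolumns

# Driving-current ratios 1:0.8:0.6:0.4 across columns 1..4, scaled by the
# mean e-cell currents reported for A1 (2.92 nA) and B1 (3.12 nA).
ratios = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4}
base_nA = {"A": 2.92, "B": 3.12}
drive_nA = {c: base_nA[c[0]] * ratios[int(c[1])] for c in columns}

# Unitary synaptic conductances (t in ms after the presynaptic spike).
def g_ampa_e_to_i(t):
    return 8.0 * t * np.exp(-t)           # nS; peaks near t = 1 ms

def g_gaba(t, g_peak_nS):
    return g_peak_nS * np.exp(-t / 10.0)  # abrupt onset, 10 ms decay

g_i_to_e = lambda t: g_gaba(t, 2.0)       # basket-cell IPSC onto e-cells
g_i_to_i = lambda t: g_gaba(t, 0.75)      # IPSC onto interneuron dendrites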
Figure 1: Structure of the model. The network consists of two “hypercolumns,” A and B, each consisting of four columns. A hypercolumn can be thought of as corresponding to a piece of receptive field and a column as corresponding to a piece of receptive field with a certain orientation. Structurally, a column has eight pyramidal cells and eight GABAergic interneurons, with between-column connections as indicated by the arrows. When two columns are connected by an arrow (or a column to itself), it means that each cell (pyramidal or interneuron) in one column is connected to each cell (pyramidal or interneuron) in the other. Axon conduction delays are less than 0.5 ms within a column and, in later figures, 2 ms between columns in the same hypercolumn and 4 ms between hypercolumns.
A total of 29 simulations provided background data for this study, along with over 200 simulations of models with different distributed network topology. Programs were written in FORTRAN augmented with special instructions for IBM SP1 and SP2 parallel computers. Simulation of 1500 ms of neural activity took 3.6 hours on the SP2 and 5.2 hours on the SP1, both using eight nodes. This study thus used 835 SP2 node-hours. (For further details on the simulations, send e-mail to [email protected].)

3 Results

We shall illustrate in detail the results of one particular simulation. In this simulation, the connected columns A1 and B1, while in distinct hypercolumns and separated by 4 ms axon-conduction delays, are both driven maximally (and nearly equally); all other columns are driven less (but not uniformly less). This is the situation that might be expected to occur in visual cortex following stimulation with a long bar.
Figure 2: Behavior of two maximally stimulated columns in different hypercolumns, separated by 4 ms axon conduction delay. The average of four pyramidal cell voltages (respectively, interneuron voltages) is plotted for cells from hypercolumn A, column 1 (A1) and hypercolumn B, column 1 (B1). The population oscillation (35 Hz) is apparent. Pyramidal cell firing in A1 may lag firing in B1 (−) or lead it (+). Interneurons fire synchronized doublets and triplets. (The peaks in these signals that are less than full action potential height result from averaging a small number of voltage signals, in which action potentials are not perfectly synchronized.)
The behavior of the system can be summarized as follows:

1. All columns oscillate at the same frequency (35 Hz).

2. Pyramidal cell firing in the maximally stimulated columns, A1 and B1, occurs with some relative jitter (see Figure 2), but the mean phase lag in the cross-correlation is less than 1 ms (Figure 4). This occurs despite the axon conduction delay of 4 ms (or up to 7.5 ms in other simulations, not shown) between hypercolumns A and B.

3. In contrast, pyramidal cells in nearby (but differently excited) columns A1 and A3 fire consistently out of phase, with maximally driven A1 always leading A3 (see Figure 3), and a mean phase lag of 2.8 ms (see Figure 4). This takes place even though the conduction delay is only 2 ms between A1 and A3.

4. Interneurons fire in synchronized doublets and triplets (see Figures 2 and 3), in a pattern similar to that described previously (Traub, Whittington, Stanford, & Jefferys, 1996).
Figure 3: Behavior of two nearby columns, separated by 2 ms axon conduction delay, one column stimulated maximally and the other not. The average of four pyramidal cell voltages (respectively, interneuron voltages) is plotted for cells from hypercolumn A, column 1 (A1) and hypercolumn A, column 3 (A3). The pyramidal cells in A3 receive 0.6 times the mean driving current of pyramidal cells in A1. Pyramidal cell firing in A1 consistently leads pyramidal cell firing in A3 (+).
5. The interneuron cross-correlations are, at least approximately, distributed symmetrically about zero (see Figure 5).

4 Discussion

Our model replicates an interesting experimental observation on stimulus-evoked gamma-frequency oscillations in the visual cortex: distant (at least 7 mm experimentally), but strongly stimulated, columns can oscillate with near-zero phase lag, while nearby, but unequally stimulated, columns can oscillate at the same frequency, but with the weakly stimulated column lagging the strongly stimulated one. Thus, oscillation phase would encode one aspect of how strongly neurons are driven by a nonlocal stimulus (Hopfield, 1995). The essential idea is this: in local networks, current drive to pyramidal cells alters phase relative to the mean oscillation; in distributed networks with uniform stimulation, phase delays can be small despite long axon conduction delays; and, at least with a simple network topology (and without recurrent excitation), these two principles can work together. In order for the replication of data to be useful, the model must meet two
Figure 4: Auto- and cross-correlations of mean pyramidal cell voltages. Auto- and cross-correlations were performed on 750 ms of data, the average voltage of four pyramidal cells from each of three columns (A1, A3, and B1). Each column oscillates at 35 Hz. Columns A1 and B1, both maximally driven but also with maximum respective axon conduction delay, oscillate with 0.9 ms mean phase difference. Column A1 leads column A3 by an average of 2.8 ms. Columns A1 and A3 have only 2 ms mutual axon conduction delay, but A3 receives 0.6 times the drive of A1. (The auto- and cross-correlations appear so smooth because it is intracellular voltage signals, including subthreshold voltages, that are being correlated, rather than extracellular unit firing signals, as might be used with in vivo data.)
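The lag measurement summarized in this caption can be illustrated as follows (a rough sketch on synthetic 35 Hz traces, not the authors' analysis code; with real data the inputs would be the averaged intracellular voltages):

import numpy as np

def peak_lag_ms(v1, v2, dt_ms):
    """Lag (ms) at which the normalized cross-correlation of two traces
    peaks; positive values mean v1 leads v2."""
    v1 = v1 - v1.mean()
    v2 = v2 - v2.mean()
    xc = np.correlate(v2, v1, mode="full")
    xc /= np.sqrt(np.dot(v1, v1) * np.dot(v2, v2))
    lags = np.arange(-(len(v1) - 1), len(v2)) * dt_ms
    return lags[np.argmax(xc)]

# Example: 750 ms of 35 Hz activity, with the second trace delayed by 2.8 ms.
dt = 0.1                                    # ms per sample
t = np.arange(0.0, 750.0, dt)
a1 = np.sin(2 * np.pi * 35.0 * t / 1000.0)
a3 = np.sin(2 * np.pi * 35.0 * (t - 2.8) / 1000.0)
print(peak_lag_ms(a1, a3, dt))              # approx. +2.8: a1 leads a3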
requirements: it must rest on a reasonable experimental basis, and it must make experimental predictions. Selected, but critical, elements of the network model are supported by the following data:

1. The subnetwork consisting of a local group of interneurons, synaptically interconnected, correctly predicts the dependence of oscillation frequency on GABA_A conductance and IPSC time constant; it also correctly predicts the breakup of the oscillation as a sufficiently low frequency is reached (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Wang & Buzsáki, 1996).

2. The model correctly predicted the existence of a "tail" of gamma-frequency oscillations following the synchronous firing of pyramidal cells (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).
Figure 5: Auto- and cross-correlations of mean interneuron voltages. Auto- and cross-correlations were performed on 750 ms of data, the average voltage of four interneurons from each of three columns (A1, A3, and B1). Synchronized multiplet firing shows up as the secondary peaks in the autocorrelations at ±4 to 5 ms. The A1/A3 cross-correlation has a trough near 0 ms, with peaks at +1.7 and −2.4 ms.
3. The model correctly predicted several nonintuitive features of collective gamma oscillations in the hippocampal slice: oscillations coherent over long distances (up to 4 mm), but not oscillations occurring only locally, are associated with firing of interneuron doublets (intradoublet interval usually about 4–5 ms) at gamma frequency; the first interneuron spike of the doublet is synchronized with the pyramidal cell action potential; population oscillations slow in frequency as larger neuronal ensembles are entrained (Traub, Whittington, Stanford, & Jefferys, 1996; Whittington, Stanford, Colling, Jefferys, & Traub, in press).

4. Experiments indicate that tetanic stimuli of different amplitudes, applied to different subregions of CA1 (in the hippocampal slice), lead to simultaneous oscillations in the respective subregions, of common frequency but different phase (Whittington, Stanford, Colling, Jefferys, & Traub, in press).

These data provide confidence that certain structural underpinnings of the gamma oscillation model are accurate. Of course, it is true that most
of the experimental data supporting this model derive from the hippocampal CA1 region, wherein recurrent excitation is sparse, while we would like to draw conclusions about the neocortex, wherein recurrent excitation is dense. Further experiments and simulations are clearly necessary. The experimental analysis of recurrent excitation may prove difficult. For example, nonspecific blockade of AMPA receptors will block AMPA receptors on interneurons, which is predicted to disrupt long-range synchronization (Traub, Whittington, Stanford, & Jefferys, 1996). While a selective blocker for interneuron AMPA receptors exists (Iino, Koike, Isa, & Ozawa, 1996), what is required instead is a blocker specific for pyramidal cell AMPA receptors. An alternative approach, used here, is to generate testable predictions in a model that simply omits (for now) e → e connections. What is different about the model explained here, as compared with the model in previous publications, is the more complex topological structure, which attempts to capture two connected "hypercolumns" rather than simply a one-dimensional chain of cell groups, and the use of different axon conduction delays for nearby columns, as opposed to more distant ones. One result of this additional complexity is the more complicated firing patterns of the interneurons in this, as compared with earlier, network models (Traub et al., in press; Traub, Whittington, Stanford, & Jefferys, 1996): triplets in the present case versus predominantly doublets before. This feature may provide one critical experimental test of this model; if interneurons exist that connect between cortical columns and hypercolumns, possibly layer III large basket cells (Kisvárday, 1992; Kisvárday, Beaulieu, & Eysel, 1993; Kisvárday & Eysel, 1993), then these interneurons should fire in short bursts during visual stimulation with large objects, objects that evoke gamma oscillations synchronized over many millimeters. (See also Bush & Sejnowski, 1996.) Furthermore, on average, interneuron bursts should be synchronized with one another.
Acknowledgments

This work was supported by IBM and the Wellcome Trust. We are grateful for helpful discussions to Hannah Monyer, Rodolfo Llinás, Charles Gray, Wolf Singer, György Buzsáki, and Nancy Kopell.
References

Adrian, E. D. (1950). The electrical activity of the mammalian olfactory bulb. Electroenceph. Clin. Neurophysiol., 2, 377–388.
Baskys, A., & Malenka, R. C. (1991). Agonists at metabotropic glutamate receptors presynaptically inhibit EPSCs in neonatal rat hippocampus. J. Physiol., 444, 687–701.
Bragin, A., Jandó, G., Nádasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Bush, P. C., & Sejnowski, T. J. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comput. Neurosci., 3, 91–110.
Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res., 102, 405–426.
Freeman, W. J. (1959). Distribution in space and time of prepyriform electrical activity. J. Neurophysiol., 22, 644–666.
Frien, A., Eckhorn, R., Bauer, R., Woelbern, T., & Kehr, H. (1994). Stimulus-specific fast oscillations at zero phase between visual areas V1 and V2 of awake monkey. NeuroReport, 5, 2273–2277.
Gereau, R. W., IV, & Conn, P. J. (1995). Multiple presynaptic metabotropic glutamate receptors modulate excitatory and inhibitory synaptic transmission in hippocampal area CA1. J. Neurosci., 15, 6879–6889.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci., 1, 11–38.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Gulyás, A. I., Miles, R., Sik, A., Tóth, K., Tamamaki, N., & Freund, T. F. (1993). Hippocampal pyramidal cells excite inhibitory neurons through a single release site. Nature, 366, 683–687.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Iino, M., Koike, M., Isa, T., & Ozawa, S. (1996). Voltage-dependent blockage of Ca2+-permeable AMPA receptors by joro spider toxin in cultured rat hippocampal neurones. J. Physiol., 496, 431–437.
Joliot, M., Ribary, U., & Llinás, R. (1994). Human oscillatory brain activity near 40 Hz coexists with cognitive temporal binding. Proc. Natl. Acad. Sci. USA, 91, 11748–11751.
Kisvárday, Z. F. (1992). GABAergic networks of basket cells in the visual cortex. Prog. Brain Res., 90, 385–405.
Kisvárday, Z. F., Beaulieu, C., & Eysel, U. T. (1993). Network of GABAergic large basket cells in cat visual cortex (area 18): Implication for lateral disinhibition. J. Compar. Neurol., 327, 398–415.
Kisvárday, Z. F., & Eysel, U. T. (1993). Functional and structural topography of horizontal inhibitory connections in cat visual cortex. Eur. J. Neurosci., 5, 1558–1572.
König, P., Engel, A. K., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Comp., 7, 469–485.
Leung, L. S. (1987). Hippocampal electrical activity following local tetanization. I. Afterdischarges. Brain Res., 419, 173–187.
Llinás, R., & Ribary, U. (1993). Coherent 40-Hz oscillation characterizes dream state in humans. Proc. Natl. Acad. Sci. USA, 90, 2078–2081.
Murthy, V. N., & Fetz, E. E. (1992). Coherent 25- to 35-Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Natl. Acad. Sci. USA, 89, 5670–5674.
Sik, A., Penttonen, M., Ylinen, A., & Buzsáki, G. (1995). Hippocampal CA1 interneurons: An in vivo intracellular labeling study. J. Neurosci., 15, 6651–6665.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Rev. Neurosci., 18, 555–586.
Soltesz, I., & Deschênes, M. (1993). Low- and high-frequency membrane potential oscillations during theta activity in CA1 and CA3 pyramidal neurons of the rat hippocampus under ketamine-xylazine anesthesia. J. Neurophysiol., 70, 97–116.
Steriade, M., Amzica, F., & Contreras, D. (1995). Synchronization of fast (30–40 Hz) spontaneous cortical rhythms during brain activation. J. Neurosci., 16, 392–417.
Steriade, M., Contreras, D., Amzica, F., & Timofeev, I. (1996). Synchronization of fast (30–40 Hz) spontaneous oscillations in intrathalamic and thalamocortical networks. J. Neurosci., 16, 2788–2808.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (in press). Simulation of gamma rhythms in networks of interneurons and pyramidal cells. J. Comput. Neurosci.
Traub, R. D., & Miles, R. (1995). Pyramidal cell-to-inhibitory cell spike transduction explicable by active dendritic conductances in inhibitory cell. J. Comput. Neurosci., 2, 291–298.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Traub, R. D., Whittington, M. A., Stanford, I. M., & Jefferys, J. G. R. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G. R., & Traub, R. D. (in press). Spatiotemporal patterns of gamma-frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol.
Whittington, M. A., Jefferys, J. G. R., & Traub, R. D. (1996). Effects of intravenous anaesthetic agents on fast inhibitory oscillations in the rat hippocampus in vitro. Br. J. Pharmacol., 118, 1977–1986.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wong, R. K. S., & Prince, D. A. (1981). Afterpotential generation in hippocampal pyramidal cells. J. Neurophysiol., 45, 86–97.
Received September 27, 1996; accepted January 22, 1997.
Communicated by William Skaggs
Multiunit Normalized Cross Correlation Differs from the Average Single-Unit Normalized Correlation

Purvis Bedenbaugh
Department of Otolaryngology and Keck Center for Integrative Neuroscience, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
George L. Gerstein Department of Neuroscience, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
As the technology for simultaneously recording from many brain locations becomes more available, more and more laboratories are measuring the cross-correlation between single-neuron spike trains, and between composite spike trains derived from several undiscriminated cells recorded on a single electrode (multiunit clusters). The relationship between single-unit correlations and multiunit cluster correlations has not yet been fully explored. We calculated the normalized cross-correlation (NCC) between single-unit spike trains and between small clusters of units recorded in the rat somatosensory cortex. The NCC between small clusters of units was larger than the NCC between single units. To understand this result, we investigated the scaling of the NCC with the number of units in a cluster. Multiunit cross-correlation can be a more sensitive detector of neuronal relationship than single-unit cross-correlation. However, changes in multiunit cross-correlation are difficult to interpret uniquely because they depend on the number of cells recorded on each electrode and because they can arise from changes in the correlation between cells recorded on a single electrode or from changes in the correlation between cells recorded on two electrodes.

Neural Computation 9, 1265–1275 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

One of the great challenges of contemporary neuroscience is to understand the distributed patterns of brain activity associated with learned behaviors and remembered stimuli. Increasing evidence supports the view that assemblies of neurons (Hebb, 1949; Gerstein, Bedenbaugh, & Aertsen, 1989) subserving particular percepts and behaviors are embedded in a nonlinear, cooperative network. The collective activity of the cells within a neuronal assembly provides a context for interpreting the activity of each of its component cells. The experimental detection of such neuronal assemblies and study of their properties depends on simultaneously recording from many neurons and on measurement of timing interrelationships among the several neuronal activities.
Analysis of the interactions between neuronal assemblies is even more demanding of recording technology and of intensive calculation. For over thirty years, the cross-correlation between two neuronal spike trains (Perkel, Gerstein, & Moore, 1967b), its generalizations (Gerstein & Perkel, 1972; Aertsen, Gerstein, Habib, & Palm, 1989), and its homolog in the frequency domain (Perkel, 1970; Rosenberg, Amjad, Breeze, Brillinger, & Halliday, 1989) have been the principal tools for assessing neuronal relationships. Methods have been developed to normalize the cross-correlation so that the resulting correlation coefficient may be made independent both of firing rate and of modulations in firing rate (Aertsen et al., 1989; Palm, Aertsen, & Gerstein, 1988) and may be fairly compared across different observations. When these methods are applied to single-unit recordings, correlating the spikes of two simultaneously recorded single neurons, the results are unambiguous. It is clear which particular neurons are correlated; changes in correlation are due to a change in the coordination of particular cells, not to changes in recruitment or to poor recording stability. There are, however, difficulties in using spike train pair correlation to study neural assembly properties. Single-unit cross-correlation is a relatively insensitive measure of neuronal relationship (Bedenbaugh, 1993). Given the low firing rates of most cortical neurons, this makes it difficult to detect a relationship between simultaneously recorded spike trains within a reasonably short recording time. In turn, this translates into fewer data points per experiment, a major limitation for investigations that seek to explore the distributed pattern of neuronal responses and their coordination. The development of spike-shape sorting technology has partially improved the situation with respect to sampling and efficiency problems, since it allows separation of multiunit observations from a single electrode into spike trains that originate from different neurons. However, the price is a complex technology, and also a combinatorial explosion of pairs as the number of separable simultaneously recorded neurons increases. The latter problem has been partially mitigated, although at the cost of intensive computation (Gerstein, Perkel, & Dayhoff, 1985; Gerstein & Aertsen, 1985). In reaction to these difficulties, a number of investigators have begun to use multiunit recordings without separation of waveforms; that is, they use a composite data stream consisting of several superimposed spike trains from each electrode. The rationale is that cells in localized cortical columns have homogeneous response properties. Changes in the firing rate of a multiunit recording often reflect changes in the activity of the individual neurons in the population. If we assume that possible confounding factors, such as recruitment of additional neurons or recording instability, can be minimized, it is reasonable and increasingly common to study neuronal assembly properties by cross-correlation of multiunit recordings.
We might expect that the normalized cross-correlation (NCC) between two multiunit spike trains would correspond to the average cross-correlation between the single units recorded on different electrodes. However, for our recordings in rat somatosensory cortex, we found that the multiunit normalized cross-correlation was larger than the single-unit normalized cross-correlation. The analysis in this article suggests that this may be because the correlation between units recorded on a single electrode influences the multiunit cross-correlation. This view is also consistent with Kreiter and Singer's (1995) recent report of dependence of multiunit cross-correlation on the local correlation between the single-unit spike trains that comprise the multiunit spike trains.

2 Methods

2.1 Experimental. Extracellular neuronal potentials were recorded from the forepaw region of the somatosensory cortex of seven rats using either three or five tungsten microelectrodes and conventional methods, described elsewhere (Bedenbaugh, 1993; Bedenbaugh, Gerstein, & Bolton, 1997). Recordings from each of the several electrodes used in each experiment were separated into a total of eighty-nine single-unit spike trains with 0.5 millisecond time resolution using hardware principal component spike sorters (Gerstein, Bloom, Espinosa, Evanczuk, & Turner, 1983). Of these, 638 pairs of single-unit trains had been simultaneously recorded and were available for cross-correlation analysis. Finally, a total of twenty-one multiunit spike trains were created by merging of the sorted single-unit spike trains. Each cell pair was recorded under four different somatic stimulation conditions: no stimulus (spontaneous activity), touch to the best receptive field of the cells recorded by one electrode, touch to the best receptive field of the cells recorded by another electrode, and a simultaneous touch to both skin sites (Bedenbaugh, 1993; Bedenbaugh et al., 1997). The data were blocked into a sequence of stimulus-dependent trials, and peri-stimulus-time (PST) histograms and raw cross-correlograms were computed for each stimulus condition. The normalized cross-correlation (correlation coefficient) histogram for a pair of spike trains was calculated by subtracting the cross-correlation of the two PST histograms divided by the number of stimuli (PST predictor) (Perkel et al., 1967a) from the raw cross-correlogram, then normalizing the result by the geometric mean variance of the two point processes (Bedenbaugh, 1993; Bedenbaugh et al., 1997).

2.2 Analytical. Consider the situation illustrated schematically in Figure 1. There are two electrodes, A and B, each recording from m cells so that the spike train recorded on electrode A is A(t) = a_1(t) + a_2(t) + · · · + a_m(t), and the spike train recorded on electrode B is B(t) = b_1(t) + b_2(t) + · · · + b_m(t), where t is time, and m = 2 in the figure. Discretize time into N_T total bins of equal duration, short enough that any one cell may fire no more than one
spike in one bin-width of time. If we assign the value 1 to the occurrence of a spike, and 0 to the absence of a spike in a bin, then the expected number of spikes per unit time of any individual cell, say a_i, is N_i/N_T, where N_i = Σ_{k=1}^{N_T} a_{ik}, where k denotes a particular time bin. Since the point process derived from the spike train can only take the values 0 or 1, the expectation (mean value) E[a_{ik}²] is also N_i/N_T. Further, assuming that all of the individual spike trains contain the same number of spikes and that the number of times any two cells recorded on a single electrode both fire in one time bin is N_close, we see then that the mean and variance of a multiunit process, say A, are

μ_A = E[A_k] = Σ_{i=1}^{m} E[a_{ik}] = m N_i/N_T

σ_A² = E[A_k²] − E²[A_k]
     = E[(Σ_{i=1}^{m} a_{ik})(Σ_{j=1}^{m} a_{jk})] − μ_A²
     = Σ_{i=1}^{m} Σ_{j=1}^{m} E[a_{ik} a_{jk}] − μ_A²
     = m(m − 1) N_close/N_T + m N_i/N_T − m² N_i²/N_T²,
where expectation is taken across the time index k. The assumptions leading to these equations are equivalent to saying that all of the cells recorded by a single electrode have the same firing rate, and the normalized cross-correlation between any two of the single-unit spike trains recorded by a single electrode has the same value, ρ_close. If we further assume that the correlation coefficient between the spike trains of any two cells recorded on different electrodes has the same value, ρ_distant, and that the average firing rate of all of the cells recorded on both electrodes is the same, we can find that the covariance of the two multiunit spike trains is

C_AB = E[A_k B_k] − E[A_k] E[B_k]
     = E[(Σ_{i=1}^{m} a_{ik})(Σ_{j=1}^{m} b_{jk})] − μ_A μ_B
     = Σ_{i=1}^{m} Σ_{j=1}^{m} E[a_{ik} b_{jk}] − μ_A μ_B
     = m² N_distant/N_T − m² N_i²/N_T².
Figure 1: Analytical paradigm. The analysis considers the cross-correlation between two multiunit spike trains recorded by electrode A (recording cells 1 and 2) and electrode B (recording cells 3 and 4). The analysis considers the situation in which all spike trains recorded on a single electrode have one value of normalized cross-correlation (ρ_close, solid lines) and spike trains recorded on different electrodes have another value of NCC (ρ_distant, dashed lines).
Normalizing by the geometric mean of the variance of the two multiunit spike trains and canceling terms, we find that the normalized cross-correlation (correlation coefficient) function for the multiunit spike trains is

ρ_AB = C_AB / √(σ_A² σ_B²) = m ρ_distant / ((m − 1) ρ_close + 1).
Although it might appear that this expression for ρ_AB could range beyond the interval [−1, 1], this possibility is excluded by the assumptions that the correlation between any two cells recorded on one electrode is the same and that the correlation between any two cells recorded on different electrodes is the same. Moreover, in the usual situation for neurophysiological recordings, with the number of spikes small compared to the total number of time bins, only positive and very small negative values for the cross-correlation are possible. A numerical check of the expression above follows.
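As a quick numerical sanity check on this expression (our sketch, with arbitrary parameter values; the shared-event construction below realizes the special case ρ_close = ρ_distant discussed later with Figure 5):

import numpy as np

def rho_multiunit(m, rho_close, rho_distant):
    """Closed-form multiunit correlation coefficient derived above."""
    return m * rho_distant / ((m - 1) * rho_close + 1)

rng = np.random.default_rng(0)
NT, m = 200_000, 2                      # time bins; cells per electrode
event = rng.random(NT) < 0.02           # common source driving all cells
spikes = lambda: (event & (rng.random(NT) < 0.5)).astype(float)
cells_a = [spikes() for _ in range(m)]  # electrode A
cells_b = [spikes() for _ in range(m)]  # electrode B
A, B = sum(cells_a), sum(cells_b)

corr = lambda x, y: np.corrcoef(x, y)[0, 1]
rho_c = corr(cells_a[0], cells_a[1])    # empirical within-electrode NCC
rho_d = corr(cells_a[0], cells_b[0])    # empirical between-electrode NCC
print(corr(A, B), rho_multiunit(m, rho_c, rho_d))  # the two should agree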
Figure 2: Single- and multiunit peak NCC (normalized cross-correlation). The peak NCC for our population of pairs of single-unit spike trains is shown at the left. At the right, the peak NCC for the corresponding population of multiunit spike trains is shown. The multiunit correlation peaks are much larger.
3 Results

3.1 Experimental. Figure 2 shows the area under the central peak (τ = ±10 msec, where τ is the time-shift variable) of the normalized cross-correlation for our rat somatosensory system data set (Bedenbaugh, 1993; Bedenbaugh et al., 1997). The normalized cross-correlation for the multiunit spike trains is clearly larger than the normalized cross-correlation for the single-unit spike trains. Note that the scale on the ordinate is different for the two graphs; since only one multiunit spike train was obtained from each electrode, the data set included many more single-unit cross-correlograms.

3.2 Analytical. Figure 3 shows that for our model system, the multiunit normalized cross-correlation (ρ_multiunit) is a linear function of the NCC between the single-unit spike trains recorded on different electrodes, ρ_distant. The slope of this relationship is determined by the NCC between spike trains recorded on the same electrode. Counting the correlation between each of the cells recorded on one electrode and each cell recorded on the
Figure 3: Multiunit NCC as a function of distant NCC (m = 2). The multiunit NCC, ρ_multiunit, is proportional to the distant NCC, ρ_distant. The slope is linearly proportional to m, the number of cells recorded on a single electrode, and is inversely related to the close NCC, ρ_close. Note that the minimum slope is 1, when m = 1 and ρ_close = 1.
other electrode effectively amplifies the multiunit NCC. Each spike train has multiple opportunities to contribute to the multiunit NCC. Note that the linear dependence of ρ_multiunit on ρ_distant with zero intercept implies that a difference in the sign of two ρ_multiunit measurements may be unambiguously interpreted as arising from a difference in the sign of ρ_distant. This is also evident in the symmetry about the abscissa of ρ_multiunit as a function of ρ_close, as seen in Figure 4. The special case of ρ_close = ρ_distant = ρ is illustrated in Figure 5. Note that the slope for negative correlation is steeper than for positive correlation, implying that in this circumstance multiunit NCC is a more sensitive detector of negative cross-correlation than of positive cross-correlation. The NCC is plotted as an inset in Figure 5 for the case of two cells recorded on each electrode. The special case of ρ_close = ρ_distant courses a trajectory diagonally across this surface.
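To make the amplification concrete (an illustrative calculation, not from the original): with m = 2 cells per electrode, ρ_close = 0, and ρ_distant = 0.1, the formula gives ρ_multiunit = (2 × 0.1)/((2 − 1) × 0 + 1) = 0.2, twice the underlying between-electrode correlation; raising ρ_close to 0.5 reduces this to 0.2/1.5 ≈ 0.13.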
Figure 4: Multiunit NCC as a function of close NCC (m = 2). The multiunit NCC, ρ_multiunit, is a nonlinear function of the close NCC, ρ_close. Note that in the situation we considered, positive values for ρ_distant lead to positive values for ρ_multiunit, and negative values for ρ_distant lead to negative values for ρ_multiunit.
3.3 Discussion. This study was inspired by the observation that the normalized cross-correlation (correlation coefficient) between multiunit spike trains recorded from the rat somatosensory cortex was significantly larger than the cross-correlation between the single-unit spike trains composing the multiunit recordings. Our analysis suggests why this should be so and suggests that multiunit cross-correlation is an intrinsically more sensitive detector of neuronal relationships than is single-unit cross-correlation.

Two technical points should be noted. First, the analysis here is valid for any bin in a normalized cross-correlation histogram, for both exact coincidences and delayed coincidences. Second, the analysis presented here does not take into account the refractory period of the recording system measuring the multiunit spike trains, which is incapable of resolving simultaneous events on a single electrode. This dead time deletes some nearly simultaneous spikes from different cells from the multiunit spike train, and
Figure 5: Multiunit NCC when all single-unit NCCs are equal. In the special case where the NCC between any two cells is the same, regardless of location (ρ_close = ρ_distant = ρ), ρ_multiunit is a nonlinear function of ρ. Given our assumptions, it is an even more sensitive indicator of negative cross-correlation than of positive cross-correlation. Note, however, that the range of negative correlations that can satisfy the restriction that all recorded cells be equally correlated is small. The inset plots the multiunit correlation as a surface parameterized by ρ_close and ρ_distant for the case of two neurons recorded on each electrode. The curve for m = 2 in the main plot courses a trajectory diagonally across this surface.
so should reduce the sensitivity of the multiunit cross-correlation to correlations between spike trains recorded on a single electrode. This analysis considered the case in which all observed neurons have the same, constant firing rate. If firing rates change slowly, these results should still apply. If some of the neurons within a multiunit cluster fire at different rates, we expect less dependence of the cross-correlation on the number of neurons in a cluster. For this model system, the multiunit NCC is larger than the single-unit cross-correlation between cells recorded on different electrodes. This means
that although the NCC between two multiunit spike trains is related to the NCC between the individual spike trains recorded by the two electrodes, it is not an unambiguous quantitative measure of that relationship. For example, if we take ρ_close = 0, the multiunit NCC increases linearly with the number of cells comprising the multiunit spike trains, so that a change in recruitment may masquerade as a change in multiunit cross-correlation. More generally, a change in multiunit NCC can also result from either a change in the local NCC between cells recorded on a single electrode or a change in the distant NCC between cells recorded on different electrodes, unless the change results in a sign change. These ambiguities limit the utility of comparisons among individual multiunit cross-correlograms. There is a place for comparing populations of multiunit cross-correlograms across treatment groups, though possible changes in the typical number of cells recorded due to the treatment would remain a concern. On the other hand, multiunit NCC is a more sensitive indicator of a relationship between spike trains than is single-unit NCC, and if the recordings are stable and stationary, a change in multiunit cross-correlation implies some underlying change in the cross-correlation between the single-unit spike trains. Multiunit NCC may be especially useful for detecting negative cross-correlations, which have been observed more rarely than have positive cross-correlations. The ambiguity with respect to the underlying correlation relationship and the difficulty of assessing the stationarity (uniform recruitment) and stability of multiunit recordings limit the conclusions that can be drawn by comparing multiunit cross-correlation measurements. With appropriate caution, the technical ease with which such measurements can be obtained, and their special sensitivity for detecting cross-correlation, make them useful tools for neuroscientists nonetheless.

Acknowledgments

A preliminary version of this work was presented as a poster at the 1993 annual meeting of the Society for Neuroscience (Bedenbaugh & Gerstein, 1993). It was supported by NIDCD F32-DC00144 and RO1-DC01249 and by NIMH MH R37-467428. Thanks to Michael Brecht and Heather Read for reading the manuscript. Thanks to Hagai Attias for helpful discussions. Portions of this work were performed in the laboratory of Michael M. Merzenich.

References

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61(5), 900–917.
Bedenbaugh, P. H. (1993). Plasticity in the rat somatosensory cortex induced
by local microstimulation and theoretical investigations of information flow through neurons. Unpublished doctoral dissertation, University of Pennsylvania, Philadelphia.
Bedenbaugh, P. H., & Gerstein, G. L. (1993). Multiunit normalized cross correlation differs from the average single unit normalized correlation in rat somatosensory cortex. Society for Neuroscience Abstracts, 19(2), 1566.
Bedenbaugh, P. H., Gerstein, G. L., & Bolton, M. (1997). Intracortical microstimulation in rat somatosensory cortex: Receptive field plasticity and somatic stimulus dependent spike train correlations. Unpublished manuscript.
Gerstein, G. L., & Aertsen, A. M. H. J. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. Journal of Neurophysiology, 54(6), 1513–1528.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Gerstein, G. L., Bloom, M. J., Espinosa, I. E., Evanczuk, S., & Turner, M. R. (1983). Design of a laboratory for multineuron studies. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 668–676.
Gerstein, G. L., & Perkel, D. H. (1972). Mutual temporal relationships among neuronal spike trains. Biophysical Journal, 12, 453–473.
Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5(4), 881–889.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Kreiter, A. K., & Singer, W. (1995). Dynamic patterns of synchronization between neurons in area MT of awake macaque monkeys. Society for Neuroscience Abstracts, 21(2), 905.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biological Cybernetics, 59, 1–11.
Perkel, D. H. (1970). Spike trains as carriers of information. In F. O. Schmitt (Ed.), The neurosciences: A second study program (pp. 587–596). New York: Rockefeller University Press.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967a). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7(4), 419–440.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967b). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophysical Journal, 7(4), 391–418.
Rosenberg, J. R., Amjad, A. M., Breeze, P., Brillinger, D. R., & Halliday, D. M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Progress in Biophysics and Molecular Biology, 53, 1–31.
Received May 14, 1996; accepted November 12, 1996.
Communicated by Garrison Cottrell
A Simple Common Contexts Explanation for the Development of Abstract Letter Identities

Thad A. Polk
Department of Psychology, University of Michigan, Ann Arbor, MI 48109, U.S.A.
Martha J. Farah Department of Psychology, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
Abstract letter identities (ALIs) are an early representation in visual word recognition that are specific to written language. They do not reflect visual or phonological features, but rather encode the identities of letters independent of case, font, sound, and so forth. How could the visual system come to develop such a representation? We propose that because many letters look similar regardless of case, font, and other characteristics, these provide common contexts for visually dissimilar uppercase and lowercase forms of other letters (e.g., e between k and y in key and E in the visually similar context K-Y). Assuming that the distribution of words’ relative frequencies is comparable in upper- and lowercase (that just as key is more frequent than pew, KEY is more frequent than PEW), these common contexts will also be similarly distributed in the two cases. We show how this statistical regularity could lead Hebbian learning to produce ALIs in a competitive architecture. We present a self-organizing artificial neural network that illustrates this idea and produces ALIs when presented with the most frequent words from a beginning reading corpus, as well as with artificial input. 1 Introduction People can read text written in many different ways: in UPPERCASE, lowercase, or even AlTeRnAtInG cAsE; in different sizes and in different fonts. In many cases, these different forms look very different (e.g., compare are and ARE) and yet we effortlessly identify them as the same word. How do we do it? One popular hypothesis is that an early stage in reading involves the computation of abstract letter identities (ALIs), that is, a representation of letters that denotes their identity but abstracts away from their visual appearance (uppercase versus lowercase, font, size, etc.) (Coltheart, 1981; Besner, Coltheart, & Davelaar, 1984; Bigsby, 1988; Mozer, 1989; Prinzmetal, Hoffman, & Vest, 1991), and a number of empirical studies have provided support for this hypothesis. For example, even when subjects are asked to classify pairs Neural Computation 9, 1277–1289 (1997)
© 1997 Massachusetts Institute of Technology
of letter strings as same or different based purely on physical criteria, letter strings that differ in case but share the same letter identities (e.g., HILE/hile) are distinguished less efficiently than are strings with different spellings but the same phonological code (e.g., HILE/hyle) (Besner et al., 1984). This result suggests that subjects do compute ALIs (identical ALIs [HILE/hile] impair performance), and this computation cannot be turned off and is thus architectural (the task is based on physical equivalence, not ALI equivalence, and yet ALIs affected performance). Also, just as subjects tend to underestimate the number of letters in a string of identical letters (e.g., DDDD) compared with a heterogeneous string (e.g., GBUF), subjects also tend to underestimate the total number of Aa’s and Ee’s in a display more often when uppercase and lowercase instances of a single target appear (e.g., Aa) than when one of each target appears (e.g., Ae; Mozer, 1989). The repetition of ALIs must be responsible for this effect because visual forms were not repeated in the mixed-case displays (e.g., uppercase A and lowercase a look different). Notice that these effects cannot be due simply to the visual similarity of different visual forms for a given letter because they also show up for letters whose visual forms are not visually similar (e.g., A/a, D/d, E/e). It appears that this representation is not a phonological code but is computed earlier by the visual system itself. For example, Coltheart (1981) reported a brain-damaged patient who had lost the ability to produce or distinguish the sounds of words and letters, but was able to retrieve the meanings of written words and to determine whether nonwords that differed in case were composed of the same letters (ANER/aner versus ANER/aneg). This result suggests that the representation of ALIs does not depend on letter phonology because access to ALIs was preserved even when letter phonology was severely impaired. Similarly, Reuter-Lorenz and Brunn (1990) described a patient with damage to the left extrastriate visual cortex who was impaired at extracting letter identities and was thus unable to read. In a functional magnetic resonance imaging study, Polk et al. (1996) found an area in left inferior extrastriate visual cortex that responded significantly more to alternating case words and pseudowords than to consonant strings. A natural interpretation of this result is that the brain area is sensitive to abstract, rather than visual, letter and word forms, because the stimuli were visually unfamiliar. Finally, Bigsby (1988) presented empirical evidence in normal subjects suggesting that ALIs are computed prior to phonological or lexical access. The evidence, then, is that a relatively early representation in the visual processing of orthography (before lexical access or the representation of phonology) is specific to our writing system and does not reflect fundamental visual properties such as shape. Different stimuli that have little or no visual similarity (e.g., A and a) are represented with similar codes. What would lead visual areas to develop such a representation?
2 The Common Contexts Hypothesis

We propose that the visual contexts in which different forms of the same letter appear interact with correlation-based Hebbian learning in the visual system of the brain to produce ALIs. Specifically, because many letters look similar regardless of case, font, and so forth, we assume that different visual forms of the same letter share similar distributions of visual contexts and that this correlation leads the visual system to produce representations corresponding to ALIs. People are exposed to any given word written in a variety of different ways: all capitals, lowercase, in different fonts, in different sizes, and so on. Because the shapes of some letters are relatively invariant over these transformations, they tend to make the contexts in which one visual form of a letter appears similar to the contexts in which other visual forms of that same letter appear. For example, if the visual form a occurs in certain contexts (e.g., between c and p in cap, before s in as), then the visual form A will also occur in visually similar contexts (between C and P in CAP, before S in AS). Of course, the different visual forms of some letters are fairly different (e.g., D versus d), and so there will be some contexts that are unique to one visual form of a letter (e.g., a but not A occurs in the context d-d). But given that a number of letters are similar in their uppercase and lowercase forms (Cc, Kk, Oo, Pp, Ss, Uu, Vv, Ww, Xx, Zz, and to some degree Bb, Ff, Hh, Ii, Jj, Ll, Mm, Nn, Tt, and Yy; the obvious exceptions are Aa, Dd, Ee, Gg, Qq, and Rr) and that letters in different fonts and sizes tend to look similar, there will be a significant proportion of common contexts for dissimilar-looking uppercase and lowercase letter forms.
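The toy sketch below (ours, not from the original article) makes the context statistics concrete: it counts the (left, right) visual contexts in which a letter form appears, collapsing case-similar letters onto a single form. The SIMILAR set, the word list, and all names are illustrative assumptions. Even on this tiny vocabulary, a and A come out with identical context counts, while visually distinct letters would not:

```python
# Toy count of "common visual contexts." Letters whose upper- and lowercase
# shapes are assumed similar are collapsed onto one visual form; word
# boundaries are marked with "_".
SIMILAR = set("ckopsuvwxz")  # assumed case-similar letters (illustrative)

def visual_form(ch):
    return ch.lower() if ch.lower() in SIMILAR else ch

def contexts(words, target):
    """Count the (left form, right form) contexts in which `target` appears."""
    counts = {}
    for w in words:
        padded = f"_{w}_"
        for i in range(1, len(padded) - 1):
            if padded[i] == target:
                key = (visual_form(padded[i - 1]), visual_form(padded[i + 1]))
                counts[key] = counts.get(key, 0) + 1
    return counts

words = ["cap", "CAP", "as", "AS", "dog", "DOG"]
print(contexts(words, "a"))  # {('c', 'p'): 1, ('_', 's'): 1}
print(contexts(words, "A"))  # identical context counts, despite a and A looking different
```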
3 A Neural Network Model Why would common contexts lead the visual system to produce representations corresponding to ALIs? Figure 1 presents a simple and natural mechanistic model that demonstrates one possibility. The model is a two-layer neural network that uses a Hebbian learning rule to modify the weights of the connections between the input and output layers. Hebbian learning is a neurophysiologically plausible mechanism that generally corresponds to the following rule: if two units are both firing, then their connection is strengthened; if only one unit of a pair is firing, then their connection is weakened (Hebb, 1949). The input layer represents the visual forms of input letters using a localist representation (each unit represents a visual form, and similar visual forms, such as C and c, are represented by the same unit). This representation does not code letter position because it does not play a role in our explanation. Initially, the output layer does not represent anything (since the connections from the input layer are initially random), but with training it should self-organize to represent ALIs. Each unit in the
Figure 1: A neural network model of the development of abstract letter identities. The input layer (top) represents the visual forms of input letters but does not code their position. Letters whose uppercase and lowercase forms are visually similar are represented by a single unit (e.g., C, P, S). In this example, the word cap is presented. Initially, the output representation is random (left), but eventually a cluster of activity develops (right), and Hebbian learning strengthens the connections to it from the input letters.
output layer is connected to its eight surrounding neighbors by excitatory connections (except for units on the edges of the output layer; there is no wraparound), and units farther away are connected by inhibitory connections, in keeping with previous models of cortical self-organization (von der Malsburg, 1973). As the last simulation will demonstrate, the critical assumption is that there is competition in the output layer (implemented here by inhibitory connections); the excitatory connections are not necessary. Figure 1 illustrates the model's behavior when the first word (say, cap) is initially presented. The pattern of output activity is initially random (see Figure 1, left), reflecting the random initial connection strengths. The short-range excitatory connections lead to clusters around the most active units (see Figure 1, middle), and these in turn drive down activity elsewhere via the longer-range inhibitory connections, leading to a single cluster (or a small number of clusters) (see Figure 1, right). The Hebbian rule then strengthens the connections to this active cluster from the active input units (the letters c, a, and p), but weakens the connections from other (inactive) inputs as well as the connections from the active inputs to the inactive output units.
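The rule itself can be stated in a few lines. The following is our own minimal rendering of the strengthen-if-both-fire, weaken-if-one-fires rule for binary activities; the learning rate, array sizes, and function name are illustrative assumptions, and the lateral competition described above is omitted:

```python
import numpy as np

def simple_hebb(W, pre, post, lr=0.05):
    """Hebbian update for 0/1 activity vectors `pre` (inputs) and `post` (outputs):
    strengthen a connection when both units fire, weaken it when only one does."""
    both = np.outer(pre, post)                                    # both units active
    only_one = np.outer(pre, 1 - post) + np.outer(1 - pre, post)  # exactly one active
    return W + lr * both - lr * only_one

# Example: three active letter inputs, one active output unit in the winning cluster.
W = np.zeros((26, 42))
pre = np.zeros(26)
pre[[2, 0, 15]] = 1            # c, a, p (indices illustrative)
post = np.zeros(42)
post[10] = 1                   # a unit in the winning cluster
W = simple_hebb(W, pre, post)
```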
Because of these weight changes, c, a, and p will subsequently be biased toward activating units in that cluster, while other inputs (e.g., d) will be biased away from that cluster. So, for example, if the word dog was presented next, it would tend to excite units outside the cluster excited by cap. Now suppose we present the same word, but in uppercase (CAP; see Figure 2). Because C and P are visually similar to c and p, their input representations will also be similar (in this simple localist model, that means they excite the same units; in a more realistic distributed model, the representations would share many units rather than being identical). As a result, C and P will be biased toward exciting some of the same output units that cap excited. The input A, however, has no such bias. Indeed, its connections to the cap cluster would have been weakened (because it was inactive when the cluster was previously active), and it excites units outside this cluster (top of the output layer at the left of Figure 2). The cluster inhibits these units via the long-range inhibitory connections, and it eventually wins out (right of Figure 2). Hebbian learning again strengthens the connections from the active inputs (including A) to the cluster. The result is that a and A are biased toward exciting nearby units despite the fact that they are visually dissimilar and initially excited quite different units. An ALI has emerged. If this were the whole story, then one might expect all letters to converge on the same output representation because many different letters occasionally occur in common contexts (e.g., a and o in c-t). The reason this does not happen is that the distributions of contexts in which different letters appear are very different. For most contexts in which a occurs, there is a visually similar context in which A occurs (in the same word written in uppercase). Furthermore, these contexts will occur with comparable relative frequencies because words that are frequent in lowercase will also tend to be frequent in uppercase.¹ So if a frequently occurs in a given context (e.g., says), then the corresponding visually similar context will also be frequent for A (e.g., SAYS). And infrequent contexts for a will be infrequent for A (zap and ZAP). As a result, the same forces that shape the representation of a (that it looks more like s than like z) will shape the representation of A, and the two inputs will end up having similar output representations. The same is not true for different letters. Although a and o occur in some of the same contexts (cat and cot), there are many contexts in which only one of the two letters can appear. Furthermore, even contexts that admit either letter will not occur with comparable relative frequencies (cat is roughly fifty times more frequent than cot in beginning reading books). Thus, very

¹ The claim is that these contexts appear with comparable relative, not absolute, frequencies. Lowercase contexts are more frequent than uppercase contexts in most corpora, but the relative frequencies are similar. So just as says is more frequent than zap, SAYS is more frequent than ZAP, and it does not matter much to the model whether SAYS is more or less frequent than zap.
Figure 2: The behavior of the network from Figure 1 when presented with the word CAP. C and P excite the previous cluster because of their visual similarity to the c and p in cap, but at first A excites a distinct set of units (top of the output layer). These two regions of activity compete until the previous cluster wins out. Hebbian learning then strengthens the connections from all the active inputs (including A) to the cluster, biasing a and A toward exciting nearby units.
different forces will shape the representation of these letters, and they will end up with dissimilar output representations. 4 Results Figure 3 shows the results of presenting a network like this with simple stimuli that satisfy the constraints outlined above. The stimuli consisted of a random sequence of thirty-six three-letter words: twelve in uppercase and twenty-four in lowercase (the same twelve words that appeared in uppercase were presented in lowercase twice in the random sequence). Each word contained one letter from a set of three ALI candidate letters—each of these letters had two possible visual forms (e.g., uppercase and lowercase)— and each such letter appeared in four of the twelve words. The other two letters were randomly chosen from a set of twenty whose visual forms were similar in uppercase and lowercase. Figure 3 shows the output activity when presenting each of the three candidate ALI letters in each of their visual forms after training. In all three cases, the output representations for the two different visual forms are virtually identical; the output representation
Figure 3: The patterns of output activity when presenting two different visual forms for each of three letters to the initial network model after training. In all three cases, the two different visual forms activate similar output representations, that is, ALIs. Note that each of the three letters has a distinct representation.
corresponds to an ALI for the letter. Also note that the output representations of the three different letters were distinct. In fact, these output patterns were unique; no other inputs produced these patterns.

5 Other Simulations

This simple simulation demonstrates the feasibility of the common contexts hypothesis, but it does not demonstrate that the idea would work for more realistic input sets. Accordingly, we developed three other simulations using such sets. The third is the most realistic model of the three, but because the first two led to some important insights incorporated in the third, it is worth describing them briefly. In order to allow for all twenty-six letters, we increased the number of input units from twenty-six to thirty-two (twenty-six lowercase letters and the uppercase forms for the six letters that are most dissimilar in the two forms: A, D, E, G, Q, and R). Once again, all input units were connected to all output units, and connections between neighboring output units were excitatory while those between nonneighboring units were inhibitory. Our first attempt at a more realistic input set was Dr. Seuss's Hop on Pop. After "reading" this book ten times, the simulation
developed similar representations for A and a but not for the other five letters with dissimilar forms. This corpus was too small for upper- and lowercase contexts to be comparably distributed (e.g., many words that occurred frequently in lowercase never appeared in uppercase, and vice versa), however, so it is not surprising that the other letters failed. Indeed, it was the failure of these other letters that suggested the importance of the overall distribution of contexts. Accordingly, we used a more realistic corpus from beginning reading books for our next simulation (Baker & Freebody, 1989). This corpus is composed of the words from a large number of school books that are used in teaching children to read in Australia. The corpus was far too large to train our simulation on in its entirety, so we constructed a training set of 3750 words from the 50 most frequent words, with appropriate relative frequencies, in random order (50 percent of all the words in the corpus came from this set of 50 words). Also, the corpus did not distinguish different visual forms of the same word, so we presented the same number of UPPERCASE, lowercase, and Initial Capital forms for each word.² This simulation consistently developed an ALI for E/e, but not for A/a or D/d, and the other three letters with dissimilar forms (G, Q, R) developed almost no representations at all. Both models demonstrated that the hypothesized interaction between common contexts and correlation-based learning can lead to some ALIs. The reason ALIs did not develop for all the letters is that the simple Hebbian learning rule we were using puts more and more emphasis on frequent letters and less and less on infrequent letters. Whenever an input is inactive, its connections to any active outputs are weakened. As a result, the connections from infrequent letters to output units that are activated with any regularity are constantly reduced in strength. Conversely, the connections from frequent inputs to the outputs they activate are regularly being strengthened. The result is that frequent letters come to dominate the representation of words. This undermines the development of ALIs in two ways. First, frequent letters such as A dominate the output representations of words in which they occur, making the surrounding context letters irrelevant. Thus, if A and a start out with different output representations, as they usually do, the common contexts in which they occur become irrelevant because A and a dominate the representations rather than the context letters. Second, whatever ALIs may develop for infrequent letters quickly get squashed by the weakening of their connection strengths when those letters are not present (and they normally are not present). This problem with the simple Hebbian learning rule we used has led previous researchers to reject it in favor of more sophisticated alternatives in
² There were two exceptions. The lowercase forms of proper names still had the first letter capitalized, and the Initial Capital forms of verbs that are not imperatives (which would rarely occur as the first word of a sentence) were presented in lowercase.
(Figure 4 panel labels: A, D, E, G, Q, R; a, d, e, g, q, r.)
Figure 4: The patterns of output activity when presenting two different visual forms for each of six letters to the final network model after training. In most cases, the two different visual forms again activate similar output representations, that is, ALIs.
which weights are normalized in some way so that frequent inputs will not completely dominate infrequent ones. After all, real neural learning mechanisms are quite adept at exploiting input features that are predictive, even if they are very infrequent. In our final simulation, we adopted a zero-sum Hebbian learning rule, based on a number of neurophysiological constraints, that has been shown to have nice convergence properties in both activation space and weight space (O'Reilly & McClelland, 1992; see Miller, 1990, and Rolls, 1989, for similar rules). The appendix gives the details of the model. Figure 4 presents the results of training the revised architecture on the beginning reading corpus using the zero-sum Hebbian learning rule. O'Reilly and McClelland (1992) used this rule in a winner-take-all architecture with inhibitory connections but no excitatory connections within a layer, so we tried it both with and without excitatory connections. The results were similar, so we present only the results without excitatory connections. As Figure 4 shows, the upper- and lowercase forms of five of the six letters of interest converged on similar representations. The representations of R and r were different, but only because the weights from r were smaller than those from R, so only the most active output passed the threshold of the all-or-none style sigmoid activation function that was used. The underlying patterns of weights were similarly distributed; the weights from R were simply stronger. Also, notice that the representations of the infrequent letters (G, Q, and R) are similar to each other. The reason is that the zero-sum Hebbian rule is weight conserving: the positive and negative weight changes sum to zero. As a result, when the weights from infrequent inputs to frequently activated outputs are weakened, the connections from those infrequent inputs to the infrequent outputs are strengthened. Thus, the infrequent outputs are
shared by all the infrequent inputs, and their representations look similar. There are certainly differences among the representations of the infrequent letters, but these results indicate that a substantial part of the representations for these letters is indicating that the letter is infrequent, not that it is a G, Q, or R. In fact, Q and q developed similar representations despite never appearing in the corpus at all. Finally, it is worth pointing out that although the different visual forms for each letter converged on similar representations, the similarity is not as strong as it was for the first simulation, which used artificial input. The reason is that the realistic input set used in this simulation often violated the assumptions of the common contexts hypothesis. Many of the words in the beginning reading corpus are composed of multiple letters that look different in upper- and lowercase (e.g., all three letters in are/ARE look different in upper- and lowercase). These words, some of which are among the most frequent in the corpus, tend to undermine the development of ALIs because they do not provide common visual contexts in their upper- and lowercase forms. Do humans represent A and a with similar codes, or do they compute an identical representation at some point during processing? The empirical evidence reviewed above does not distinguish these alternatives. Like identical codes, similar codes would make it harder to distinguish strings that differed only in case and could lead subjects to underestimate the number of target letters in a display in which upper- and lowercase instances of a single target appear. In any case, the common contexts hypothesis does not take a stand on this issue. The reason is that unsupervised neural networks are exceptionally good at learning to map similar inputs to identical outputs, so it would be straightforward to compute identical representations for different visual forms of individual letters from the outputs of the current model. 6 Discussion It is becoming clear that modality-specific visual processing can become specialized for dealing with written language. ALIs are not the only such representation. For example, in a positron emission tomography study, Petersen, Fox, Snyder, and Raichle (1990) found that certain areas in the left medial extrastriate visual cortex were activated by visually presented words and pseudowords that obey English spelling rules, but not by nonsense strings of letters or letter-like forms. The visual characteristics of these different stimulus types were carefully matched, suggesting that these visual areas had developed a representation that is sensitive to the statistical regularities of real text, not just to low-level visual features that appear in a wide range of stimuli. How could visual representations come to encode abstract, orthographic regularities? In the case of ALIs, we proposed that the statistics of the visual environment (the common contexts surrounding letters in words) interact
with correlation-based learning in the brain to form representations of letter identities that abstract away from differences in physical form (e.g., a, A). A variant of this hypothesis has previously been used to explain the observed neural localization of arbitrary and noninnate categories (e.g., letters and digits) (Polk & Farah, 1995a, 1995b; Farah, Polk, et al., 1996). To demonstrate the feasibility of the common contexts hypothesis as an explanation for the development of ALIs, we showed that simple competitive Hebbian networks spontaneously self-organize to produce ALIs when exposed to orthographic input that preserves a basic statistical feature of real text, namely, that different visual forms of the same letter appear in similarly distributed common contexts. An obvious alternative to the proposed explanation for the development of ALIs is that ALIs emerge from hearing the same sound associated with the different visual forms of a given letter, for example, when children are learning to read. Any supervised learning algorithm, such as backpropagation, could obviously learn to group the different visual forms of a single letter if it was provided with phonological feedback. This explanation is certainly viable, but the common contexts hypothesis does have certain advantages. First, supervised learning does not have the independent physiological plausibility that Hebbian learning has. Second, if phonology plays such a crucial role, one might expect an impairment in letter phonology to undermine any attempts to read. But phonological impairments need not destroy reading (Coltheart, 1981). Thus, one would need to assume that phonology plays a critical role in learning ALIs, but that once they are acquired, they no longer depend on phonology. Third, this hypothesis must assume that phonology, a representation that is relatively far downstream, influences the development of a representation used in a simple visual matching task, which does not even require recognizing that letters are present, much less identifying them and retrieving their names (Besner et al., 1984). The common contexts hypothesis, on the other hand, proposes that ALIs arise bottom up, from an interaction between the statistics of shapes present in the visual input and the nature of cortical learning mechanisms. It therefore accommodates in a natural way the findings that ALIs are independent of phonology and that ALI effects show up in simple visual matching tasks. Furthermore, the hypothesis does not depend on supervised learning, but rather is based on a simple, unsupervised Hebbian mechanism that has independent physiological plausibility. Empirical work is needed to distinguish these hypotheses conclusively, but the common contexts hypothesis does offer a viable and attractive alternative to the phonological account. Appendix: Details of the Final Neural Network Model The network has seventy-four total units. Thirty-two are inputs (6 ALI candidate letters × 2 visual forms each + the 20 other letters) and the other
forty-two are outputs in a 7 × 6 2D arrangement. All input units are connected to all output units with plastic connections. Output units are connected to each other by fixed inhibitory connections (weight = −0.1). The minimum and maximum unit firing rates are fixed at 0.0 and 100.0, and the minimum and maximum connection weights are fixed at 0.0 and 1.0. Initially, the activity of output units is uniform random between 0.0 and 10.0, and the connection weights from inputs to outputs are uniform random between 0.0 and 0.6. The following zero-sum Hebbian learning rule based on firing rate is used after every cycle to update connection strengths between input and output units (for details and motivation, see O'Reilly & McClelland, 1992):

\[
\text{change}_{ij} = 0.01\,\bigl(o_j - \langle o\rangle_{\mathrm{out}}\bigr)\bigl(o_i - \langle o\rangle_{\mathrm{in}}\bigr)
\]
\[
w_{ij} \leftarrow
\begin{cases}
w_{ij} + \text{change}_{ij}\,(1.0 - w_{ij}) & \text{if } \text{change}_{ij} > 0 \\
w_{ij} + \text{change}_{ij}\,(w_{ij} - 0.0) & \text{otherwise,}
\end{cases}
\]

where o_i is the activation of input unit i, o_j is the activation of output unit j, ⟨o⟩_in is the average activation in the input layer, ⟨o⟩_out is the average activation in the output layer, and w_ij is the weight of the connection between unit i and unit j. The output units use a sigmoid transfer function:

\[
\text{output} = \frac{100.0}{1 + e^{-(\text{input} - 40.0)}}.
\]

The total input to each output unit (the weighted sum of the activations of the units to which it is connected) is multiplied by a 0.9 gain factor before passing through the transfer function. The input units are clamped to their values (0.0 when not firing, 100.0 when firing) and do not decay.
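For readers who want to experiment, here is a minimal numpy sketch of one update cycle of the appendix model. This is our code, not the authors'; it follows the parameters above, but for brevity it omits the settling dynamics under the fixed lateral inhibition, and the input indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_OUT = 32, 42                              # 32 inputs; 7 x 6 = 42 outputs
W = rng.uniform(0.0, 0.6, size=(N_IN, N_OUT))     # plastic input-to-output weights

def transfer(x, W, gain=0.9):
    """Sigmoid transfer function: 100 / (1 + exp(-(net - 40))), with 0.9 input gain."""
    net = gain * (x @ W)
    return 100.0 / (1.0 + np.exp(-(net - 40.0)))

def zero_sum_hebb(W, o_in, o_out, lr=0.01):
    """Zero-sum Hebbian rule: change_ij = 0.01 (o_j - <o>_out)(o_i - <o>_in),
    with positive changes scaled by (1 - w_ij) and negative ones by (w_ij - 0),
    which keeps every weight inside [0, 1]."""
    change = lr * np.outer(o_in - o_in.mean(), o_out - o_out.mean())
    return W + np.where(change > 0.0, change * (1.0 - W), change * (W - 0.0))

# One cycle: clamp a three-letter word (three inputs firing at 100), update weights.
x = np.zeros(N_IN)
x[[2, 0, 15]] = 100.0                             # illustrative indices for c, a, p
W = zero_sum_hebb(W, x, transfer(x, W))
```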
Acknowledgments

This research was supported by a grant-in-aid for training from the McDonnell-Pew program in cognitive neuroscience to T.A.P. and by grants from the NIH, ONR, Alzheimer's Disease Association, and the Research Association of the University of Pennsylvania. We gratefully acknowledge Max Coltheart for helpful comments on this research and article. Requests for reprints should be sent to Thad Polk, Department of Psychology, University of Michigan, 525 E. University, Ann Arbor, Michigan 48109-1109, U.S.A.

References

Baker, C. D., & Freebody, P. (1989). Children's first school books: Introductions to the culture of literacy. New York: Blackwell.
Besner, D., Coltheart, M., & Davelaar, E. (1984). Basic processes in reading: Computation of abstract letter identities. Canadian Journal of Psychology, 38, 126–134.
Bigsby, P. (1988). The visual processor module and normal adult readers. British Journal of Psychology, 79, 455–469.
Coltheart, M. (1981). Disorders of reading and their implications for models of normal reading. Visible Language, 15, 245–286.
Farah, M., Polk, T. A., Stallcup, M., Aguirre, G., Alsop, D., D'Esposito, M., Detre, J., & Zarahn, E. (1996). Localization of a fine-grained category of shape: An extrastriate letter area revealed by fMRI. Society for Neuroscience Abstracts, 291(1).
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Miller, K. D. (1990). Correlation-based models of neural development. In M. Gluck & D. Rumelhart (Eds.), Neuroscience and connectionist theory. Hillsdale, NJ: Erlbaum.
Mozer, M. (1989). Types and tokens in visual letter perception. Journal of Experimental Psychology: Human Perception and Performance, 15, 287–303.
O'Reilly, R. C., & McClelland, J. L. (1992). The self-organization of spatially invariant representations (Tech. Rep. No. PDP.CNS.92.5). Pittsburgh, PA: Carnegie Mellon University.
Petersen, S. E., Fox, P. T., Snyder, A. Z., & Raichle, M. E. (1990). Activation of extrastriate and frontal cortical areas by visual words and word-like stimuli. Science, 249, 1041–1044.
Polk, T. A., & Farah, M. (1995a). Late experience alters vision. Nature, 376, 648–649.
Polk, T. A., & Farah, M. (1995b). Brain localization for arbitrary stimulus categories: A simple account based on Hebbian learning. Proceedings of the National Academy of Sciences USA, 92, 12370–12373.
Polk, T. A., Stallcup, M., Aguirre, G., Alsop, D., D'Esposito, M., Detre, J., Zarahn, E., & Farah, M. (1996). Abstract, not just visual, orthographic knowledge encoded in extrastriate cortex: An fMRI study. Society for Neuroscience Abstracts, 291(2).
Prinzmetal, W., Hoffman, H., & Vest, K. (1991). Automatic processes in word perception: An analysis from illusory conjunctions. Journal of Experimental Psychology: Human Perception and Performance, 17, 902–923.
Reuter-Lorenz, P. A., & Brunn, J. L. (1990). A prelexical basis for letter-by-letter reading: A case study. Cognitive Neuropsychology, 7(1), 1–20.
Rolls, E. T. (1989). Functions of neuronal networks in the hippocampus and neocortex in memory. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches. San Diego: Academic Press.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.

Received May 10, 1995; accepted November 11, 1996.
Communicated by Graeme Mitchison
A Unifying Objective Function for Topographic Mappings Geoffrey J. Goodhill Sloan Center for Theoretical Neurobiology, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A. and Georgetown Institute for Cognitive and Computational Sciences, Georgetown University Medical Center, Washington, DC 20007, U.S.A.
Terrence J. Sejnowski The Howard Hughes Medical Institute, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A. and Department of Biology, University of California San Diego, La Jolla, CA 92037, U.S.A.
Many different algorithms and objective functions for topographic mappings have been proposed. We show that several of these approaches can be seen as particular cases of a more general objective function. Consideration of a very simple mapping problem reveals large differences in the form of the map that each particular case favors. These differences have important consequences for the practical application of topographic mapping methods. 1 Introduction The notion of a topographic mapping that takes nearby points in one space to nearby points in another appears in many domains, both biological and practical. A number of algorithms have been proposed that claim to construct such mappings (reviewed in Goodhill & Sejnowski, 1996). However, a fundamental problem with these claims, or the claim that a particular given mapping is topographic, is that the computational-level meaning of topography has not been formally defined. Given the wide variety of contexts in which topography or degrees of topography are discussed, it is unlikely that an application-independent definition would be generally useful. A more productive goal is to attempt to lay bare the assumptions behind different approaches to topographic mappings, and thus clarify the relationships among them. In this way, the hope is to make it easier to choose the appropriate approach for a particular problem. In this article, we introduce an objective function we call the C measure and show that it has the capacity to unify several different approaches to topographic mapping. These approaches constitute particular instantiations of the functions from which the C measure is constructed. We then consider a simple mapping problem and explicitly optimize different versions of the C measure. It becomes apparent that the “optimally topographic” map can radically change depending on the particular measure employed. Neural Computation 9, 1291–1303 (1997)
© 1997 Massachusetts Institute of Technology
2 The C measure

For the purposes of this article, we consider only bijective (1-1) mappings and a finite number of points in each space. Some mapping problems intrinsically have this form, and many others can be reduced to this by separating out a clustering or vector quantization step from the mapping step. Consider an input space Vin and an output space Vout, each of which contains N points (see Figure 1). Let M be the mapping from points in Vin to points in Vout. We use the word space in a general sense: either or both of Vin and Vout may not have a geometric interpretation. Assume that for each space there is a symmetric similarity function that, for any given pair of points in the space, specifies by a nonnegative scalar value how similar (or dissimilar) they are. Call these functions F for Vin and G for Vout. Then we define a cost function C as follows:

\[
C = \sum_{i=1}^{N} \sum_{j} F(i,j)\, G(M(i), M(j)), \tag{2.1}
\]
where i and j label points in Vin, and M(i) and M(j) are their respective images in Vout. The sum is over all possible pairs of points in Vin. Since we have assumed that M is a bijection, it is therefore invertible, and C can equivalently be written as

\[
C = \sum_{i=1}^{N} \sum_{j} F(M^{-1}(i), M^{-1}(j))\, G(i,j), \tag{2.2}
\]
where i and j label points in Vout, and M⁻¹ is the inverse map. A good mapping is one with a high value of C. However, if one of F or G is given as a dissimilarity function (i.e., increasing with decreasing similarity), then a good mapping has a low value of C. How F and G are defined is problem specific. They could be Euclidean distances in a geometric space, some (possibly nonmonotonic) function of those distances, or they could just be given, in which case it may not be possible to interpret the points as lying in some geometric space. C measures the correlation between the F's and the G's. It is straightforward to show that if a mapping that preserves orderings exists, then maximizing C will find it. This is equivalent to saying that for two vectors of real numbers, their inner product is maximized over all permutations within the two vectors if the elements of the vectors are identically ordered, a proof of which can be found in Hardy, Littlewood, and Pólya (1934, p. 261).

Figure 1: The mapping framework.

3 Measures Related to C

Several objective functions for topographic mapping can be seen as versions of the C measure, given particular instantiations of the F and G functions.

3.1 Metric Multidimensional Scaling. Metric multidimensional scaling (metric MDS) is a technique originally developed in the context of psychology for representing a set of N entities (e.g., subjects in an experiment) by N points in a low- (usually two-) dimensional space. For these entities one has a matrix that gives the numerical dissimilarity between each pair of entities. The aim of metric MDS is to position points representing entities in the low-dimensional space so that the set of distances between each pair of points matches as closely as possible the given set of dissimilarities. The particular objective function optimized is the summed squared deviations of distances from dissimilarities. The original method was presented in Torgerson (1952); for reviews, see Shepard (1980) and Young (1987). In terms of the framework presented earlier, the MDS dissimilarity matrix is F. Note that there may not be a geometric space of any dimensionality for which these dissimilarities can be represented by distances (for instance, if the dissimilarities do not satisfy the triangle inequality), in which case Vin does not have a geometric interpretation. Vout is the low-dimensional, continuous space in which the points representing entities are positioned, and G is Euclidean distance in Vout. Metric MDS selects the mapping M by adjusting the positions of points in Vout, which minimizes

\[
\sum_{i=1}^{N} \sum_{j} \bigl(F(i,j) - G(M(i), M(j))\bigr)^{2}. \tag{3.1}
\]
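Before examining further instantiations, here is a small Python sketch (ours, not the authors' code; the toy problem, the schedule constants, and the convention of summing over pairs i < j are illustrative assumptions). It evaluates C for a candidate bijection and improves the map by the pairwise-interchange simulated annealing used in section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

def C(F, G, M):
    """C measure (equation 2.1): sum over pairs i < j of F(i,j) * G(M(i), M(j))."""
    return float(np.sum(np.triu(F * G[np.ix_(M, M)], k=1)))

def anneal(F, G, steps=20000, T=1.0, cool=0.999):
    """Maximize C over bijections by annealed pairwise interchanges (cf. section 4)."""
    N = len(F)
    M = rng.permutation(N)
    cur = C(F, G, M)
    for _ in range(steps):
        i, j = rng.choice(N, size=2, replace=False)
        M[i], M[j] = M[j], M[i]                    # propose swapping two assignments
        new = C(F, G, M)
        if new >= cur or rng.random() < np.exp((new - cur) / T):
            cur = new                              # accept the move
        else:
            M[i], M[j] = M[j], M[i]                # reject: undo the swap
        T *= cool
    return M, cur

# Toy problem: a 3 x 3 grid mapped onto a line of 9 points. With both F and G
# given as dissimilarities, aligned orderings maximize C (the rearrangement
# argument cited above), so high C here means a topographic map.
pts = np.array([(x, y) for y in range(3) for x in range(3)], dtype=float)
F = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # squared grid distances
line = np.arange(9, dtype=float)
G = np.abs(line[:, None] - line[None, :])                # distances along the line
M, score = anneal(F, G)
```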
Under our assumptions, this objective function is identical to the C measure. Expanding out the square in equation 3.1 gives

\[
\sum_{i=1}^{N} \sum_{j} \bigl(F(i,j)^{2} + G(M(i), M(j))^{2} - 2\,F(i,j)\,G(M(i), M(j))\bigr). \tag{3.2}
\]
The last term is twice the C measure. The entries in F are fixed, so the first term is independent of the mapping. In metric MDS, the sum over the G's varies as the entities are moved in the output space. If one instead considers the case where the positions of the points in Vout are fixed, and the problem is to find the assignment of entities to positions that minimizes equation 3.1, then the sum over the G's is also independent of the map. In this case the metric MDS measure becomes exactly equivalent to C.

3.2 Minimal Wiring. In minimal wiring (Mitchison & Durbin, 1986; Durbin & Mitchison, 1990), a good mapping is defined as one that maps points that are nearest neighbors in Vin as close as possible in Vout, where closeness in Vout is measured by, for instance, Euclidean distance raised to some power. The motivation here is the idea that it is often useful in sensory processing to perform computations that are local in some space of input features Vin. To do this in Vout (e.g., the cortex) the images of neighboring points in Vin need to be connected; the dissimilarity function in Vout is intended to capture the cost of the wire (e.g., axons) required to do this. Minimal wiring is equivalent to the C measure for

\[
F(i,j) = \begin{cases} 1 & i, j \text{ neighboring} \\ 0 & \text{otherwise} \end{cases}
\qquad
G(M(i), M(j)) = \|M(i) - M(j)\|^{p}.
\]

For the cases of one- or two-dimensional square arrays investigated in Mitchison and Durbin (1986) and Durbin and Mitchison (1990), neighbors are taken to be just the two or four adjacent points in the array, respectively.

3.3 Minimal Path Length. In this scheme, a good map is one such that, in moving between nearest neighbors in Vout, one moves the least possible distance in Vin. This is, for instance, the mapping required to solve the traveling salesman problem (TSP), where Vin is the distribution of cities and Vout is the one-dimensional tour. This goal is implemented by the elastic net algorithm (Durbin & Willshaw, 1987; Durbin & Mitchison, 1990; Goodhill & Willshaw, 1990), which measures dissimilarity in Vin by squared distances:

\[
F(i,j) = \|v_i - v_j\|^{2},
\qquad
G(p,q) = \begin{cases} 1 & p, q \text{ neighboring} \\ 0 & \text{otherwise}, \end{cases}
\]

where v_k is the position of point k in Vin (we have only considered here the regularization term in the elastic net energy function, which also includes a term matching input points to output points).
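As an illustration of how these two instantiations differ only in which space carries the neighborhood structure, the hypothetical snippet below writes them as plain functions for the grid-to-line example of section 4 (our code; the names and grid size are assumptions). Either pair can be tabulated into matrices and passed to a routine such as the C function sketched earlier:

```python
import numpy as np

grid = np.array([(x, y) for y in range(10) for x in range(10)], dtype=float)  # V_in
# V_out is the line 0..99; p and q below are positions on it.

def F_wiring(i, j):
    """1 for nearest neighbors in the grid, 0 otherwise."""
    return 1.0 if np.abs(grid[i] - grid[j]).sum() == 1.0 else 0.0

def G_wiring(p, q, power=1):
    """Distance along the line raised to a power (a dissimilarity, so minimize C)."""
    return float(abs(p - q)) ** power

def F_path(i, j):
    """Squared Euclidean distance in the grid (a dissimilarity)."""
    return float(((grid[i] - grid[j]) ** 2).sum())

def G_path(p, q):
    """1 for nearest neighbors on the line, 0 otherwise."""
    return 1.0 if abs(p - q) == 1 else 0.0
```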
Thus, minimal wiring and minimal path length are symmetrical cases under equation 2.1, with respect to exchange of Vin and Vout. Their relationship is discussed further in Durbin and Mitchison (1990), where the abilities of minimal wiring and minimal path length are compared with regard to reproducing the structure of the map of orientation selectivity in the primary visual cortex (see also Mitchison, 1995).

3.4 The Approach of Jones, Van Sluyters, and Murphy (1991). Jones, Van Sluyters, and Murphy (1991) investigated the effect of the shape of the cortex (Vout) relative to the lateral geniculate nuclei (Vin) on the overall pattern of ocular dominance columns in the cat and monkey, using an optimization approach. They desired to keep both neighboring cells in each LGN (as defined by a hexagonal array), and anatomically corresponding cells between the two LGNs, nearby in the cortex (also a hexagonal array). Their formulation of this problem can be expressed as a maximization of C when

\[
F(i,j) = \begin{cases} 1 & i, j \text{ neighboring or corresponding} \\ 0 & \text{otherwise} \end{cases}
\]

and

\[
G(p,q) = \begin{cases} 1 & p, q \text{ first or second nearest neighbors} \\ 0 & \text{otherwise}. \end{cases}
\]
For two-dimensional Vin and Vout they found a solution such that if F(i,j) = 1, then G(M(i), M(j)) = 1, for all i, j. Alternatively, this problem could be expressed as a minimization of C when G(p,q) is the stepping distance (see below) between positions in the Vout array. They found this gave appropriate behavior for the problem addressed.

3.5 Minimal Distortion. Luttrell and Mitchison have introduced mapping functionals for the continuous case. Under appropriate assumptions to reduce them to the discrete case, these are equivalent to C. Luttrell (1990, 1994) defined a minimal distortion principle that can be interpreted as a measure of mapping quality. He defined "distortion" D as

\[
D = \int dx\, P(x)\, d\{x, x'[y(x)]\}.
\]

x and y are vectors in the input and output spaces, respectively. y is the map in the input to output direction, x′ is the (in general different) map back again, and P(x) is the probability of occurrence of x. x′ and y are suitably adjusted to minimize D. An augmented version of D includes additive noise in the output space,

\[
D = \int dx\, P(x) \int dn\, \pi(n)\, d\{x, x'[y(x) + n]\},
\]

where π(n) is the probability density of the noise vector n. Intuitively, the aim is now to find the forward and backward maps so that the reconstructed value of the input vector is as close as possible to its original value after being corrupted by noise in the output space. In Luttrell (1990), the d function was taken to be {x − x′[y(x) + n]}². In this case, Luttrell showed that the minimal distortion measure can be differentiated to produce a learning rule that is almost the self-organizing map rule of Kohonen (1982). For the discrete version of this, it is necessary to assume that appropriate quantization has occurred so that y(x) defines a 1-1 map, and x′(y) defines the same map in the opposite direction. In this case the minimal distortion measure becomes equivalent to the C measure, with F(i,j) the squared Euclidean distance between the positions of vectors i and j (assuming these to lie in a geometric space) and G(p,q) the noise process in the output space (generally assumed gaussian). The minimal distortion principle was generalized in Mitchison (1995) by allowing the noise process and reconstruction error to take arbitrary forms. For instance, they can be reversed, so that F is a gaussian and G is Euclidean distance. In this case, the measure can be interpreted as a version of the minimal wiring principle, establishing a connection between minimal distortion (and hence the SOM algorithm) and minimal wiring. This identification also yields a self-organizing algorithm similar to the SOM for solving minimal wiring problems, the properties of which are briefly explored in Mitchison (1995). Luttrell (1994) generalized minimal distortion to allow a probabilistic match between points in the two spaces, which in the discrete case can be expressed as

\[
D = \sum_{i} \sum_{j} \sum_{k} F(i,j)\, P(k \mid i)\, P(j \mid k),
\]
where D is the distortion to be minimized, F is Euclidean distance in the input space, P(k|i) is the probability that output state k occurs given that input state i occurs, and P(j|k) is the corresponding Bayes inverse probability that input state j occurs given that output state k occurs. This reduces to C in the special case that P(k|i) = P(i|k), which is true for bijections.

3.6 Dynamic Link Matching. Bienenstock and von der Malsburg (1987a, 1987b) considered position-invariant pattern recognition as a problem of graph matching. Their formalism consists of two sets of nodes, one with links expressing neighborhood relationships given by F(i,j) and one with analogous links given by G(p,q). The aim is to find the bijective mapping between the two sets of nodes that optimally preserves the link structure.
By approximating the Hamiltonian for a particular type of spin neural network, they derived an objective function H that in our notation (taking p = M(i), q = M(j)) can be written as

\[
H = -\sum_{i,j=1}^{N} \sum_{p,q=1}^{N} F(i,j)\, G(p,q)\, w_{ip} w_{jq}
+ \gamma \left[ \sum_{i=1}^{N} \Bigl( \sum_{p=1}^{N} w_{ip} - a \Bigr)^{2} + \sum_{p=1}^{N} \Bigl( \sum_{i=1}^{N} w_{ip} - a \Bigr)^{2} \right]. \tag{3.3}
\]
Here γ and a are constants, and wip is the strength of connection between node i and node p. The first term is a generalized version of the C measure, which allows a weighted match between all pairs of nodes i and p. The second term enforces the regularization constraint that the sum of the weights from or to each node is equal to a. When a = 1 and γ is large, the minima of H are clearly the same as the maxima of C. More recent work has investigated its potential as a general objective function for topographic mappings (Wiskott & Sejnowski, 1997). 4 Application to a Simple Mapping Problem To show the range of optimal maps that the C measure implies for different F and G, we consider the very simple case of mapping between a regular 10 × 10 square array of points (Vin ) and a regular 1 × 100 linear array of points (Vout ). This is a paradigmatic example of dimension reduction (also used in Durbin & Mitchison, 1990 and Mitchison, 1995). It has the virtue that solutions are easily represented (see Figures 2 and 3) and easily compared with intuition. Calculation of the global optimum for this problem by exhaustive search is clearly impractical, since there are of the order of 100! possible mappings. Instead we employ simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) to find a solution close to optimal. The parameters used are given in the legend to Figure 2. Figure 2 shows the best maps obtained by this method for the minimal path, minimal wiring, and metric MDS versions of the C measure. Also shown for comparison is a solution to this problem produced by the elastic net algorithm, taken from Durbin & Mitchison (1990). An optimal map for the minimal path version (A) is one with no diagonal segments. Our simulated annealing algorithm found one with three diagonal segments and a cost of 100.243, 1.3 percent larger than the optimal of 99.0. The optimal map for the minimal wiring measure has been explicitly calculated (Mitchison & Durbin, 1986), and has a cost of 914.0. Our algorithm found the map shown in (B), which is similar in shape and has a cost of 917.0 (0.3 percent larger than the optimal value). These results confirm that the simulated annealing algorithm employed can find close to optimal solutions. The cost for the
Figure 2: Maps found by optimizing using simulated annealing. (A) Minimal path length solution. (B) Minimal wiring solution. (C) Metric MDS solution. (D) Map found by the elastic net algorithm (see Durbin & Mitchison, 1990). Parameters used for simulated annealing were as follows (for further explanation, see van Laarhoven & Aarts, 1987). Initial map: random. Move set: pairwise interchanges. Initial temperature: 3 × the mean energy difference over 10,000 moves. Cooling schedule: exponential. Cooling rate: 0.998. Acceptance criterion: 1000 moves at each temperature. Upper bound: 10,000 moves at each temperature. Stopping criterion: zero acceptances at upper bound.
metric MDS map (C) is 6,352,324. The illusion of multiple ends to the line is due to the map frequently doubling back on itself. For instance, in the fourth row up the square, the horizontal component for neighboring points in the line progresses in the order 5, 6, 4, 7, 3, 8, 2, 9, 10. Local discontinuity arises because this measure takes into account neighborhood preservation at all scales: local continuity is not privileged over global continuity. The costs of the map produced by the elastic net (D) are 102.3 for the minimal
Figure 3: Minimal distortion solutions. (A) σ = 1.0, cost = 43.3. (B) σ = 2.0, cost = 214.7. (C) σ = 4.0, cost = 833.2. (D) σ = 20.0, cost = 18467.1. Simulated annealing parameters as in Figure 2.
path function, 1140 for the minimal wiring function, and 6,455,352 for the metric MDS function. In each case this is a higher cost than for the maps shown in (A), (B), and (C), respectively. Figure 3 shows close-to-optimal maps for the minimal distortion version of the C measure (F is squared Euclidean distance in the square, G is given by \(e^{-d^{2}/\sigma^{2}}\), where d is Euclidean distance along the line), for varying σ. For small σ, the solution resembles the minimal path optimum of Figure 2A, since the contribution from more distant neighbors on the line compared to nearest neighbors is negligible. However, as σ increases, the map changes form. Local continuity becomes less important compared to continuity at the scale of σ, the map becomes spikier, and the number of large-scale folds in the map gradually decreases until at σ = 20 there is just one.
This last map also shows some of the frequent doubling back behavior seen in Figure 2C.¹

5 Discussion

5.1 Types of Mappings. Several qualitatively different types of map are apparent in Figures 2 and 3. For the metric MDS measure, Figure 2C, the line progresses through the square by a series of locally discontinuous jumps along one axis of the square and a comparatively smooth progression in the orthogonal direction. One consequence is that nearby points in the square never map too far apart on the line. For the minimal distortion measure with σ = 20 (see Figure 3D), the strategy is similar except that there is now one fold to the map. This decreases the size of local jumps in following the line through the square, at the cost of introducing a seam across which nearby points in the square map a very long way apart on the line. For the minimal distortion measure with small σ (see Figures 3A–C), the strategy is to introduce several folds. This gives strong local continuity, at the cost that now many points that are nearby in the square map to points that are far apart on the line. What do these maps suggest about which measures are most appropriate for different problems? If it is desired that generally nearby points should always map to generally nearby points as much as possible in both directions, and one is not concerned about very local continuity, then something like the MDS measure is useful. This may be appropriate for some data visualization applications where the overall structure of the map is more important than the fine detail. If, on the other hand, one desires a smooth progression through the output space to imply a smooth progression through the input space, one should choose something like minimal path or minimal distortion for small σ. This may be important for data visualization where it is believed the data actually lie on a lower-dimensional manifold in the high-dimensional space. However, an important weakness of this representation is that some neighborhood relationships between points in the input space may be completely lost in the resulting representation. For understanding the structure of cortical mappings, self-organizing algorithms that optimize objectives of this type have proved useful (Durbin & Mitchison, 1990; Obermayer, Blasdel, & Schulten, 1992). However, very few other objectives have been applied to this problem, so it is still an open question which are most appropriate. The type of map seen in Figure 3D has not been much investigated. There may be some applications for which such maps are worthwhile, perhaps in a neurobiological context for understanding why different input variables are sometimes mapped into different areas rather than interdigitated in the same area.
¹ We have also explicitly optimized the objective functions proposed by Sammon (1969) and Bezdek and Pal (1995). The former produces a map similar to that of Figure 2C, while the latter resembles Figure 3D (for more details, see Goodhill & Sejnowski, 1996).
5.2 Relation of C to Quadratic Assignment Problems. Formulating neighborhood preservation in terms of the C measure sets it within the well-studied class of quadratic assignment problems (QAPs). These occur in many different practical contexts and take the form of finding the minimal or maximal value of an equation similar to C (see Burkard, 1984, for a general review and Lawler, 1963, and Finke, Burkard, & Rendl, 1987, for more technical discussions). An illustrative example is the optimal design of typewriter keyboards (Burkard, 1984). If F(i,j) is the average time it takes a typist to sequentially press locations i and j on the keyboard, while G(p,q) is the average frequency with which letters p and q appear sequentially in the text of a given language (note that in this example F and G are not necessarily symmetrical), then the keyboard that minimizes average typing time will be the one that minimizes the product

\[
\sum_{i=1}^{N} \sum_{j=1}^{N} F(i,j)\, G(M(i), M(j))
\]
(see equation 2.1), where M(i) is the letter that maps to location i. The substantial amount of theory developed for QAPs is directly applicable to the C measure. As a concrete example, QAP theory provides several different ways of calculating bounds on the minimum and maximum values of C for each problem. This could be very useful for the problem of assessing the quality of a map relative to the unknown best possible. One particular case is the eigenvalue bound (Finke et al., 1987). If the eigenvalues of symmetric matrices F and G are λi and µi, respectively, such that λ1 ≤ λ2 ≤ ··· ≤ λn and µ1 ≥ µ2 ≥ ··· ≥ µn, then it can be shown that \(\sum_i \lambda_i \mu_i\) gives a lower bound on the value of C. QAPs are in general known to be of complexity NP-hard. A large number of algorithms for both exact and heuristic solution have been studied (see, e.g., Burkard, 1984; the references in Finke et al., 1987; and Simić, 1991). However, particular instantiations of F and G may allow efficient algorithms for finding good local optima, or alternatively may beset C with many bad local optima. Such considerations provide an additional practical constraint on what choices of F and G are most appropriate.
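As a sketch of how this bound can be computed (our code, directly following the statement above; the function name is hypothetical):

```python
import numpy as np

def eigenvalue_lower_bound(F, G):
    """Pair the ascending eigenvalues of symmetric F with the descending
    eigenvalues of symmetric G; their inner product lower-bounds C over
    all bijections (eigenvalue bound of Finke et al., 1987, as stated above)."""
    lam = np.sort(np.linalg.eigvalsh(F))        # ascending eigenvalues of F
    mu = np.sort(np.linalg.eigvalsh(G))[::-1]   # descending eigenvalues of G
    return float(lam @ mu)
```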
5.3 Many-to-One Mappings. In many practical contexts, there are many more points in Vin than Vout, and it is necessary also to specify a many-to-one mapping from points in Vin to N exemplar points in Vin, where N = number of points in Vout. It may be desirable to do this adaptively while simultaneously optimizing the form of the map from Vin to Vout. For instance, shifting a point from one cluster to another may increase the clustering cost, but by moving the positions of the cluster centers decrease the sum of this and the continuity cost. The elastic net, for instance, trades off these two contributions explicitly with a ratio that changes during the minimization, so that finally each cluster contains only one point and the continuity cost dominates (Durbin & Willshaw, 1987; for discussions, see Simić, 1990, and Yuille, 1990). The self-organizing map of Kohonen (1982) trades off these two contributions implicitly. Luttrell (1994) discusses allowing each point in the input space to map to many in the output space in a probabilistic manner, and vice versa. The H function (Bienenstock & von der Malsburg, 1987a, 1987b) offers an alternative way of introducing a weighted match between input and output points.

Acknowledgments

We thank Steve Finch for very useful discussions at an earlier stage of this work. We are grateful to Graeme Mitchison, Steve Luttrell, Hans-Ulrich Bauer, and Laurenz Wiskott for helpful discussions and comments on the manuscript. Research was supported by the Sloan Center for Theoretical Neurobiology at the Salk Institute and the Howard Hughes Medical Institute.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220:671–680.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern. 43:59–69.
Lawler, E. L. (1963). The quadratic assignment problem. Management Science 9:586–599.
Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Trans. Neural Networks 1:229–232.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation 6:767–794.
Mitchison, G. (1995). A type of duality between self-organizing maps and minimal wiring. Neural Computation 7:25–35.
Mitchison, G., & Durbin, R. (1986). Optimal numberings of an N × N array. SIAM J. Alg. Disc. Meth. 7:571–581.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45:7568–7589.
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 18:401–409.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting and clustering. Science 210:390–398.
Simić, P. D. (1990). Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Network 1:89–103.
Simić, P. D. (1991). Constrained nets for graph matching and other quadratic assignment problems. Neural Computation 3:268–281.
Torgerson, W. S. (1952). Multidimensional scaling, I: Theory and method. Psychometrika 17:401–419.
van Laarhoven, P. J. M., & Aarts, E. H. L. (1987). Simulated annealing: Theory and applications. Dordrecht: Reidel.
Wiskott, L., & Sejnowski, T. J. (1997). Objective functions for neural map formation (Tech. Rep. INC-9701). Institute for Neural Computation.
Young, F. W. (1987). Multidimensional scaling: History, theory, and applications. Hillsdale, NJ: Erlbaum.
Yuille, A. L. (1990). Generalized deformable models, statistical physics, and matching problems. Neural Computation 2:1–24.
Received February 1, 1996; accepted October 28, 1996.
Communicated by Klaus Obermayer
Faithful Representation of Separable Distributions

Juan K. Lin
David G. Grier
Department of Physics, University of Chicago, Chicago, IL 60637, U.S.A.

Jack D. Cowan
Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.
A geometric approach to data representation incorporating information-theoretic ideas is presented. The task of finding a faithful representation, where the input distribution is evenly partitioned into regions of equal mass, is addressed. For input consisting of mixtures of statistically independent sources, we treat independent component analysis (ICA) as a computational geometry problem. First, we consider the separation of sources with sharply peaked distribution functions, where the ICA problem becomes that of finding high-density directions in the input distribution. Second, we consider the more general problem for arbitrary input distributions, where ICA is transformed into the task of finding an aligned equipartition. By modifying the Kohonen self-organized feature maps, we arrive at neural networks with local interactions that optimize coding while simultaneously performing source separation. The local nature of our approach results in networks with nonlinear ICA capabilities.

1 Introduction

Barlow, Linsker, Atick, Redlich, and Field have all noted the primary importance of information theory in modeling sensory coding (Barlow, 1989; Linsker, 1989; Atick & Redlich, 1990; Field, 1994). Recently, Bell and Sejnowski (1995a, 1995b) developed an information theory-based algorithm for blind source separation. In this article, a simple synthesis of current research on coding and processing is presented by geometrically incorporating information-theoretic ideas into self-organized feature maps.

Ritter and Schulten (1986) demonstrated that, contrary to intuition, the steady state of Kohonen's self-organized feature maps (SOFM) (Kohonen, 1995) is not one in which the density of neural units is proportional to the input density. In one dimension, they showed that the Kohonen net underrepresents high-density and overrepresents low-density regions. Intuitively, an ideally formed input representation should have the density of the neural (output) units be strictly proportional to the input density. Loosely speaking, this is what is meant by a "faithful representation."
Past faithful representation SOFMs of Linsker (1989) and DeSieno (1988) have relied on a form of memory to track the cumulative activity of individual units, which the authors termed "historical information" and "conscience," respectively. Recently, Bauer, Der, and Herrmann (1996) used an adaptive step size to accomplish the task. This article continues along the same path and presents modifications to both the update rule and the partition to allow the SOFM to perform blind signal processing.

Section 2 provides preliminary definitions. Two modifications of the one-dimensional Kohonen SOFM are presented in section 3, which allow for an arbitrary power law relationship between input (data) and output (representation) densities. Section 4 generalizes the faithful representation SOFMs to higher dimensions for independent component analysis. The implications for source separation and optimal coding via the representations are made clear with examples.

2 Preliminaries

We begin with some preliminary definitions and background results. Given an input set {ξ} with ξ_j in the input vector space V, we say that {ξ} is a separable (input) distribution if it is constructed from nonsingular mixtures of statistically independent scalar sources. So {ξ} = {As}, where A is a nonsingular matrix, and the components of s are all statistically independent. Thus the joint probability factorizes: P(s) = ∏_k P_k(s_k). By a representation of {ξ}, we mean a collection {Ω(w)} of regions in V, Ω(w_i) ⊂ V. The unit w_i is said to represent ξ_j if ξ_j ∈ Ω(w_i). A nonoverlapping representation is defined to be one in which the regions are pairwise disjoint: Ω(w_i) ∩ Ω(w_i′) = ∅ for all pairs i ≠ i′ of regions in the representation. A covering is a representation where V = ∪_i Ω(w_i). So the entire input vector space V is represented in a covering, and any element in V is represented by exactly one unit in a nonoverlapping covering (partition). Focusing on nonoverlapping coverings simplifies things because probability is already normalized, ∑_i P(w_i) = 1. The assignment of probability in overlapping regions does not have to be considered. For a nonoverlapping covering {Ω(w)} of {ξ}, P(w|ξ) = 1 if w represents ξ, and 0 otherwise. The mutual information H(ξ; W) = H(W) + ∑ P(ξ, w) log P(w|ξ) = H(W). This article deals exclusively with nonoverlapping coverings.

Although equipartition is a better mathematical term, to incorporate some intuition into our terminology, we use the term faithful representation for a nonoverlapping covering in which the probability measures of all the regions are equal. Thus, faithful representations evenly partition the input distribution and maximize the mutual information between the input data and the representation. With M units in the faithful representation, the mutual information of the system is H(ξ; W) = H(W) = log M, the capacity of a noiseless channel (Shannon & Weaver, 1949). In the continuum limit, a faithful input representation's local density of units is proportional to the local input density. We now describe two local algorithms for obtaining faithful representations.
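As a quick numerical illustration (ours, not the authors'), the log M capacity of a faithful representation can be seen by building an equal-mass partition from quantiles; the exponential input density below is an arbitrary stand-in for any scalar input distribution.

```python
import numpy as np

# Sketch: an equal-mass partition of a scalar input attains the
# channel capacity log M.  The quantile-based partition here is an
# idealized stand-in for the learned one.
rng = np.random.default_rng(1)
xi = rng.exponential(size=100_000)       # any scalar input density
M = 8
edges = np.quantile(xi, np.linspace(0, 1, M + 1))
counts, _ = np.histogram(xi, bins=edges)
p = counts / counts.sum()                # region masses, all close to 1/M
H = -np.sum(p * np.log(p))               # H(W) = mutual information
print(H, np.log(M))                      # nearly equal
```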
3 Modifications of the One-Dimensional SOFM

For simplicity, we start with the one-dimensional case, V = ℝ. In a locally translationally invariant network with unit positions {w} receiving scalar input ξ, the Kohonen update rule is given by

Δw_i = κ Λ(i* − i)(ξ − w_i),    (3.1)
where the "winning" unit i* associated with input ξ is

i* = {i : |w_i − ξ| ≤ |w_j − ξ| ∀ j},    (3.2)
κ is the learning rate, and the local neighborhood function Λ(i* − i) incorporates the network topology. With care taken with respect to the midpoints between adjacent units, the Kohonen net is a dynamical nonoverlapping covering in our terminology. Here w_i gives the position of the "neural" unit and Ω(w_i) its "receptive field"—defined to be the region in V that it represents. From a set of points {w}, equation 3.2 partitions V according to the nearest-neighbor Voronoi tessellation (see, e.g., Preparata & Shamos, 1988). In this learning rule, the change in position of a unit is proportional to its distance to the data point. Units close to the winning unit are dragged along according to the predefined neighborhood function Λ(i* − i), which is taken to be a gaussian with width σ. For a more general update rule, with ε = i* − i, we scale the learning rule as follows:

Δw_i = κ Λ(ε)(ξ − w_i)|ξ − w_i|^(n−1).    (3.3)
The extra weighting term allows for a variable relation between the input data and output representation densities. Notice the n = 0 form discards the distance to the input data and relies only on a sign term. Consequently it is a purely digital update rule, where w_i can only change by discrete values. Pursuing this idea further, we propose a second modification as follows:¹

Δw_i = κ Λ(ε) · sgn(ε) · |w_i* − w_i|^n.    (3.4)
In order to extend our results to the boundaries, we also complement equation 3.4 with a movement,

Δw_i* = ∑_i Δw_i,    (3.5)

of the winning unit.
¹ The sign function sgn(i) takes on a value of 1 for i > 0, 0 for i = 0, and −1 for i < 0.
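To make the modified rules concrete, here is a minimal Python sketch (ours, not the authors' code) of a single learning step under equations 3.2 through 3.5. The gaussian neighborhood follows the simulations described in the Figure 1 caption; the annealing schedules and the periodic sorting of units are omitted.

```python
import numpy as np

def sofm_step(w, xi, kappa, sigma, n, variant="3.3"):
    """One update of the modified 1-D SOFM (equations 3.2-3.5).

    w : 1-D array of unit positions, xi : scalar input sample.
    variant "3.3" uses the input-distance weighting of equation 3.3;
    variant "3.4" uses the digital rule of equation 3.4, with the
    winner moved according to equation 3.5.
    """
    i = np.arange(len(w))
    i_star = int(np.argmin(np.abs(w - xi)))    # winner, equation 3.2
    eps = i_star - i
    lam = np.exp(-eps**2 / (2.0 * sigma**2))   # neighborhood Lambda(eps)
    if variant == "3.3":
        # For n <= 0 this weight diverges as xi -> w_i; the Figure 1
        # simulations cap the n = -1 term for stability.
        dw = kappa * lam * (xi - w) * np.abs(xi - w) ** (n - 1)
    else:
        dw = kappa * lam * np.sign(eps) * np.abs(w[i_star] - w) ** n
        dw[i_star] = dw.sum()                  # winner move, equation 3.5
    return w + dw
```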
The position of the input ξ does not directly enter in the n = 0 form of this variant. The network responds only according to its partition of the input space. Following the continuum limit analysis of Ritter and Schulten (1986), to order O(ε²), the steady-state configuration of both of the above modified SOFMs is (see appendix A)
(w(x)′)^(−1) ∝ [P(w)]^(2/(n+2)),    (3.6)
where (w(x)′)^(−1) ≡ D(w) is the density of the output units. The Kohonen algorithm is comparable to the n = 1 case, with the well-known Ritter and Schulten D(w) ∝ P(w)^(2/3) result for the steady state in which the network undersamples high-density and oversamples low-density regions. If −2 < n < 0, the network oversamples high-density regions, but the most interesting case occurs for n = 0 when D(w) ∝ P(w). In this case, the network forms a faithful representation of the input distribution.

We performed simple Monte Carlo simulations demonstrating the power law behavior for both update rules. The results are shown in Figure 1. For the n = 0 digital learning rules, the second variant (see equations 3.4 and 3.5) is not very good at untangling folds in the network because the position of the input has been discarded. Numerically, instead of algorithmically imposing a no-folding constraint, we took a shortcut by sorting and relabeling the units periodically. The second learning rule is purely a counting algorithm for achieving the equipartition steady state, so it must rely on the fact that the proper topology has already been formed. The first variant has the advantage of retaining the direction of ξ relative to the network units, necessary for the formation of the proper topology. However, this advantage comes at the cost of losing the pure counting aspect, which is essential for exact equipartition.

Both n = 0 faithful representation learning rules are digital. This is interesting from both biological and computational perspectives; with the addition of an inhibitory population, the update rule can be implemented in a network with only binary communication between its elements!

4 Faithful Representation Networks for Independent Component Analysis

4.1 Branch Net for Finding High-Density Regions. Mixtures of sources with sharply peaked distribution functions will contain high-density directions corresponding to the independent component axes. Blind separation can be performed rapidly for this case in a net with one-dimensional branch topology, the conventional Voronoi partition, and the generalization of the
Figure 1: Plot of log w(x) versus log(x) for P(w) ∝ w, showing power law behavior in a network with 100 units. The networks in the top, middle, and bottom plots were trained with n = 1, n = 0, and n = −1 learning rules, respectively. Equations 3.4 and 3.5 were used for the n = 1 and n = 0 simulations; equation 3.3 was used for the n = −1 net. The three lines plotted are for log(w) ∝ (3/5) log(x), (1/2) log(x), and (1/3) log(x), corresponding to D(w) ∝ P(w)^(2/3), P(w), and P(w)², respectively. Forty thousand data points were used in the n = 1 and n = 0 simulations; the units were sorted and relabeled every 5000 data points. For n = 1, κ = .005, σ = 10 for the first 20,000 points, and κ = .1, σ = 2 for the remaining 20,000 points. For n = 0, κ = .001; σ = 10 for the first 20,000 points, and σ = 5 for the remaining 20,000 points. For n = −1, to reduce the destabilizing effect of the singularity, the |ξ − w_i|^(−1) term was capped at 1000, and the units were sorted every 1000 data points. The learning rate κ = .00002. For the first 30,000 points, σ = 20; then it was reduced to σ = 5 for 10,000 points.
n = 0 update rule given in equation 3.3:²

Δw_i = κ Λ(ε) · sgn(ξ − w_i).    (4.1)
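A minimal sketch of this componentwise rule follows (our illustration); the branch_dist function, which measures distance to the winner along the branch structure, is assumed to be supplied by the caller.

```python
import numpy as np

def branch_step(w, xi, kappa, sigma, branch_dist):
    """One update of the branch net, equation 4.1 (a sketch).

    w : (units, dim) array of unit positions; xi : input vector.
    branch_dist(i_star, i) must return the distance between units
    i_star and i along the branch topology.
    """
    i_star = int(np.argmin(np.linalg.norm(w - xi, axis=1)))  # nearest unit
    eps = np.array([branch_dist(i_star, i) for i in range(len(w))])
    lam = np.exp(-eps**2 / (2.0 * sigma**2))  # gaussian neighborhood
    # sgn acts componentwise on the vector xi - w_i (see footnote 2)
    return w + kappa * lam[:, None] * np.sign(xi - w)
```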
Numerically, we performed source separation and coding of two mixed signals in a net with the topology of two cross-linked branches (see Figure 2). The elements of the mixing matrix A were chosen randomly with uniform distribution between 1 and −1. The input was first prewhitened following Bell and Sejnowski (1995b) to reduce the amount of dilation and shearing required of the branch net. A typical simulation is shown in Figure 2. As seen in the figure, the branch net rotates very well, and since prewhitening tends to orthogonalize the independent component axes, much of the processing that remains after whitening is rotation to find the independence axes. The net quickly zeroes in on the independent component axes with little annealing. The neighborhood function is taken to be gaussian, where ε is the distance to the winning unit along the branch structure.

To demonstrate the generality of our approach, we show a faithful representation of a nonlinear mixture. Because our network has local dynamics, with enough units, the network can follow the curved independent component contours of the input distribution. The result is shown in Figure 3. With nonlinear mixtures, however, the transformation must be one-to-one in the region of interest. To unmix the input, the independent component contours can be found by using parametric least-squares fit to the branch contours. For the source separation in Figure 2, inserting the axes directions into an unmixing matrix W, we find a reduction of the amplitudes of the mixtures to less than 1 percent of the signal.³ This is typical of the quality of separation obtained in our simulations. Alternatively, taking full advantage of the input representation formed by the net, an approximation of the independent sources can be constructed from the positions in V of the winning unit (w_i*). Or, as we show in Figure 4 for the nonlinear separation, the cell labels i* can be used to give a pseudowhitened signal approximation. Since there is only one winning unit along one branch, only one output channel is active at any given time. For sharply peaked source distributions such as speech, this does not hinder the fidelity of the signal approximation much since the input sources hover around zero most of the time. This property also has the potential for utilization in compression. However, for a full, rigorous whitened source representation, we must turn to a network with a topology that matches the dimensionality of the input data.
² Here the sign function acts componentwise on the vector.
³ For the separation shown in Figure 2, WA = [1, 0.0068; 0.0054, 1], where the largest elements of each row have been normalized to one.
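The text states only that the input was prewhitened before training; the sketch below (our addition) shows one standard way to do this, via PCA-based (ZCA) whitening to zero mean and unit covariance, and is not necessarily the exact procedure of Bell and Sejnowski (1995b).

```python
import numpy as np

def prewhiten(X):
    """Zero-mean, unit-covariance ("whitened") version of the data.

    X : (samples, dim) array of input vectors.  ZCA whitening is one
    common choice; any transform with identity output covariance
    serves the same purpose here.
    """
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # ZCA whitening matrix
    return Xc @ W.T
```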
Figure 2: Independent component analysis by a branch architecture network. The two sources are audio files containing spoken Japanese phrases. The input distribution was prewhitened first. The branches initially lie along the ξ_1 = ξ_2 and ξ_1 = −ξ_2 directions and quickly spiral around to align with the independent component axes (dashed lines). The net is trained on 4000 randomly selected data points. For the annealing, κ = .0007; σ = 5 for the first 2000 points and σ = 2 for the remaining 2000 points. The configuration of the net is shown after every 200 data points, with the final unit positions marked with dots.
4.2 Aligned (M, N) Equipartition Net. The simplest equipartition of a separable distribution is the trivial partition generated from the independent component axes, which decouples the N-dimensional problem into N one-dimensional ones. This partition consists of N distinct hyperplane orientations, with cuts by, say, M parallel hyperplanes along each of the N orientations. We define this to be an (M, N) partition. Thus, the MN hyperplanes partition the input vector space into (M + 1)^N regions. Our goal is to achieve the trivial equipartition using an SOFM with hypercube architecture. Here M + 1 is the number of units per dimension in the hypercube. For an (M, N) equipartition, since the number of constraints grows exponentially in N, while the number of degrees of freedom to define the MN hyperplanes grows only quadratically in N, with enough units per dimension
Figure 3: Input representation of a nonlinear mixture. The mixture is given by ξ_1 = sgn(s_2) · s_2² + s_1 − .2 s_2, ξ_2 = sgn(s_1) · s_1² − .4 s_1 + s_2. The network is trained on 6000 data points, with the position of the units shown every 200 data points in the figure. The learning rate κ = .0008; σ = 5 for the first 2000 points, σ = 2 for the next 2000 points, and σ = 1 for the remaining 2000 points. The independent component contours are drawn in the dashed lines. The units along the branches conform nicely to the statistically independent contours.
in the hypercube, the desired trivial equipartition will be the only (M, N) equipartition. Currently we believe as few as three units per dimension (M = 2) suffice for uniqueness.

In order to achieve this independent component analysis (ICA) equipartition, the learning rule must be extended to many dimensions, and the definition of the partition needs to be modified. The second faithful representation learning rule is generalized to multidimensions as follows. With V = ℝ^N,

Δw_i = κ Γ(ε),    (4.2)

where Γ(ε) = −Γ(−ε). More generally, the winning unit is moved according to

Δw_i* = ∑_i Δw_i,    (4.3)

where Λ(ε) = Λ(−ε).    (4.4)

Figure 4: Input representation achieved by the branch net in Figure 3. (a, b) Intensity versus time plots of sections of the two independent input sources (Japanese phrases). (c, d) Winning unit i* versus time along the corresponding branches of the net.
Instead of extending the continuum limit analysis for distributions that factorize, we show that a finite discrete network with the dynamics given above has equipartition as a steady state. Let q_i be the probability measure
of unit i. For the steady state, we require

⟨Δw_k⟩ = 0
       = ∑_i q_i · Λ(i − k) · sgn(i − k) + q_k ∑_i Λ(k − i) · sgn(k − i)
       = ∑_i (q_i − q_k) · Λ(i − k) · sgn(i − k),

for all units k.
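As an aside, the steady-state condition can be probed numerically. The sketch below (ours; one-dimensional chain, arbitrary even kernel, hypothetical parameters) evaluates the drift for a given set of unit measures q_i and shows that it vanishes at equipartition.

```python
import numpy as np

# Numerical check that equipartition solves the steady-state equation
# above, for any even neighborhood kernel on a 1-D chain (the 2-D grid
# of appendix B is analogous).
L = 10
q = np.full(L, 1.0 / L)                 # equipartition: q_i = 1/L
kernel = lambda d: np.exp(-abs(d))      # an arbitrary even kernel
for k in range(L):
    drift = sum((q[i] - q[k]) * kernel(i - k) * np.sign(i - k)
                for i in range(L))
    assert abs(drift) < 1e-12           # <Delta w_k> vanishes
```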
By inspection, equipartition, where q_i = q_k₀ for all units i, is a solution to the equation above. In a two-dimensional rectangular grid network with a range of one for Λ(ε) as measured with the city block metric, it can be shown that equipartition is the only steady state of the dynamics (see appendix B). However, it remains to analyze more generally the conditions under which this uniqueness property holds.

Although we have shown that the network has a very desirable steady state, our desired ICA equipartition is not a proper Voronoi partition except when the independent component axes are orthogonal. To obtain the ICA aligned equipartition, it is necessary to modify the definition of the winning unit i*. Let the input data be given by {ξ} = {As}, where s is the source vector and A the nonsingular mixing matrix. Let

Ω(w_i) = {ξ ∈ ℝ^N : . . .}    (4.5)

be the winning region of the unit at w_i. Since an equipartition invariant to simultaneous linear transformations of the input and representation is desired, we require that

{AΩ(w)} = {Ω(Aw)},    (4.6)

that is, Ω is equivariant under the action of A (see, e.g., Golubitsky, Stewart, & Schaeffer, 1988). This implies that if w represents ξ, then Aw represents Aξ. The Voronoi tessellation resulting from the nearest-neighbor definition of the winning unit (see equation 3.2) satisfies this condition only for rotations and reflections, not nonsingular transformations in general. In two dimensions, we modify the tessellation by dividing up a primitive cell among its constituent units along lines joining the midpoints of the sides. For a primitive cell composed of units at a, b, c, and d, the region of the primitive cell represented by a is the simply connected polygon defined by vertices at a, (a + b)/2, (a + d)/2, and (a + b + c + d)/4. More generally, in N dimensions, the 2^N vertices of the winning polytope region are made up of \binom{N}{j}
Figure 5: Partition of the primitive cell by the four constituent units. (a) Voronoi partition. (b) Modified equivariant partition.
averages of 2^j unit positions, for all j from 0 to N.⁵ The two partitions are contrasted in Figure 5. Our modified partition satisfies equation 4.6 for all nonsingular linear transformations since they are continuous and map polygons to polygons. Note, however, that this partition is valid only for networks with hypercube architecture and defined in the interior of the hypercube. The region outside the hypercube is partitioned by careful extrapolation of the boundary unit partitions.

The learning rule given above was shown to have an equipartition steady state. It remains, however, to align the partitions so that it becomes a valid (M, N) partition. The addition of a local anisotropic coupling that physically, in analogy to elastic nets, might correspond to a bending modulus along the network's axes will tend to align the partitions and enhance convergence to the desired steady state. The algorithmic details are given in appendix C.

Numerically, we trained a two-dimensional 9 × 9 network to represent linear mixtures of two sources faithfully. The result is shown in Figure 6. Using least-squares linear fit to find the independent component axes, the mixture amplitudes are found to be approximately 1 percent of the signal.⁶ Geometrically, the units arrange themselves so that each unit has an equal probability of winning, with the axes of the net aligned with the independent component axes. From a coding perspective, an optimal quantization grid is formed by the network. With the ICA equipartition in place, the components of the winning unit i* give whitened approximations of the independent sources.
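As an illustration (ours, not from the original), the two-dimensional cell construction just described, and its equivariance per equation 4.6, can be written out directly; the mixing matrix reused from footnote 6 is incidental.

```python
import numpy as np

def cell_region(a, b, c, d):
    """Vertices of the polygon representing unit a inside the primitive
    cell (a, b, c, d), per the modified partition described above.

    a..d : 2-D unit positions, with b and d adjacent to a and c
    opposite.  Equivariance under any nonsingular linear map follows
    because each vertex is a fixed linear combination of the units.
    """
    a, b, c, d = map(np.asarray, (a, b, c, d))
    return np.array([a, (a + b) / 2, (a + b + c + d) / 4, (a + d) / 2])

# Equivariance check: mapping the cell by A maps the region by A.
A = np.array([[0.06, 0.76], [0.43, 0.13]])   # mixing matrix of footnote 6
cell = [np.array(p, float) for p in ([0, 0], [1, 0], [1, 1], [0, 1])]
mapped_first = cell_region(*[A @ p for p in cell])
first_mapped = np.array([A @ v for v in cell_region(*cell)])
assert np.allclose(mapped_first, first_mapped)
```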
⁵ Recall 2^N = ∑_{j=0}^{N} \binom{N}{j}.

⁶ The mixing matrix used in Figure 6 is A = [0.06, 0.76; 0.43, 0.13]. Using least-squares linear fit for W, we find WA = [1, 0.0041; −0.016, 1].
Figure 6: Plot in the ξ coordinate frame of the input distribution, independent component axes (dashed lines), and representation formed by the network. The units of the network are located at the grid points of the wire mesh. Initially the units were equally spaced and aligned with the ξ̂_1 and ξ̂_2 directions. A gaussian with σ = 8 was used for the neighborhood function Λ(ε), and the learning rate was κ = .00015. The network is shown after presentation of 4000 randomly chosen points in the input data set. Faithful sampling and alignment of the network along the independent component axes can be seen.
For compression, given a network with α units, if the elements in {ξ} can be in any one of β states, the compression factor achieved by the net's input representation is log α / log β. This is just the log ratio of the input and representation state-space sizes. If further data compression is desired, the activity of units closer than a threshold need not be differentiated.⁷ Often the threshold is set at a characteristic noise or sensitivity level. This practice, standard in the field of wavelet decomposition, can lead to highly compact information coding, especially for sharply peaked input source distributions (see, e.g., Daubechies, 1992; Koornwinder, 1993).
⁷ Distance as measured in the input vector space V.
5 Discussion

There are a few important differences between the (M, N) equipartition network and past SOFMs. First, the Voronoi partition has been replaced with an equivariant partition specially designed for separable input distributions. Second, in contrast to previous SOFMs, direct dependence on ξ and w_i has been extracted out. Because the equipartition learning rule (see equations 4.3 and 4.4) is purely in terms of the network's internal architecture Λ(ε), it learns from its own coarse response to stimuli. This is an important aspect of our algorithm, though much work remains to be done on the analysis of these digital learning rules. Also, the question of stability of the steady state, which involves both the digital learning rules and the analog alignment algorithms, needs to be addressed.

The folding problem alluded to earlier is closely tied to this digital counting aspect of our algorithm. For the square blind separation problem where the number of sources is equal to the number of sensors, by starting with a network with no folds and the right topology, the problem can be remedied by the addition of an algorithm that prevents folding. However, a potentially more serious problem arises from the indirectness of defining a partition from the SOFM units. Folds in a network are actually required to define certain partitions by the SOFM units. In practice, this should not pose a serious problem since only a few units per dimension are required for ICA.

For separable input distributions, the (M, N) faithful representation network performs a nonparametric histogram density estimation while separating out statistically independent sources. In contrast to the nonlocal parametric blind separation work of Bell and Sejnowski (1995a, 1995b), this source separation approach is very general in the sense that no a priori assumptions about the individual source probability density distributions are made. They need not be symmetric, sharply peaked, unimodal, or even continuous. Furthermore, because of the local nature of the network, aside from being more robust, nonlinear mixtures of the sources can also be properly represented.

From a coding perspective, intuitively, the most efficient encoding must require an understanding of the input. Arguably, conventional wavelet decompositions encode without processing, since a fixed scaling function φ and a derivative mother wavelet ψ are used to encode local mean and detail information, respectively (see, e.g., Daubechies, 1992; Koornwinder, 1993). In neural network terminology, wavelet decomposition is strictly feedforward, while the lateral coupling in our network allows for higher-order processing. Because of the uniqueness of the (M, N) equipartition for separable distributions, faithful representation coding is coupled with blind source separation. In our approach, local coding and processing combine to generate globally optimal coding and processing. In a sense, the units in our network are local independence and activity detectors.
6 Conclusion

The organization principle of maximizing the entropy of a system's response to stimuli ensures that on average there are no overactive or idle units in the network. This demand-based resource allocation principle underlies many of the organizational decisions we make but also provides an organizational basis for coding and processing on a microscopic level. By mixing in elements from statistics, information theory, and self-organized feature maps, we find networks with local dynamics that possess quite general coding and processing capabilities. This approach unifies processing and coding in the sense that the encoding formed is optimized to process previously encountered stimuli. In a network in which the representations are dynamically formed, no distinction is necessary between an early developmental and a late processing stage.

Appendix A: One-Dimensional Continuum Limit Analysis

For the update rule given in equations 3.4 and 3.5, we set the ensemble average ⟨Δw⟩ to zero and go to the continuum limit with the net parameterized by the continuous variable x. Let P(ξ) be the probability density function of the input, which we take to be differentiable. Taylor expanding about ε = x* − x, and keeping terms of order O(ε²), we need to solve

⟨Δw⟩ = 0
     = ∫ Λ(ε) ε |w(x*) − w(x)|^n P(ξ) dξ
     ≈ ∫ Λ(ε) ε |ε w′|^n (1 + n ε w″ / (2 w′)) (P + ε w′ D_w P)(w′ + ε w″) dε,

where

w′ = dw(x)/dx |_x,   and   D_w P = dP(w(x))/dw |_{w(x)}.

Taking Λ(ε) to be an even function, all odd powers of ε vanish. The differential equation that remains is

(D_w P / P) w′ = −((n + 2)/2) (w″/w′).    (A.1)

Its solution is given by

(w(x)′)^(−1) ∝ [P(w)]^(2/(n+2)).    (A.2)
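The power law can be sanity-checked numerically (our addition): for P(w) ∝ w, equation A.2 gives w(x) ∝ x^((n+2)/(n+4)), which is the x^(3/5) law of Figure 1 when n = 1. The finite-difference sketch below verifies that this profile satisfies equation A.1.

```python
import numpy as np

# Finite-difference check of equation A.1 for P(w) proportional to w,
# whose solution by equation A.2 is w(x) = x**((n+2)/(n+4)).
for n in (1, 0, -1):
    x = np.linspace(0.2, 0.8, 2001)
    h = x[1] - x[0]
    w = x ** ((n + 2) / (n + 4))
    w1 = np.gradient(w, h)                  # w'
    w2 = np.gradient(w1, h)                 # w''
    lhs = (1.0 / w) * w1                    # D_w P / P = 1/w here
    rhs = -((n + 2) / 2.0) * w2 / w1
    assert np.allclose(lhs[5:-5], rhs[5:-5], rtol=1e-3)
```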
The calculation for the other learning rule is almost identical and leads to the same result.

Appendix B: Uniqueness of Equipartition Steady State

Here a sketch of the proof is provided for the uniqueness of the equipartition steady state in a two-dimensional rectangular grid network. The range of Λ(ε) as measured with the city block metric is one. The labels for the probability measure of the top two rows of the network are taken to be q_(1,1), q_(2,1), . . . , q_(L,1), and q_(1,2), q_(2,2), . . . , q_(L,2). The steady-state equation for the corner unit at (1, 1) imposes the constraints q_(1,2) − q_(1,1) = q_(2,1) − q_(1,1) and q_(1,1) − q_(2,2) = q_(2,1) − q_(1,1). Continuing down the row of border units to the other corner unit at (L, 1), the steady-state equations constrain the probability measures of all the units in these two rows to be equal. The equations for the second row units extend the equipartition to the third row. This proceeds until the probability measures of all the units are shown to be equal. Thus, the equipartition steady state is the only steady state of the dynamics.

Appendix C: Alignment Algorithms

Although the exact form of the alignment algorithms is not essential, we describe here some of the simple local algorithms used in our simulations. First, after each update, the units in the network were moved toward a boxcar convolution of the unit positions. The kernel used was (1/3, 1/3, 1/3) in one dimension and (1/3, 1/3, 1/3)ᵀ in the other dimension. The units were moved a fraction of the distance toward the smoothed positions. For Figure 6, the fraction was set to 0.02. This tends to straighten out the individual rods in the network.

Second, we implemented an algorithm that attempts to make each primitive cell a parallelogram. Unit positions were adjusted to try to equalize the opposing lengths in each primitive cell as follows. In a primitive cell composed of units at a, b, c, and d,

a → a + (η/2) [(a − d)(−1 + l2/l4) + (a − b)(−1 + l3/l1)],

where η is a gain term and l1, l2, l3, l4 are the lengths |a − b|, |b − c|, |c − d|, and |d − a|. The other three units are also updated accordingly. For Figure 6, η was set to 0.05. This algorithm has the effect of making the individual rods parallel.

Acknowledgments

We thank Trevor Mundel, Alexander Dimitrov, and the anonymous referees for many helpful comments.
References

Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Comp., 4, 196–210.
Barlow, H. B. (1989). Unsupervised learning. Neural Comp., 1, 295–311.
Bauer, H.-U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Comp., 8, 757–771.
Bell, A. J., & Sejnowski, T. J. (1995a). An information-maximization approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1995b). Fast blind separation based on information theory. In Proc. 1995 International Symposium on Nonlinear Theory and Applications (Vol. 1, pp. 43–47). Tokyo: NTA Research Society of IEICE.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
DeSieno, D. (1988). Adding a conscience to competitive learning. IEEE Int. Conf. on Neural Computing, 1, I117–I124.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559–601.
Golubitsky, M., Stewart, I., & Schaeffer, D. G. (1988). Singularities and groups in bifurcation theory. Berlin: Springer-Verlag.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Koornwinder, T. (Ed.). (1993). Wavelets: An elementary treatment of theory and applications. Singapore: World Scientific.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp., 1, 402–411.
Preparata, F. P., & Shamos, M. I. (1988). Computational geometry: An introduction. New York: Springer-Verlag.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern., 54, 99–106.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Received May 8, 1996; accepted November 22, 1996.
Communicated by Klaus Obermayer and Geoffrey Goodhill
Self-Organized Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM

Teuvo Kohonen
Samuel Kaski
Harri Lappalainen
Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland
The adaptive-subspace self-organizing map (ASSOM) is a modular neural network architecture, the modules of which learn to identify input patterns subject to some simple transformations. The learning process is unsupervised, competitive, and related to that of the traditional SOM (self-organizing map). Each neural module becomes adaptively specific to some restricted class of transformations, and modules close to each other in the network become tuned to similar features in an orderly fashion. If different transformations exist in the input signals, different subsets of ASSOM units become tuned to these transformation classes.

1 Introduction

A long-standing problem in artificial perception has been to recognize patterns subject to simple transformations, such as translation, rotation, or scaling. A frequently held view is that invariances can be achieved by filtering of the pattern features that then become mapped into respective invariance classes. Another problem, equally important and intriguing, then ensues: how to determine a set of effective feature filter functions for arbitrary statistics of dynamic signals. These problems seem to have been solved in biological systems in some effective way. In this article, we devise a corresponding technical solution.

In traditional image analysis and pattern recognition techniques, the primary observations are usually preprocessed in a separate stage that extracts a number of invariant features. On the basis of these features, a statistical decision-making machine is then able to identify or classify the target invariantly. In applications where no prior standardization of the samples is possible—for instance, when the targets are moving or their illumination is varying—the wavelet transforms such as the Gabor transforms (Gabor, 1946; Daubechies, 1990) have become popular as extractors of approximately invariant local features. These transforms, although optimizable by standard techniques, have the limitation that their mathematical forms must be fixed a priori. The learning scheme discussed in this article differs from all the
previous approaches in that the forms of the filter functions are learned directly from short segments or episodes of dynamical signals or pattern samples. If the samples of the episode can be derived from each other by some linear transformation such as translation, they span a special manifold in the input signal space, namely, a linear subspace.

The ASSOM (adaptive-subspace self-organizing map) architecture discussed in this article was introduced by one of the authors in 1995 (Kohonen 1995a, 1995b, 1995c, 1996); this article is a continuing study of it. In this scheme, the various feature filters emerge in learning and become tuned to those manifolds of input patterns that are defined only transiently in short segments of signals but that occur sufficiently often. These filters thus learn not the pattern prototypes as such, but the different low-dimensional manifolds spanned by the consecutive signal pattern vectors.

The discussion here constitutes a viable alternative to the standard principal component analysis (PCA) methods of feature extraction (Hotelling, 1933) for which neural approaches have been presented (Oja, 1983, 1992; Rubner & Tavan, 1989; Cichocki & Unbehauen, 1993). The PCA extracts average features that reflect the global stationary statistical properties of the pattern sequence. The ASSOM, on the contrary, creates a set of statistical representations referring to different temporal events, and by competitive (winner-take-all) selection and learning, each of the representations selectively concentrates on a different class of temporal episodes. Moreover, due to the spatial interactions between the processing units of the ASSOM during learning, characteristic of all the SOM (self-organizing map) methods (Kohonen, 1982, 1995c) in general, the various feature filters emerging in this process will be organized spatially, processing units being ordered along some characteristic feature coordinates.

We should mention a couple of recent works (Földiák, 1991; Wallis, Rolls, & Földiák, 1993) in which neural units learn sequences of simple patterns. In these works, a fixed input layer of a two-layer architecture projects to a competitive second layer, where the winning unit learns a trace of successive inputs. The results obtained with the model do not, however, explain the emergence of the input-layer filters, which are essential for learning arbitrary manifolds. Effective and organized nontrivial invariant-feature filters can ensue only from the cooperation of several kinds of neural components, such as those constituting the ASSOM network architecture discussed here.
2 Auxiliary Concepts

First we set out to discuss the following subproblems in this section: invariant recognition of a pattern as invariant recognition of all its feature components, definition of invariance classes for feature components as linear subspaces, and adaptive learning of the invariance classes from short sequences of input patterns called the episodes.
2.1 Decomposition of a Pattern into Features Mapped into Invariance Classes. Vectors that represent dynamic sensory environments are usually distributed along manifolds, which are continuous subsets or zones in the vector space. Consider tentatively that the input pattern represents some fundamental feature component of a more complex object and is thus very simple, such as the Gabor functions are. Its variations, caused by a particular class of transformations of the basic pattern such as translation, can then be thought to span a low-dimensional manifold. For the simplest transformations, the manifold can even be represented by a linear subspace, in which every vector is linearly dependent on the others. The class of linear transformations of the same fundamental component will be mapped into the same linear subspace, which is then regarded as an invariance class. If a more complex pattern can then be decomposed into a sum of such simple feature components, approximately at least, then all terms in the sum become mapped into their respective linear subspaces invariantly, whereby the whole pattern will be represented invariantly with respect to this same linear transformation.

2.2 Sample Vectors. The input to a neural network model may be described by a real vector x ∈ ℝⁿ. The vector x = x(t) may consist of samples of a time-domain signal with intensity I, taken at points displaced from a time-dependent reference point t by the lags v_i ∈ ℝ:

x(t) = [I(t − v_1), I(t − v_2), . . . , I(t − v_n)]ᵀ.    (2.1)
Alternatively, the input vector x may be composed of samples of a two-dimensional input image, collected from points with intensity I around a time-dependent reference point r = r(t) ∈ ℝ². These sampling points are displaced relative to r by the amounts s_i ∈ ℝ²:

x(t) = x[r(t)] = [I(r(t) − s_1), I(r(t) − s_2), . . . , I(r(t) − s_n)]ᵀ.    (2.2)
The v_i and s_i, respectively, form a regular or irregular sampling grid. The time series and the images, as well as patterns eventually representable in higher-dimensional domains, will be discussed below in the framework of the general vector space formalism.

2.3 Episodes, Signal Subspaces, and Orthogonal Projections. One of the basic problems is how the linear subspaces, the invariance classes, can be defined; another is how to test whether an unknown pattern belongs to a subspace. Consider, for instance, time-domain signals. If N sample vectors (N < n) are collected from successive reference points t_p, the set of the nearby sample vectors {x(t_p)} spans some linear signal subspace of the total input space. The set of the reference time instants S = {t_p} will be called an episode. The maximum dimensionality of this subspace is equal to the number of
sampling instants in S, but the dimensionality can be much smaller if the sample vectors are dependent, as is often the case. The connection between subspaces and invariances lies in realizing that the signal subspace itself is an invariant feature of the input vector x.

A linear subspace L of dimensionality H is in general completely defined by the general linear combination of the linearly independent basis vectors b_1, . . . , b_H. However, the choice of the b_h is not unique; there exist infinitely many equivalent combinations of the b_h for the same L, in some of which the b_h can be orthonormal. If a set of basis vectors is known, a set of equivalent orthonormal basis vectors for L can be computed by the familiar Gram-Schmidt process (for a textbook account, cf. Kohonen 1989, 1995c).

The basic operation to determine whether an unknown vector x belongs to L is the orthogonal projection of x on L, denoted x̂. If x̂ = x, then x belongs to L. For an arbitrary x, one can define its distance from L as ‖x̃‖ = ‖x − x̂‖, where the norm is Euclidean. The orthogonal projection x̂ can be computed as either of the following simple expressions, provided that the b_h are orthonormal:

x̂ = (∑_{h=1}^{H} b_h b_hᵀ) x = ∑_{h=1}^{H} (b_hᵀ x) b_h.    (2.3)
The latter expression is computationally lighter. In the case that x does not belong to any subspace exactly, it can still be identified with the closest one. The factor in the first parenthesis of equation 2.3 is a projection operator matrix abbreviated P, and there holds P² = P, Pᵀ = P.

The pattern space is divided into pattern zones, each of which contains one of the above subspaces. The optimal separating surface between the ith and jth zone is the set of those points x, the projections of which on the respective subspaces L^(i) and L^(j) have equal norms. The separating surface is thus defined by the following quadratic equation in x:

xᵀ (P^(i))² x − xᵀ (P^(j))² x = xᵀ (P^(i) − P^(j)) x = 0,    (2.4)

where P^(i) and P^(j) are the projection operators to L^(i) and L^(j), respectively.
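To make the projection machinery concrete, the following sketch (ours; the random bases are stand-ins for learned ones) computes x̂ per equation 2.3 and classifies x by the sign of the quadratic form of equation 2.4.

```python
import numpy as np

def project(B, x):
    """Orthogonal projection of x on the subspace spanned by the
    orthonormal columns of B: x_hat = sum_h (b_h^T x) b_h (eq. 2.3)."""
    return B @ (B.T @ x)

rng = np.random.default_rng(2)
Bi, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # orthonormal basis of L(i)
Bj, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # orthonormal basis of L(j)
x = rng.normal(size=6)

Pi, Pj = Bi @ Bi.T, Bj @ Bj.T                  # projection operators
assert np.allclose(Pi @ Pi, Pi)                # P^2 = P
assert np.allclose(project(Bi, x), Pi @ x)

side = x @ (Pi - Pj) @ x                       # quadratic form of eq. 2.4
print("x is closer to L(i)" if side > 0 else "x is closer to L(j)")
```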
2.4 The Network Model. It may be helpful to imagine that the set of subspaces, the invariance classes, corresponds to a two-layered neural architecture, organized into modules that function as processing units (see Figure 1). Each of the modules represents a subspace denoted by L^(i). The neurons in the input layer are linear; if the weight vector of neuron h in the input layer of module i is denoted by b_h^(i) and the input by x, respectively, the output of this neuron is xᵀ b_h^(i). The weight vectors b_h^(i) form the basis of the subspace L^(i), and they will be referred to as basis vectors. The neurons in the output layer have some quadratic transfer function Q. If in practical
Figure 1: Neural network that consists of linear-subspace modules. The modules are separated by dashed lines. Q: quadratic neuron.
simulations the vectors b_h^(i) in each module are kept mutually orthonormal, Q is simply the sum of squares of the inputs of the second-layer neuron. The sum of squares formed by the output neuron in each module equals the squared norm of the projection x̂^(i) of the input on the subspace L^(i), or ‖x̂^(i)‖². The output of the module can then be interpreted as the degree of matching between the input and the subspace. Thus, the output of each module is invariant (neglecting eventual changes in the norm of the input) to restricted linear transformations that occur within the subspace represented by the module. In the competitive learning process discussed below, the modules learn to represent different invariance classes.

Competitive learning in this network also involves neighborhood interactions between the modules similar to the neighborhood function in the standard SOM algorithm (Kohonen 1982, 1995c). Due to these interactions, neighboring modules learn to represent similar features, or similarly transformed features, and the array of the modules becomes tuned to the features in an orderly fashion. In the larger scale, different areas in the network may become specialized to different input types, such as transforms, as a result of these interactions.

3 Theory of the ASSOM

In this article, we not only want to create sets of feature filters; we also want these filters to become self-organized into a spatial order that reflects the similarity of the feature values they detect. Such a "map" is easier to inspect, and the artificially produced "maps" can then be better compared with those existing in the neural realms.

3.1 Learning. The ASSOM units learn to become invariant, approximately at least, to the transformations that exist in their environment by modifying their own subspaces to improve matching with the input signal subspaces. The latter are spanned by the successive samples in the episodes S, as discussed in section 2.3.
In order to approximate the input episodes by subspaces of low dimensionality, in the experiments described here, we fixed the ASSOM modules to have two input neurons, the minimum number required to extract invariant features (cf. Gabor filters). It is also quite possible to use more input neurons. When these modules are made to compete on the input signal subspaces, the module that represents a given signal subspace best wins the competition, and it, together with its neighbors in the array, is adapted to represent the input subspace even better. In this way the different modules become gradually specialized to represent different types of input subspaces. As the modules in the neighborhood of the winner are also adapted to represent the input better, the neighboring modules gradually learn to represent similar inputs; this effect can be regarded as a kind of smoothing of the neighboring representations. Such "neighborhood learning" was also the basis of the SOM algorithm, where the weight vectors of processing units become spatially ordered in the map array. The ASSOM learning process can be summarized in general terms as follows:

1. Locate the module ("winner") c, the subspace L^(c) of which is closest to the signal subspace of the input episode S.

2. Adjust the subspace L^(c) of the winning module and modules in its neighborhood in the ASSOM array closer to the signal subspace.

Direct comparison of the signal subspace with the subspaces of the neural modules could be rather difficult, for instance, because the dimensionality of the input signal subspace may be defined vaguely due to noise in the samples. A simpler and much more robust definition of match between the input subspace and the neural subspaces is comparison of each input sample of the episode separately with the neural subspaces, and computation of the "energy" (sum of squares) of the projections of the input samples of the episode on each of the subspaces of the ASSOM modules. The module with the largest energy—the sum of the squared projections of the samples collected during an episode—wins the competition. Denote the projections of the samples collected during S by x̂^(i)(t_p). The representative winner for the whole episode S, again indexed by c, is then defined by the expression

c = arg max_i ∑_{t_p ∈ S} ‖x̂^(i)(t_p)‖²,

or equivalently,

c = arg min_i ∑_{t_p ∈ S} ‖x̃^(i)(t_p)‖².    (3.1)
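A direct transcription of this winner rule into code is straightforward; the sketch below (ours) assumes each module's basis vectors are stored as orthonormal columns, so the projection energies reduce to sums of squared inner products.

```python
import numpy as np

def episode_winner(bases, X):
    """Winner c of equation 3.1 (a sketch).

    bases : list of (n, H) arrays with orthonormal columns, one per
    ASSOM module; X : (N, n) array of episode samples x(t_p).
    The module collecting the largest projection energy over the
    episode wins the competition.
    """
    energies = [np.sum((X @ B) ** 2) for B in bases]
    return int(np.argmax(energies))
```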
The final goal in the competitive learning of subspaces is to minimize the
average expected error when a stochastically selected input subspace is represented by the subspace of the winning module. A somewhat related error principle was also used in the classical vector quantization (Gersho, 1979; Gray, 1984; Makhoul, Roucos, & Gish, 1985). The distance between the subspace spanned by the input samples and the subspace L^(i) will here be measured in terms of the energy of the relative projection errors of the input samples.

Consider tentatively that no neighborhood interactions between the modules are yet taken into account and that the input vectors are normalized. The average expected distance between the input subspace and the subspace of the winning module, computed over the space of all possible Cartesian products X of input samples collected during the episodes S, can be measured in terms of the projection error x̃^(c) and is expressed by the objective function

E = ∫ ∑_{t_p ∈ S} ‖x̃^(c)(t_p)‖² p(X) dX.    (3.2)
Here p(X) is the probability density of the X, and dX is a shorthand notation for a volume differential in the integration space, the Cartesian product of all the samples of the episode. The index of the winning module c depends on the whole episode, as well as on the subspaces L^(i) of all modules i. Thus the exact minimization of equation 3.2 by differentiation may be very complicated (Kohonen, 1991, 1995c; for recent alternative treatments, see Buhmann & Kühnel, 1993; Luttrell, 1994). Therefore we shall resort to the classical Robbins-Monro stochastic approximation (Robbins & Monro, 1951) in minimizing equation 3.2 in the same way as in the derivation of the traditional SOM algorithms (Kohonen, 1991, 1995c). The objective function need not be a Lyapunov or energy function; nonetheless, the convergence of the stochastic approximation has been established generally (Albert & Gardner, 1967; Kushner & Clark, 1978; Wasan, 1969).

The objective function (see equation 3.2) describes the goal of pure unordered competitive learning. In Kohonen (1991, 1995c), to enforce ordering in the old vector quantization problems, the integrand in the error function E was smoothed locally with respect to the processing modules, which leads to the SOM algorithms. The same modification can also be made here by multiplying the integrand with the neighborhood kernel h_c^(i), which is a decreasing function of the distance between the modules c and i in the ASSOM array. The new objective function then reads

E₁ = ∫ ∑_i ∑_{t_p ∈ S} h_c^(i) ‖x̃^(i)(t_p)‖² p(X) dX.    (3.3)
Although attempts to optimize cost functions of the type in equation 3.3
have not succeeded in closed form, it has been possible to use the Robbins-Monro stochastic approximation to derive the SOM algorithm (Kohonen, 1991, which also compared the exact and approximate optima). In this approximation, the gradient of E₁ is evaluated on the basis of the last input x and the last values of b_h^(i) during the sampling sequence. Here the whole set X = X(t) is regarded as constituting a "sample" in the stochastic approximation. Thus, instead of optimizing the original objective function E₁, in this approximation a step in the direction of the negative gradient of the sample function, E₂(t), is taken:

E₂(t) = ∑_i h_c^(i) ∑_{t_p ∈ S(t)} ‖x̃^(i)(t_p)‖².    (3.4)
Let us recall that x̃^(i) = x − x̂^(i), where x̂^(i), as a function of the b_h^(i), is given by equation 2.3. If the basis vectors are kept orthonormal, the gradient of the sample function with respect to basis vector b_h^(i) is

∂E₂/∂b_h^(i) = −2 h_c^(i) [∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ] b_h^(i).    (3.5)
Comment 1. As index c would be changed abruptly when the signal subspace passes the border between its two closest subspaces L^(i), it will be necessary to stipulate that the signal subspace does not belong to the infinitesimal neighborhood of such a border in equation 3.5. For continuous, stochastic signals, this possibility can be neglected. (See a similar discussion in Kohonen, 1991, 1995c.)

Comment 2. Because the basis vectors that span a subspace are not unique, it might seem that the gradient-descent method cannot be used because the optimum is not unique. Notwithstanding, it may be clear that any of these optima (for which E₂ is minimized) is an equivalent solution, so it will suffice if one optimal basis is found.

With a step length of λ(t)/2 in the direction of the negative gradient, we obtain the rule

b_h^(i)(t + 1) = b_h^(i)(t) − (1/2) λ(t) ∂E₂/∂b_h^(i)
              = [I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ] b_h^(i)(t).    (3.6)
The parameter λ(t) is called the learning-rate factor. In stochastic approximation, the λ(t) must satisfy the following conditions (Robbins & Monro, 1951; Kushner & Clark, 1978): ∑_{t=0}^{∞} λ(t) = ∞, ∑_{t=0}^{∞} λ²(t) < ∞.
In the learning law (see equation 3.6), the operator

R₁(λ(t)) = I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ,    (3.7)
is applied to the basis vector b_h^(i). In the derivation of the learning law, the input vectors were tentatively assumed to be normalized. If they are not, the normalization can be incorporated into the objective function (see equation 3.3) by considering relative projection errors ‖x̃^(i)(t_p)‖²/‖x(t_p)‖². Then the rotation operator (see equation 3.7) becomes

R₂(λ(t)) = I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ / ‖x(t_p)‖².    (3.8)
Earlier (Kohonen, 1995c, 1996) we showed that the following slightly different law, a product of unnormalized rotation operators, can be used for the competing neural units to learn the different input subspaces:

R₃(λ(t)) = ∏_{t_p ∈ S(t)} [I + λ(t) h_c^(i) x(t_p) x(t_p)ᵀ / ‖x(t_p)‖²].    (3.9)
When λ(t) is small, equations 3.8 and 3.9 can immediately be found equivalent. In fact, equation 3.8 can be seen as a batch version of equation 3.9, where the whole episode is used in one operation. One of the authors earlier proved the so-called adaptive-subspace theorem (Kohonen, 1996), which states that if there is only one subspace L^(i), it will converge to L* ⊆ L^(s) almost surely, where L^(s) is the signal subspace from which the samples x(t_p) are picked at random. We shall henceforth use the operator R₃(λ(t)), because it applies the input samples x(t_p) in succession, and some extra operations explained in section 3.2 shall be made after each operation. The operator R₃(λ(t)) in fact corresponds to Hebb-type learning: the modification of a synapse (a component of a basis vector) is proportional to the product of its input and the output of the neuron. Since expressions 3.8 and 3.9 do not preserve the norms of the basis vectors, for good performance the basis vectors should be orthonormalized, at least after some number (e.g., a few dozen) of learning steps.
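As a rough sketch of the resulting learning step (our illustration, with the neighborhood kernel left abstract), the operator of equation 3.9 can be applied sample by sample and followed by re-orthonormalization; the error-weighted variant introduced in section 3.2 would change only the scalar factor.

```python
import numpy as np

def assom_learn_episode(bases, X, c, lam, h):
    """Apply the operator R3 of equation 3.9 to every module for one
    episode, then re-orthonormalize the bases.

    bases : list of (n, H) matrices of basis vectors (columns);
    X     : (N, n) array of episode samples x(t_p), one per row;
    c     : index of the winning module; lam : learning rate;
    h     : callable h(i, c) giving the neighborhood kernel value.
    """
    n = X.shape[1]
    for i, B in enumerate(bases):
        R = np.eye(n)
        for x in X:                   # input samples applied in succession
            R = (np.eye(n)
                 + lam * h(i, c) * np.outer(x, x) / np.dot(x, x)) @ R
        # QR factorization plays the role of the Gram-Schmidt process.
        bases[i], _ = np.linalg.qr(R @ B)
    return bases
```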
not the case, the corrections made in neighboring units would depend strongly on their basis vectors before correction, which would lead to unstable self-organization. The magnitude of the correction made on the subspace $L^{(i)}$ should be a monotonically increasing function of the true error $\|\tilde{x}^{(i)}\| = \|x - \hat{x}^{(i)}\|$ or a monotonically decreasing function of $\|\hat{x}^{(i)}\|$. In the projective methods, the correction (see equation 3.6) would be proportional to $x(t_p)^T b_h^{(i)}$, which decreases with increasing angle between $x(t_p)$ and $b_h^{(i)}$. A simple method for guaranteeing that the correction is an increasing function of the error (cf. Kohonen, 1996) is to divide the learning-rate factor $\lambda(t)$ by the scalar value $\|\hat{x}^{(i)}\| / \|x\|$, which changes only the effective learning rate. The operator (see equation 3.9) then becomes

$$R^{(i)}(t) = \prod_{t_p \in S(t)} \left[ I + \lambda(t)\, h_c^{(i)}\, \frac{x(t_p)\, x(t_p)^T}{\|\hat{x}^{(i)}(t_p)\| \, \|x(t_p)\|} \right]. \tag{3.10}$$
In the course of our simulations, reported in section 4.2, it turned out that the organization of the ASSOM was not always perfect even when the modified rotation operator (see equation 3.10) was used. For instance, when the input consisted of speech signals, some modules were prone to become tuned to two separate frequency bands. An explanation of this phenomenon is that a high-order linear filter can in general easily learn a rather complicated passband function, even one with many passbands; it is mainly by virtue of the competitive-learning process that single passbands are formed at the filters at all, because the process tries to distribute them in an orderly fashion over the frequency domain. When such two-band filters were formed in self-organization, the frequency range between the two bands remained unrepresented by the ASSOM array; that is, a gap in the distribution was formed (see section 4.2).

Without further speculation it may be mentioned that this instability can be eliminated by the following simple method, which has been empirically demonstrated to produce filters that are much better distributed in the frequency domain. If, after each learning step, the magnitude of small components of the basis vectors $b_h^{(i)}$ is set to zero, then the basis vectors will be forced to approximate the stronger and more fundamental frequency components. If we denote the basis vectors by $b_h^{(i)} = [b_{h1}^{(i)}, \ldots, b_{hn}^{(i)}]^T \in \mathbb{R}^n$, then a very simple dissipation effect of this kind can be described by the equation

$$b_{hj}^{\prime(i)} = \mathrm{sgn}(b_{hj}^{(i)})\, \max(0, |b_{hj}^{(i)}| - \varepsilon), \tag{3.11}$$

where $\varepsilon$ ($0 < \varepsilon < 1$) is a very small term. The amount of dissipation should be a fraction of the magnitude of the correction: $\varepsilon = \varepsilon_h^{(i)}(t) = \delta\, |b_h^{(i)}(t) - b_h^{(i)}(t-1)|$,
where $\delta$ is a small scalar parameter. The dissipation is applied after each rotation and before normalization.

Other regularizing principles have also increased the stability of self-organization in our experiments. For instance, there exist excessive "parasite" degrees of freedom in the representation of basis vectors because of various symmetries. These parasite degrees of freedom often result in asymmetric filters; for instance, the filters may become centered at any point in the sampling grid. To prevent the ASSOM from becoming organized along irrelevant feature dimensions, there is a simple method for favoring the middle of the sampling grid without significantly affecting the learning process in other respects: weighting the lattice points by a gaussian function centered in the middle. If the function is wide enough, it affects only the components of the basis vectors quite near the edges. Since all input vectors are weighted similarly, this weighting does not alter the transformation group inherent in the input vectors but enhances symmetry.

To speed up the organization process, the gaussian weighting function can also be made time dependent. In the beginning of the organizing process, the function can be selected to be rather narrow, in order to fix the middle points first. After that, the width of the gaussian is increased, and the modules learn to represent their inputs more and more accurately, using more and more peripheral points of the sampling grid.

Still another method for enhancing the organizing power, a standard method for ensuring good organization in ordinary SOMs, is to make the neighborhood kernel dependent on time, $h_c^{(i)} = h_c^{(i)}(t)$. In the beginning of learning, the kernel should be very wide, effectively comprising half or even more of the map. When the width then gradually decreases to its final value, the map approximates its inputs more closely, while at the same time preserving the global ordering achieved in the beginning. In our simulations, we used a simple box-shaped kernel. The modules up to a certain radius from the winner were thereby updated equally ($h_c^{(i)}(t) = 1$), whereas the other modules were not updated at all during this learning step ($h_c^{(i)}(t) = 0$).

3.3 Summary: The ASSOM Learning Algorithm. For each learning episode $S(t)$, consisting of successive time instants $t_p \in S(t)$, do the following (a schematic implementation is sketched after the algorithm):

• Find the winner, indexed by $c$:
$$c = \arg\max_i \sum_{t_p \in S(t)} \|\hat{x}^{(i)}(t_p)\|^2 .$$
• For each sample $x(t_p)$, $t_p \in S(t)$:

1. Rotate the basis vectors of the modules:
$$b_h^{(i)}(t+1) = \left[ I + \lambda(t)\, h_c^{(i)}(t)\, \frac{x(t_p)\, x(t_p)^T}{\|\hat{x}^{(i)}(t_p)\| \, \|x(t_p)\|} \right] b_h^{(i)}(t).$$

2. Dissipate the components $b_{hj}^{(i)}$ of the basis vectors $b_h^{(i)}$:
$$b_{hj}^{\prime(i)} = \mathrm{sgn}(b_{hj}^{(i)})\, \max(0, |b_{hj}^{(i)}| - \varepsilon), \quad \text{where } \varepsilon = \varepsilon_h^{(i)}(t) = \delta\, |b_h^{(i)}(t) - b_h^{(i)}(t-1)|.$$

3. Orthonormalize the basis vectors of each module. The orthonormalization can also be performed less frequently, say, after every hundred steps.

Above, $\lambda(t)$ and $\delta$ are suitable small scalar parameters.
4 Experiments

We next show what kinds of invariant-feature filters emerge in the ASSOM learning process in different sensory environments, that is, when the input to the ASSOM is subjected to various transformations. The emerging filters reflect statistical characteristics of short sequences of input patterns, not waveforms as such. The filters will learn only such pattern components of short input sequences as satisfy certain linear constraints, such as lying on a linear subspace. Patterns that do not satisfy these constraints will be averaged out in the long run. The resulting basis vectors in fact represent kernels of transforms that produce the invariant outputs.

Before reporting the final results, we first define some details of the simulations. Then we demonstrate the enhancement of the organizing power due to the nonlinear dissipation effect introduced in section 3.2.

4.1 Details of the Simulations. In this section we summarize the kinds of input patterns used and how they were transformed.

4.1.1 Translated Speech Signal. In the first experiments, the input signal consisted of digitized time-domain speech waveforms from the TIMIT (1988) speech data set, which contains phrases spoken by a large number of U.S. speakers, sampled at 12.8 kHz. Before picking the input vectors, high frequencies in the signal were enhanced by taking differences of successive samples (high-pass filtering). The input vectors consisted of 64 successive samples of the speech waveform. Each learning episode began at a randomly chosen time instant and consisted of eight vectors, each displaced by a random amount (eight samples on the average) from the previous one.
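As an illustration, episode construction might look as follows in Python. The uniform distribution of the displacements and the per-window unit-length normalization are our assumptions; the paper states only the mean displacement and that all inputs were normalized.

```python
import numpy as np

def speech_episode(signal, dim=64, n_vec=8, mean_shift=8, rng=np.random):
    """Construct one training episode from a speech waveform (section 4.1.1).

    The signal is high-pass filtered by differencing successive samples;
    the episode consists of n_vec windows of dim samples, each displaced
    from the previous one by a random amount (mean_shift samples on the
    average; here drawn uniformly from 1..2*mean_shift-1, an assumption).
    """
    hp = np.diff(signal)                    # high-pass: successive differences
    t0 = rng.randint(0, len(hp) - dim - 2 * mean_shift * n_vec)
    episode, t = [], t0
    for _ in range(n_vec):
        x = hp[t:t + dim].astype(float)
        episode.append(x / (np.linalg.norm(x) + 1e-12))  # unit-length inputs
        t += rng.randint(1, 2 * mean_shift)              # random displacement
    return np.array(episode)
```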
Figure 2: A sample speech waveform. The locations of the eight input vectors (with starting times $t_1, \ldots, t_8$) belonging to an episode are identified by the line segments under the speech waveform.
Thus, the transformation group inherent in the data consisted of translations in time. A segment of the original signal and the eight input vectors forming an episode are shown in Figure 2.

4.1.2 Two-Dimensional Colored Noise. Earlier (Kohonen, 1995c) we demonstrated that two-dimensional filters can be formed in response to photographic images. Here we wanted to specify the images statistically. Thus, in the rest of the experiments reported here, two-dimensional patterns consisting of colored noise were used. The patterns were formed by low-pass filtering white noise with a second-order Butterworth filter whose cutoff frequency was at 60 percent of the Nyquist frequency. The image was sampled with a circular, uniformly spaced sampling grid consisting of 316 pixels.

In the experiments, the inputs were translated horizontally, rotated, or scaled. The center of both the rotation and the scaling operations was fixed at the center of the sampling grid. The signal subspaces spanned by the samples of an episode can be regarded as low dimensional only if the transformations between the samples are modest. The amount of translation was varied from 0 to 10 pixels, in equal steps with a 10 percent random displacement at each step.
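A sketch of the colored-noise pattern generation just described (before the circular sampling and the transformations) is given below; filtering the 2-D white noise separately along both image axes is our assumption, as the paper does not specify the 2-D filtering details.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def colored_noise_image(shape=(64, 64), cutoff=0.6, rng=np.random):
    """Colored-noise pattern as in section 4.1.2 (sketch): white noise
    low-pass filtered with a second-order Butterworth filter, cutoff at
    60 percent of the Nyquist frequency."""
    b, a = butter(2, cutoff)        # cutoff is relative to the Nyquist frequency
    img = rng.randn(*shape)
    img = filtfilt(b, a, img, axis=0)   # filter along rows
    img = filtfilt(b, a, img, axis=1)   # and along columns (our assumption)
    return img
```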
Figure 3: Samples of transformed colored noise patterns. (a) Translated inputs. (b) Rotated inputs. (c) Scaled inputs. Each horizontal row forms an episode.
The amount of rotation ranged from 0 to 1 radian, divided into equal steps with a 10 percent random rotation at each step. The amount of scaling used in an episode increased in equal steps from 1.0 to 1.5, with a 10 percent random variation applied at each step. The translated images were weighted by a vertical gaussian kernel, to force all templates to become vertically centered at the same location (see section 3.2). The other kinds of inputs, scaled and rotated, were not weighted.

The inputs were high-pass filtered in time by taking differences of successive input vectors, to enhance the changing parts of the input. High-pass filtering is necessary, for example, for scaled images, where the center of the scaling remains approximately constant; a well-functioning invariance filter would otherwise model predominantly this invariant area instead of the changes in the stimuli produced by scaling. Before filtering, the number of samples in each episode was six; taking the differences reduced it to five. Exemplary episodes of translated, rotated, and scaled colored noise are shown in Figure 3.

4.1.3 Methods. The ASSOM lattice consisted, unless otherwise stated, of 24 units organized as a one-dimensional array. Each ASSOM unit contained two basis vectors, $b_1^{(i)}$ and $b_2^{(i)}$, which were initialized with random values before learning. All inputs were normalized to unit length. The learning process was summarized in section 3.3. The neighborhood kernel was box shaped, with $h_c^{(i)} = 1$ in the neighborhood of the winner and $h_c^{(i)} = 0$ for the other units. The basis vectors were orthonormalized after each adaptation step. If the computing time is critical, the orthonormalization can be performed less frequently.
Figure 4: The learning parameters versus number of learning episodes. The learning-rate factor λ(t) was of the form 0.1 · T/(T + 99t), where T is the total number of episodes during learning. In T steps the radius of the box-shaped neighborhood decreased linearly from seven lattice spacings to one spacing. The gaussian function used for weighting the sampled input pattern changed according to the curve marked by FWHM(t), full width at half maximum, expressed in terms of sampling points. The FWHM increased linearly from 8 to 50 sampling points.
During the learning process, the parameters changed according to Figure 4.

4.2 Demonstration of the Enhancement of the Organizing Power.

4.2.1 Preliminary Experiment Without Dissipation. In the first experiment with the ASSOM algorithm, the input consisted of translated speech signals. After 2000 learning episodes, the ASSOM units represented different frequency components in an orderly fashion (see Figure 5). The two basis vectors in each unit, $b_1^{(i)}$ and $b_2^{(i)}$, became tuned to approximately the same frequency band and had a 90-degree phase shift with respect to each other. Each output of a unit (the squared projection of the input on the subspace of the unit) became selective to a certain frequency band and was invariant to translations in time, as long as the frequency content of the input did not change.

If the learning period was extended, it turned out that the organizing result was not stable. In continued learning, a gap appeared in the representation. Some units had the tendency to represent several, usually two, distinct frequency components instead of one compact frequency band, and some range of the input frequencies was left unrepresented.
Figure 5: Tentative results with translated speech waveforms as inputs. Notice that the diagrams are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$).
This occurred especially at the minima of the frequency spectrum. Figure 6 shows the filters after a learning process consisting of 100,000 episodes. There is a clear defect in the organization: a wide gap at frequencies 2000 through 3500 Hz in the distribution of the dominant frequencies of the units (see Figure 6e). It is notable that the frequency response of unit number 18 (closest to the gap), defined as the output of the unit as a function of the input frequency, consists of two separate peaks corresponding to the borders of the frequency gap (see Figure 6d). The ability of the ASSOM filters to represent several frequencies at the same time was discussed in section 3.2.

4.2.2 Reducing Excessive Degrees of Freedom Using Dissipation. After the dissipation term discussed in section 3.2 was introduced, the frequency responses of all units obtained a single, relatively narrow passband (see Figure 7d). The value 0.02 was used for the dissipation constant $\delta$. The gap in the distribution of the dominant frequencies of the ASSOM units was thereby closed (see Figure 7e). The steepest slope in the distribution of filters corresponds to a local minimum in the frequency spectrum. If Figures 7a and 7b are inspected carefully, the effect of the nonlinearity due to dissipation can be seen as small shoulders in the basis vectors at small values. It may be interesting to note that an effective gaussian-shaped time window was determined by the wavelet shapes (see Figure 7c), and its width seemed to be optimized for each filter separately.
Figure 6: Results after a longer learning process: the tentative results turned out to be unstable in the long run. Notice that the diagrams a–d are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). (c) $a_j^{(i)} = (b_{1j}^{(i)})^2 + (b_{2j}^{(i)})^2$, where $b_1^{(i)} = [b_{1,1}^{(i)}, \ldots, b_{1,64}^{(i)}]^T$, etc. (d) The response of the units to different frequencies, $F_j^{(i)} = (f_{1j}^{(i)})^2 + (f_{2j}^{(i)})^2$, where $f_1^{(i)}$ and $f_2^{(i)}$ are the Fourier transforms of $b_1^{(i)}$ and $b_2^{(i)}$, respectively. (e) The dominant frequencies of the units. The bars indicate the FWHM of the frequency responses $F^{(i)}$.
Figure 7: The final results with speech waveforms, when the nonlinear dissipation effect was included. Notice that the diagrams a–d are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). (c) $a_j^{(i)} = (b_{1j}^{(i)})^2 + (b_{2j}^{(i)})^2$. (d) The response of the units to different frequencies, $F_j^{(i)} = (f_{1j}^{(i)})^2 + (f_{2j}^{(i)})^2$, where $f_1^{(i)}$ and $f_2^{(i)}$ are the Fourier transforms of $b_1^{(i)}$ and $b_2^{(i)}$, respectively. (e) The dominant frequencies of the units. The bars indicate the FWHM of the frequency responses $F^{(i)}$.
Figure 8: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of an ASSOM that is invariant to horizontal translations. Notice that the diagrams are in two parts. During learning, the FWHM of the vertical gaussian used for weighting the inputs changed from 0.2 to 1.0.
4.3 Formation of Various Invariant Visual-Feature Filters. The environment, that is, the transformations that modify the inputs, determines what kinds of filters emerge in the learning process. In the previous section, where the input consisted of translated speech signals, the basis vectors developed into wavelet-like templates. In the following experiments, we show that different types of invariant visual-feature filters are learned when the input is translated horizontally, rotated, or scaled. Since the system model was the same throughout and the input samples in all of these experiments consisted of the same type of colored noise, the differences in the resulting filters are determined entirely by the characteristics of the transformation type. A total of T = 30,000 episodes, each consisting of five transformed sample patterns, were presented. The value of the dissipation constant $\delta$ was 0.0001.

If the inputs are translated randomly in the image plane, in both the horizontal and the vertical directions, basis vectors resembling two-dimensional Gabor functions will emerge (Kohonen, 1996). Here we show, for comparison, that if the translation is mainly restricted to one direction, say horizontal, then the filters become invariant to movements in that direction. Filters that emerged when the input was translated horizontally, and thereby weighted with a vertical gaussian to eliminate the asymmetries (see section 3.2), are shown in Figure 8. The basis vectors are sinusoidally modulated in the horizontal direction, and the two basis vectors in each unit have a 90-degree phase shift with respect to each other.
Figure 9: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of the rotation-invariant ASSOM. Notice that the diagrams are in two parts.
In an environment where differently rotated input patterns are present, sinusoidally modulated basis vectors again emerge. This time the modulation occurs in the azimuthal direction (see Figure 9). The two basis vectors $b_1^{(i)}$ and $b_2^{(i)}$ have a 90-degree phase shift with respect to each other, whereby the responses of the units are invariant to rotation of the inputs.

The ASSOM subspaces can readily represent invariance classes of linear transformations, such as rotation and translation. Nonlinear transformations can also be represented approximately linearly, the approximation being better the smaller the amount of transformation. Such linear approximations to a nonlinear operation emerge in the ASSOM when the inputs are scaled with respect to the center of the sampling grid. The basis vectors acquire a sinusoidal modulation in the radial direction outward from the scaling center (see Figure 10). The phases of the two basis vectors in each unit again have a 90-degree shift with respect to each other, whereby the responses of the units are invariant to the changing scale of the inputs.

4.4 Competition on Different Transformations. If the input patterns are subjected to several transformation types, each of them active during different episodes, the ASSOM units will also compete on these different transformations. Because of the neighborhood interactions during learning, neighboring units will become invariant to similar transformations, but areas devoted to different transformations will also emerge. To demonstrate the formation of areas specializing in different transformation types, we transformed the input patterns using three different transformation types, namely horizontal translation, rotation, and scaling, during different episodes.
Figure 10: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of the scale-invariant ASSOM. Notice that the diagrams are in two parts.
The transformations were defined precisely as in the previous section, except that no gaussian weighting was applied to the inputs. Vertical centering of the templates during horizontal translation was favored by varying the tilt of the pattern with respect to the sampling grid after each translation by a random amount (uniformly distributed between −0.5 and 0.5 radians). The ASSOM lattice consisted of 48 units, and the radius of the neighborhood kernel decreased from 14 units to 1 during learning. Otherwise the learning proceeded exactly as before.

After the learning process, three distinct areas can be discerned in the ASSOM (see Figure 11). The units in the first row have mainly become invariant to scaling, and the units in the last row mainly to rotation. The units in the middle of the ASSOM, especially units 25 to 30, are best described as being invariant to horizontal translations.

5 Discussion

In this article we demonstrated that invariant-feature filters, and areas specializing in different transformations, can emerge in the ASSOM as a result of Hebbian-type competitive learning. The learning rule was derived by minimizing an error function with Robbins-Monro stochastic approximation. The objective function describes the average expected error of a random sequence of input samples compared with the corresponding best "neural" representation (the subspace of the best-matching processing module). With the simple nonlinear modifications of the learning rule described in section 3.2, the invariant-feature filters were ordered robustly and reliably according to the similarity of the transformations.
Figure 11: Different areas of an ASSOM have become invariant to different transformations, when the input consisted of horizontally translated, rotated, and scaled patterns. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). Notice that the diagrams are in four parts.
The capability of the network to learn invariant features was demonstrated for both natural speech signals and broadband random images. Each processing module in the network can be made to become invariant to one transformation type and to decode a certain range of features invariantly under this transformation. Different competing modules specialize in representing different kinds of invariant features. The network can thus function as a learning preprocessing stage for invariant-feature extraction.

There were two basis vectors in the neural modules in our experiments.
The resulting two-dimensional subspaces then corresponded to those of basic wavelet filters. However, there is no restriction on the number of basis vectors in the algorithm, and subspaces of larger dimension can readily be used to represent more complex features.

An intriguing possibility, that the neurons in the ASSOM architecture might have biological counterparts, is suggested by the forms and operation of the filters in the input and output layers. The receptive fields of the simple cells in the mammalian primary visual cortex have been hypothesized to be describable by Gabor-type functions (Daugman, 1980, 1985; Marcelja, 1980; Jones & Palmer, 1987), and the ASSOM process also produces Gabor-type filters in the first-layer neurons. The output-layer neurons respond invariantly to moving and transforming targets and in this sense resemble the complex cells.

Acknowledgments

We thank Professor Erkki Oja for useful discussions.

References

Albert, A. E., & Gardner, L. A., Jr. (1967). Stochastic approximation and nonlinear regression. Cambridge, MA: MIT Press.
Buhmann, J., & Kühnel, H. (1993). Vector quantization with complexity costs. IEEE Trans. Information Theory 39:1133–1145.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Information Theory 36:961–1005.
Daugman, J. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Res. 20:847–856.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2:1160–1169.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comp. 3:194–200.
Gabor, D. (1946). Theory of communication. J. IEE 93:429–457.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Trans. Information Theory 25:373–380.
Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, pp. 4–29.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educat. Psych. 24:498–520.
Jones, J., & Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58:1233–1258.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern. 43:59–69.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial Neural Networks (Vol. 2, pp. 981–990). Amsterdam: North-Holland.
Kohonen, T. (1995a). The adaptive-subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. In F. Fogelman-Soulié & P. Gallinari (Eds.), Proc. ICANN'95, Int. Conf. on Artificial Neural Networks (Vol. 1, pp. 3–10). Paris: EC2 et Cie.
Kohonen, T. (1995b). Emergence of invariant-feature detectors in self-organization. In M. Palaniswami, Y. Attikiouzel, R. J. Marks II, D. Fogel, & T. Fukuda (Eds.), Computational Intelligence: A Dynamic System Perspective (pp. 17–31). New York: IEEE Press.
Kohonen, T. (1995c). Self-organizing maps. Berlin: Springer-Verlag.
Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace SOM. Biol. Cybern. 75:281–291.
Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer-Verlag.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Comp. 6:767–794.
Makhoul, J., Roucos, S., & Gish, H. (1985). Vector quantization in speech coding. Proc. IEEE 73:1551–1588.
Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70:1297–1300.
Oja, E. (1983). Subspace methods of pattern recognition. Letchworth, England: Research Studies Press.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks 5:927–935.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22:400–407.
Rubner, J., & Tavan, P. (1989). A self-organizing network for principal-component analysis. Europhys. Lett. 10:693–698.
TIMIT. (1988). CD-ROM prototype version of the DARPA TIMIT acoustic-phonetic speech database.
Wallis, G., Rolls, E., & Földiák, P. (1993). Learning invariant responses to the natural transformations of objects. In Proc. IJCNN'93, Int. Joint Conf. on Neural Networks, Nagoya (pp. 1087–1090). Piscataway, NJ: IEEE Service Center.
Wasan, M. T. (1969). Stochastic approximation. Cambridge: Cambridge University Press.
Received April 19, 1996; accepted November 1, 1996.
Communicated by Helge Ritter

Self-Organizing Map with Dynamical Node Splitting: Application to Handwritten Digit Recognition*

Sung-Bae Cho
Department of Computer Science, Yonsei University, Seoul, 120–749 Korea
This article presents a simple yet elegant pattern recognizer based on a dynamic node-splitting scheme for the self-organizing map that can adapt its structure as well as its weights. The scheme makes use of a structure-adaptation capability to place the nodes of prototype vectors into the pattern space accurately, so as to make the decision boundaries as close to the class boundaries as possible. In order to show the performance of the proposed scheme, experiments with the unconstrained handwritten digit database of Concordia University in Canada were conducted. The proposed method, an incremental formation of feature maps, achieves a recognition rate of 96.05 percent. In view of the elegant simplicity of the approach, the reported performance is remarkable and can stand comparison with the best results reported in the literature on the same database.

1 Introduction

A wide variety of methods have been proposed to realize a perfect recognizer of handwritten digits by computer. Many systems have been developed, but more work is required to match human performance (Suen, Nadal, Legault, Mai, & Lam, 1992). Recently the emerging technology of neural networks has been used to design pattern recognizers that approach human ability. Among several models, the multilayer perceptron has been recognized as a powerful tool for pattern classification problems. Its strength lies in its discriminative power and its capability of learning and representing implicit knowledge, but there are several difficulties in solving real-world problems. One of the shortcomings is how to determine the size and structure of the network. To overcome this difficulty, several approaches based on structure adaptation of the networks have been proposed (Koikkalainen & Oja, 1990; Sanger, 1991; Fritzke, 1994; Li, Tang, & Fang, 1995).[1]

* An abstracted version of this article was presented at the 13th International Conference on Pattern Recognition, Vienna, 1996.
[1] In addition, many papers related to this topic have been presented at several international conferences. Representative of them are Fritzke (1994), for a general approach to structure adaptation of neural networks; Koikkalainen and Oja (1990) and Sanger (1991), for typical tree-structured neural networks; and Li et al. (1995), for another tree-structured neural network applied to a pattern recognition problem.
Neural Computation 9, 1345–1355 (1997)
© 1997 Massachusetts Institute of Technology
In this article we propose an efficient pattern recognizer based on a dynamic node-splitting scheme for the self-organizing map (SOM) that can adjust its structure as well as its weights during learning. The network utilizes the structure-adaptation capability to place the nodes of prototype vectors into the pattern space accurately, so as to make the decision boundaries as close to the class boundaries as possible. This approach shapes the general structure-adaptive SOM into a pattern recognizer by splitting a node representing more than one class into a submap (composed of four nodes). It also allows an incremental formation of feature maps in the real-world problem of digit recognition. We will show how a network is properly constructed to solve the problem of handwritten digit recognition.

2 Preprocessing

2.1 The Database. The handwritten digit database of Concordia University of Canada consists of 6000 unconstrained digits originally collected from dead-letter envelopes by the U.S. Postal Service at different locations in the United States. The digits of this database were digitized in bilevel on a 64 × 224 grid of 0.153 mm square elements, giving a resolution of approximately 166 pixels per inch (Suen, Nadal, Mai, Legault, & Lam, 1990). Among the data, 4000 digits were used for training and 2000 digits for testing. The representative samples in Figure 1, taken from the database, show that many different writing styles are apparent, as well as digits of different sizes and stroke widths.

2.2 Feature Extraction. Digits, whether handwritten or typed, are essentially line drawings: one-dimensional structures in a two-dimensional space. Thus, local detection of line segments is an adequate method for extracting features. For each location in the image, information about the presence of a line segment at a given direction is stored in a feature map (Knerr, Personnaz, & Dreyfus, 1992). In this article, Kirsch masks have been used for extracting directional features (Kim & Lee, 1994). Kirsch defined a nonlinear edge enhancement algorithm as follows (Pratt, 1978):

$$G(i, j) = \max\left\{ 1, \max_{k=0}^{7} \left[ \left| 5S_k - 3T_k \right| \right] \right\},$$
where

$$S_k = A_k + A_{k+1} + A_{k+2}, \qquad T_k = A_{k+3} + A_{k+4} + A_{k+5} + A_{k+6} + A_{k+7}.$$

Here, $G(i, j)$ is the gradient of pixel $(i, j)$, the subscripts of $A$ are evaluated modulo 8, and the $A_k$ $(k = 0, 1, \ldots, 7)$ are the eight neighbors of pixel $(i, j)$, defined as shown in Figure 2.
Figure 1: Sample data. (a) Training. (b) Test.
Figure 2: Definition of the eight neighbors $A_k$ $(k = 0, 1, \ldots, 7)$ of pixel $(i, j)$.
Since the data set was prepared by thorough preprocessing, each digit is scaled to fit in a 16 × 16 bounding box such that the aspect ratio of the image is preserved. Then feature vectors for the vertical, horizontal, left-diagonal, and right-diagonal directions are obtained from the scaled image (Kim & Lee, 1994):

$$G_V(i, j) = \max(|5S_2 - 3T_2|, |5S_6 - 3T_6|)$$
$$G_H(i, j) = \max(|5S_0 - 3T_0|, |5S_4 - 3T_4|)$$
$$G_L(i, j) = \max(|5S_3 - 3T_3|, |5S_7 - 3T_7|)$$
$$G_R(i, j) = \max(|5S_1 - 3T_1|, |5S_5 - 3T_5|)$$

The final step in extracting the features compresses each 16 × 16 directional vector into a 4 × 4 one with an averaging operator, which produces one value for each 2 × 2 block of pixels by summing the four values and dividing the sum by four. In addition to the directional features, 4 × 4 compressed images have been used as global features. As a result, the final features consist of five 4 × 4 maps: four 4 × 4 local directional features and one 4 × 4 global feature.
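A sketch of this feature extraction in Python follows; the zero padding at the image border and the single-stage 4 × 4-block average pooling (the 2 × 2 averaging described above, applied twice, yields the same 4 × 4 result) are our assumptions.

```python
import numpy as np

def kirsch_features(img16):
    """Directional feature maps from a 16x16 digit image (section 2.2 sketch).

    Returns four 4x4 maps (vertical, horizontal, left-/right-diagonal)
    plus the 4x4 compressed global image.
    """
    H, W = img16.shape
    G = {d: np.zeros((H, W)) for d in "VHLR"}
    # Offsets of the eight neighbors A_0..A_7, clockwise from the upper left
    # (see Figure 2).
    nbr = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    pad = np.pad(img16, 1)                     # zero padding: our assumption
    for i in range(H):
        for j in range(W):
            A = [pad[i+1+di, j+1+dj] for di, dj in nbr]
            g = [abs(5 * sum(A[(k+m) % 8] for m in range(3))
                     - 3 * sum(A[(k+m) % 8] for m in range(3, 8)))
                 for k in range(8)]
            G["V"][i, j] = max(g[2], g[6])     # G_V = max(|5S2-3T2|,|5S6-3T6|)
            G["H"][i, j] = max(g[0], g[4])
            G["L"][i, j] = max(g[3], g[7])
            G["R"][i, j] = max(g[1], g[5])
    # Average-pool each 16x16 map (and the raw image) down to 4x4.
    pool = lambda m: m.reshape(4, 4, 4, 4).mean(axis=(1, 3))
    return [pool(G[d]) for d in "VHLR"] + [pool(img16.astype(float))]
```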
3 The Method

In this section, we present the dynamic node-splitting scheme, which can simultaneously determine a suitable number of nodes and the connection weights between input and output nodes in a SOM. The basic idea is simple:

1. Start with a basic neural network (in our case, a 4 × 4 map in which each node is fully connected to all input nodes).

2. Train the network with Kohonen's algorithm (Kohonen, 1990).

3. Calibrate the network using known input-output patterns to determine which nodes should be replaced with a submap of several nodes (in our case, a 2 × 2 map) and which nodes should be deleted.

4. Unless every node represents a unique class, go back to step 2.

Note that step 3 adjusts the structure of the map so that each node represents a unique label for the classification. In our scheme, the weights of a new node are initialized by interpolating the weights of neighboring nodes.

3.1 Basic Structure. The structure of the network is quite similar to Kohonen's SOM, except for the irregular connectivity in the map. Figure 3 shows an instance of the network in which each node represents a unique class. Each node is connected to all the input nodes with corresponding weights. (This is in fact the final network structure obtained for recognizing the handwritten digits in the simulation.) The initial map of the network consists of 4 × 4 nodes.

The weight vector of node $i$ is denoted by $w_i \in \mathbb{R}^n$. The simplest analytical measure is the Euclidean distance between $x$ and $w_i$. The minimum distance defines the winner $w_c$. If we define a neighborhood set $N_c$ around node $c$, at each learning step all the nodes within $N_c$ are updated, whereas nodes outside $N_c$ are left intact. This neighborhood is centered around the node for which the best match with input $x$ is found:

$$\|x - w_c\| = \min_i \{\|x - w_i\|\}.$$
Figure 3: The proposed neural network. The number in each circle is the class that the corresponding node represents.
The width or radius of $N_c$ can be time variable. For good global ordering, it is advantageous to let $N_c$ be very wide in the beginning and shrink monotonically with time (Kohonen, 1990). The updating process may read

$$w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,[x(t) - w_i(t)] & \text{if } i \in N_c(t), \\ w_i(t) & \text{if } i \notin N_c(t), \end{cases}$$

where $\alpha(t)$ is a learning rate, $0 < \alpha(t) < 1$.
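As an illustration, one such update step might be written as follows; the explicit lattice-coordinate array for the (eventually irregular) map and the 80-dimensional flattened feature vector are our assumptions.

```python
import numpy as np

def som_step(W, pos, x, alpha, radius):
    """One Kohonen update (section 3.1 sketch).

    W      : (M, n) array of weight vectors w_i.
    pos    : (M, 2) array of node coordinates on the map lattice
             (an assumption; the paper's map becomes irregular after splits).
    x      : input vector (here, the five 4x4 feature maps flattened to 80-d).
    alpha  : learning rate alpha(t), 0 < alpha < 1.
    radius : current radius of the neighborhood set N_c.
    """
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))     # winner w_c
    in_Nc = np.linalg.norm(pos - pos[c], axis=1) <= radius
    W[in_Nc] += alpha * (x - W[in_Nc])                    # update only N_c
    return W, c
```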
3.2 Splitting Nodes. After a constant number of adaptation steps, a node representing more than one class is replaced with several nodes. (We used a submap of 2 × 2 nodes.) Obviously such a node lies in a region of the input vector space where many misclassifications occur. If input patterns from different classes are covered by the same local node and activate this node to about the same degree, their vectors of local node activations may be nearly identical. Figure 4 shows how the network structure changes as some nodes representing duplicated classes are replaced by several nodes with finer resolution.
Figure 4: Map configurations changed through learning. (a) Initial status. (b) Intermediate status. (c) Final status.
3.3 Removing Nodes. The previous section showed how to extend the network structure. A necessary consequence is that all the nodes are connected directly or indirectly to each other. However, a problem may occur if the pattern space we try to discriminate has some disconnected regions. A solution can be found by introducing the deletion of nodes from the structure. An obvious criterion for a node to be deleted is that it lies in an area of $\mathbb{R}^n$ where the probability density is zero (Fritzke, 1994). For this purpose, nodes that have been inactive longer than a specified length of time can be deleted. In the previous example, only one node is deleted in the final map (see Figure 4c).
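Putting sections 3.2 and 3.3 together, a calibration pass might look as follows. This is a minimal sketch under our own assumptions: the node dictionary layout, the idle counter, and the jittered child weights are illustrative (the paper initializes new weights by interpolating the weights of neighboring nodes).

```python
import numpy as np

def calibrate_and_adapt(nodes, max_idle):
    """One calibration pass (sections 3.2-3.3 sketch; illustrative only).

    nodes : list of dicts with keys 'w' (weight vector), 'pos' (lattice
            coordinates), 'classes' (set of classes won during calibration),
            and 'idle' (steps since the node last won).
    Returns the adapted node list and whether every node is now pure.
    """
    adapted, all_pure = [], True
    for node in nodes:
        if node['idle'] > max_idle:        # delete long-inactive nodes (3.3)
            continue
        if len(node['classes']) <= 1:      # node already represents a
            adapted.append(node)           # unique class: keep it
            continue
        all_pure = False                   # impure node: replace it by a
        for di in (0.0, 0.5):              # 2x2 submap (3.2); jittered
            for dj in (0.0, 0.5):          # child weights stand in for the
                child = {                  # paper's interpolation scheme
                    'w': node['w'] + 0.01 * np.random.randn(*node['w'].shape),
                    'pos': node['pos'] + np.array([di, dj]),
                    'classes': set(), 'idle': 0}
                adapted.append(child)
    return adapted, all_pure
```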
Table 1: Comparisons of the Proposed Method with Alternative Methods (%).

Methods                          Correct   Error   Reject   Training   Testing   Pixels per Inch
Lam & Suen, 1988                  93.10     2.95     3.95      4000      2000     166
Nadal & Suen, 1988                86.05     2.25    11.70      4000      2000     166
Legault & Suen, 1989              93.90     1.60     4.50      4000      2000     166
Krzyzak, Dai, & Suen, 1990        86.40     1.00    12.60      4000      2000     166
Krzyzak, Dai, & Suen, 1990        94.85     5.15     0.00      4000      2000     166
Mai & Suen, 1990                  92.95     2.15     4.90      4000      2000     166
Suen et al., 1990                 93.05     0.00     6.95      4000      2000     166
Le Cun et al., 1990               96.40     3.40     0.20      7291      2007     300
Martin & Pittman, 1990            96.00     4.00     0.00     35200      4000     300
Cohen, Hull, & Srihari, 1991      95.54     4.46     0.00        --      2711      --
Cohen et al., 1991                97.10     2.90     0.00        --      1762      --
Knerr et al., 1992                90.30     9.70     0.00      7200      1800      --
Lemarie, 1993                     97.97     2.03     0.00      8783      7394      --
Kim & Lee, 1994                   95.40     4.60     0.00      4000      2000     166
Kim & Lee, 1994                   95.85     4.15     0.00      4000      2000     166
Proposed method                   96.05     3.95     0.00      4000      2000     166

Note: Dashes indicate values not reported.
4 Experimental Results

After training the proposed network with 4000 handwritten digits of the database, the recognition rate on the 2000 test data was investigated. Table 1 shows the performance of the proposed method together with the results reported for several alternative methods. It also provides information about the size of the data sets used for training and testing, along with the scanning resolution in pixels per inch. Although some of the previous methods have achieved better recognition results, they used a highly tuned architecture or a much larger training data set than used here. The error rate of the proposed method is 3.95 percent; in view of the elegant simplicity of the approach, this performance is remarkable and can stand comparison with the best results reported in the literature.

Previous work (Le Cun et al., 1990) showed that good generalization can be obtained only by designing a network architecture that contains a certain amount of a priori knowledge about the problem. The basic design principle is to minimize the number of free parameters that must be determined by the learning algorithm, without overly reducing the computational power of the network. This principle increases the probability of correct generalization because it results in a specialized network architecture that has reduced entropy. The proposed method of dynamic node splitting provides automatic determination of an architecture that fits the problem at hand.

A thorough analysis of the recognition results reveals that most of the confusion makes sense. For example, class 0 has three instances of
misclassification (2, 6, and 8), all of which are neighbors of the correct node in the map produced by the proposed scheme in the simulation (see Figures 4c and 5). Figure 5 shows in detail how the topological orderings of the pattern classes are preserved in the final map. As can be seen, the nodes for the same or similar classes form clusters that are topologically close to each other. For example, class 9 consists of two clusters: one close to class 4 and the other close to class 7. There is strong evidence that the recognizer made by the proposed neural network preserves the topological ordering of the input patterns of the handwritten digits. This is a relative advantage over the conventional tree-structured approach to node splitting.

In order to improve the performance, we are attempting to incorporate the k-nearest-neighbor rule into the decision of the final class. Unfortunately, the results so far are not exceptionally reliable. Work in progress seeks to improve reliability by introducing rejection criteria into the decision process.

5 Concluding Remarks

This article proposes an elegant self-organizing neural network that can solve complex classification problems. The key advantage of the network is that, through its ability of structure adaptation, it automatically finds a network structure and size suitable for the classification of complex patterns. Experimental results with handwritten digits have revealed that the proposed network efficiently and accurately classifies real patterns with high variations.

Handwritten digit recognition is not a new subject, and many good approaches have been published. The overwhelming majority of these approaches, however, use classical pattern recognition methods or backpropagation networks. This article demonstrates that very competitive recognition results can be obtained with the self-organizing map if it is suitably combined with a dynamic node-splitting scheme. Research is continuing with more difficult tasks, such as handwritten Hangul (Korean script) recognition.

Acknowledgments

This work was supported in part by the offline character recognition consortium at KAIST (Korea Advanced Institute of Science and Technology). I thank Ari Milgram for his work on drafts of this article.

References

Cohen, E., Hull, J. J., & Srihari, S. N. (1991). Understanding handwritten text in a structured environment: Determining ZIP codes from addresses. Int. J. Pattern Recognition and Artificial Intelligence 5 (1&2):221–264.
Figure 5: Maps preserving the topological orderings in the pattern classes.
Fritzke, B. (1994). Growing cell structures: A self-organizing network for unsupervised and supervised learning. Neural Networks 7 (9):1441–1460.
Kim, Y. J., & Lee, S. W. (1994). Off-line recognition of unconstrained handwritten digits using multilayer backpropagation neural network combined with genetic algorithm. In Proc. 6th Workshop on Image Processing and Understanding (pp. 186–193). (In Korean.)
Knerr, S., Personnaz, L., & Dreyfus, G. (1992). Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. on Neural Networks 3 (6):962–968.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE 78 (9):1464–1480.
Koikkalainen, P., & Oja, E. (1990). Self-organizing hierarchical feature maps. In Proc. Int. Joint Conf. Neural Networks (pp. 279–284). San Diego, CA.
Krzyzak, A., Dai, W., & Suen, C. Y. (1990). Unconstrained handwritten character classification using modified backpropagation model. In Proc. 1st Int. Workshop on Frontiers in Handwriting Recognition (pp. 155–166). Montreal, Canada.
Lam, L., & Suen, C. Y. (1988). Structural classification and relaxation matching of totally unconstrained handwritten ZIP-code numbers. Pattern Recognition 21 (1):19–31.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 2:396–404.
Legault, R., & Suen, C. Y. (1989). Contour tracing and parametric approximations for digitized patterns. In A. Krzyzak, T. Kasvand, & C. Y. Suen (Eds.), Computer Vision and Shape Recognition (pp. 225–240). Singapore: World Scientific Publishing.
Lemarie, B. (1993). Practical implementation of a radial basis function network for handwritten digit recognition. In Proc. 2nd Int. Conf. Document Analysis and Recognition (pp. 412–415). Tsukuba, Japan.
Li, T., Tang, Y. Y., & Fang, L. Y. (1995). A structure-parameter-adaptive (SPA) neural tree for the recognition of large character set. Pattern Recognition 28 (3):315–329.
Mai, T., & Suen, C. Y. (1990). A generalized knowledge-based system for the recognition of unconstrained handwritten numerals. IEEE Trans. on Systems, Man and Cybernetics 20 (4):835–848.
Martin, G. L., & Pittman, J. A. (1990). Recognizing hand-printed letters and digits. Advances in Neural Information Processing Systems 2:405–414.
Nadal, C., & Suen, C. Y. (1988). Recognition of totally unconstrained handwritten digit by decomposition and vectorization (Tech. Rep.). Montreal, Canada: Concordia University.
Pratt, W. K. (1978). Digital image processing. New York: Wiley.
Sanger, T. D. (1991). A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Trans. on Neural Networks 2 (2):285–293.
Suen, C. Y., Nadal, C., Legault, R., Mai, T. A., & Lam, L. (1992). Computer recognition of unconstrained handwritten numerals. Proceedings of the IEEE 80 (7):1162–1180.
Suen, C. Y., Nadal, C., Mai, T., Legault, R., & Lam, L. (1990). Recognition of handwritten numerals based on the concept of multiple experts. In Proc. 1st Int. Workshop on Frontiers in Handwriting Recognition (pp. 131–144). Montreal, Canada.
Received March 13, 1996; accepted October 1, 1996.
Communicated by Kim Plunkett

Information Controller to Maximize and Minimize Information

Ryotaro Kamimura
Information Science Laboratory, Tokai University, Kanagawa 259-12, Japan
We propose an information controller to maximize and simultaneously minimize information in neural networks. The controller can be used to improve generalization performance and to interpret internal representations explicitly. The apparently contradictory operation of simultaneous maximization and minimization is made possible by assuming two kinds of information: collective information and individual information, representing the collective activities of hidden units and the activities of individual input-hidden connections, respectively. By maximizing the collective information and minimizing the individual information, networks that are simple in terms of the number of connections and the number of hidden units can be generated. This method was applied to the inference of the maximum onset principle and the sonority principle of artificial languages. In these problems, it was shown that the individual information could be sufficiently minimized while the collective information was simultaneously maximized. Experimental results confirmed improved generalization performance through the suppression of overtraining. In addition, the internal representations created by information maximization and minimization could easily be interpreted.

1 Introduction

An increasing number of studies have been conducted to formulate neural learning from the information-theoretic point of view (e.g., Atick, 1992; Barlow, 1989; Barlow, Kaushal, & Mitchison, 1989; Bridle, MacKay, & Heading, 1992; Linsker, 1988, 1989a, 1989b). Recently, information-theoretic supervised learning methods have been developed to control the internal representations created by neural networks (Deco, Finnoff, & Zimmermann, 1993, 1995; Kamimura, 1992, 1994; Kamimura, Takagi, & Nakanishi, 1995; Kamimura & Nakanishi, 1995). In these methods, information is defined by the activity of the hidden units. Thus, the methods aim to control hidden unit activity patterns in an optimal manner. Depending on the type of problem, information is maximized or minimized. Information maximization methods have been used to interpret internal representations explicitly and simultaneously to reduce the number of necessary hidden units (Kamimura & Nakanishi, 1995). On the other hand, information minimization methods

Neural Computation 9, 1357–1380 (1997)
© 1997 Massachusetts Institute of Technology
have been used especially to improve generalization performance (Deco et al., 1995; Kamimura et al., 1995) and to speed up learning (Akiyama & Furuya, 1991). Thus, if it is possible to maximize and minimize information simultaneously, the information-theoretic methods can be expected to apply to a wide range of problems.

In this article, we unify the information maximization and minimization methods into one framework, with the aim of improving generalization performance and interpreting internal representations explicitly. However, it is apparently impossible to maximize and minimize simultaneously the information defined by the hidden unit activity. Our goal is instead to maximize and minimize information at two different levels: collective and individual. This means that information is maximized in a collective way and minimized for individual input-hidden connections. The seemingly contradictory proposition of simultaneous information maximization and minimization can be overcome by assuming the existence of these two levels of information control.

The collective information is defined by using the hidden unit activity, which has been used extensively in past information-theoretic approaches (Deco et al., 1995; Kamimura & Nakanishi, 1995). If this information is maximized, only one hidden unit is strongly activated, while all the other hidden units are approximately off. This effect is related to the reduction of network size in terms of hidden units and to the explicit interpretation of internal representations. On the other hand, the individual information is defined by the strength of each input-hidden connection. If the individual information is minimized, the input-hidden connections are pushed close to zero. This means that individual information minimization has the same effect as weight decay and weight elimination. Thus, individual information minimization can be expected to yield better generalization performance.

The information is controlled by an information controller located outside the neural network and used exclusively to control the information. By assuming the information controller, we can see clearly how the information, appropriately defined, can be maximized or minimized. In addition, the actual implementation of the information method is much easier with the concept of the information controller.

The collective and individual information are defined by using the concept of mutual information between input patterns and hidden units, as formulated in Deco et al. (1993, 1995). The introduction of the concept of mutual information makes it possible to interpret the meaning of the information itself and to see the collective and individual information from a more unified point of view. For example, the collective information can be considered the information contained in the hidden units; from this information, we can infer which input patterns are given to the input units.

In this study, we applied our methods to the acquisition of rules of an artificial language that are complex enough to test the performance of the methods. Experimental results confirmed that the collective information can
be maximized and the individual information simultaneously minimized. By maximizing the information, the number of necessary hidden units could be reduced, and the remaining hidden units could be explicitly interpreted. In addition, by minimizing the individual information, almost all the input-hidden connections were pushed toward zero, leading the networks to better generalization performance. The generalization performance was well superior to that of several traditional methods used to improve it.

This article is organized as follows. Section 2 outlines the actual meaning of the information maximization and minimization methods and how to unify them. In section 3, after briefly introducing the concept of mutual information, we formulate the learning that maximizes and minimizes information. In section 4, we apply the information controller to a problem of syllabification: the detection of the maximum onset principle. It will be shown that information maximization and minimization can generate explicit internal representations. In section 5, we apply the method to the inference of the maximum onset principle with the sonority principle. It will be shown that generalization performance can be improved significantly by information maximization and minimization.

2 Functions of the Information Controller

The goal of this study is to unify information maximization and minimization so as to improve generalization performance and interpret internal representations explicitly. This section outlines the functions of the information controller and explains how the controller achieves information maximization and minimization.

2.1 Information Maximization and Minimization. To explain the functions of the information controller, it is necessary to understand the basic properties of information maximization and minimization. When the information defined by the hidden units is maximized, only one hidden unit is strongly activated, while all the other hidden units are inactive. As shown in Figure 1b, multiple strongly negative input-hidden connections (the thick dotted lines) are generated by the process of information maximization. In this case, it is easy to interpret the roles of the hidden units and the meaning of the internal representations. However, improved generalization performance cannot be expected, especially for complex problems, because of the strongly negative connections and the overly explicit responses to input patterns.

Second, when the information defined for the hidden units is minimized, all the hidden units are equally activated. One fundamental characteristic of information minimization is that the majority of input-hidden connections tend to be close to zero, as shown in Figure 1a. This means that information minimization performs the same operation as weight decay and weight elimination. Thus, information minimization has been used especially to
1360
Ryotaro Kamimura
improve generalization performance. However, it is difficult to interpret internal representations because all the hidden units are equally activated. 2.2 Collective and Individual Information. Information maximization and minimization methods have been developed independently. However, if the information is maximized and simultaneously minimized, interpretation and generalization performance can be improved. A statement that the information is maximized and simultaneously minimized is actually a contradictory one. To dissolve the contradiction, it is necessary to understand the actual meaning of information minimization. Information minimization is approximately equivalent to weight decay and weight elimination. This means that as weights are pushed toward zero, information can be decreased. Thus, the contradictory statement can be restated as follows: If information for hidden units is maximized and simultaneously inputhidden connections can be decayed, we have the same effect of information maximization and information minimization. One of the possible ways to implement the idea of the unification is to introduce weight decay or weight elimination term in the framework of information maximization. Instead of using simple weight decay (Ishikawa & Yamamoto, 1994; Kroph & Hertz, 1991; Weigend, Rumelhart, & Huberman, 1992), we define individual information for each input-hidden connection in the framework of mutual information between input units and hidden units and minimize this individual information. In the formulation of the individual information, each input-hidden connection is treated separately and independently, and it is examined whether the corresponding hidden unit is fired or not by the input-hidden connection. If this information is minimized, each input unit behaves independent of the corresponding hidden unit. This means that the input unit is not related to the hidden unit. In addition, the probability of the firing of a hidden unit, given the firing of an input unit, can be approximated by the strength of the corresponding inputhidden connection. Thus, the actual meaning of the individual information minimization is the weight decay or the weight elimination. On the other hand, our previous definition of the information (Kamimura & Nakanishi, 1995) is called collective information, because all the inputhidden connections are treated collectively. This information is defined in the framework of the mutual information between input patterns and hidden units. Since the individual information is also defined in the framework of the mutual information, information maximization and minimization can be formulated in a more unified way. 2.3 Information Controller. Two kinds of information, collective information and individual information, are controlled by using an information controller. The information controller is devised to interpret the mechanism of information maximization and minimization more explicitly. As can be seen in Figure 1, the information controller is composed of two subcom-
Information Controller
1361
ponents: an individual information minimizer and a collective information maximizer. The collective information maximizer is used to increase the collective information as much as possible (see Figure 1b). The individual information minimizer is used to decrease the individual information as much as possible. By this minimization, the majority of connections are pushed toward zero. Eventually all the hidden units tend to be intermediately activated, as shown in Figure 1a. Figure 1c shows a case where the collective information maximizer and the individual information maximizer are simultaneously applied. In the figure, a hidden unit activity pattern is a pattern of the maximum information in which only one hidden unit is on, while all the other hidden units are off. However, the multiple strongly negative connections shown in Figure 1b to produce a maximum information state are replaced by extremely weak input-hidden connections, represented by thin dotted lines. Strongly negative connections are inhibited by the individual information minimization. This means that by the information controller, the information can be maximized and at the same time one of the most important properties of the information minimization, weight decay or weight elimination, can approximately be realized. Consequently, the information controller can generate much-simplified networks in terms of hidden units and input-hidden connections. 3 Formulation by Mutual Information We briefly introduce a concept of mutual and conditional mutual information. Then, by using the mutual information, we formulate an information controller that maximizes collective information and minimizes individual information. 3.1 Mutual and Conditional Mutual Information. We consider an information channel and define mutual and conditional mutual information used to define the collective and individual information (Abramson, 1963; Deco et al., 1995). Let us conceptually see a network behavior in terms of mutual information to interpret intuitively the mechanism of the mutual information. Consider a situation in which a probability of the occurrence of hidden units is determined by the occurrence of input patterns. Let B denote a set of hidden units B = {b1 , . . . , bM } considered to be a random variable. A probability of the occurrence of the jth hidden unit bj is supposed to be given by a probability p(bj ), and p(bj | as ) is a conditional probability, given the sth input pattern of a set of input patterns A = {a1 , . . . , aS }. Average uncertainty of hidden units B and input patterns A is represented by H(B) and H(A), respectively. The conditional uncertainty of B, given A, is represented in H(B | A). Then the mutual information between hidden units B and input
1362
Ryotaro Kamimura
Maximum Information
Minimum Information
(a)
(b)
Individual Information Minimizer
Information Controller
Collective Information Maximizer
Input-Hidden Connections Strongly Positive Strongly Negative Weak Controlled by Collective Information Maximizer (c)
Controlled by Individual Information Minimizer
Figure 1: An information controller used to maximize and simultaneously minimize information. The information controller is composed of an individual information minimizer and a collective information maximizer. The collective information maximizer strongly activates a few hidden units, and the individual information minimizer pushes the majority of connections toward zero. Thus, simplified networks can be obtained in terms of the number of hidden units and the number of connections.
patterns A can be defined by I(B; A) = H(B) − H(B | A) X =− p(bj ) log p(bj ) B
+
XX A
=
XX A
=
X A
p(as )p(bj | as ) log p(bj | as )
B
p(as )p(bj | as ) log
B
p(as )I(B; as ),
p(bj | as ) , p(bj ) (3.1)
Information Controller
1363
where I(B; as ) =
X B
=−
p(bj | as ) log
X
p(bj | as ) p(bj )
p(bj | as ) log p(bj ) +
B
X
p(bj | as ) log p(bj | as )
(3.2)
B
is the conditional mutual information, given the sth input pattern. This mutual information means the decrease of the uncertainty of the hidden units by knowing the occurrence of input patterns. Since the role of B and A can be interchangeable, that is, I(B; A) = I(A; B), the mutual information can be rewritten as I(A; B) = H(A) − H(A | B),
(3.3)
where H(A) denotes the uncertainty of input patterns and H(A | B) is the conditional uncertainty of the input patterns, given hidden units. This means that by knowing the occurrence of hidden units, the uncertainty about which input pattern is actually given is diminished. 3.2 Neural Networks to Be Controlled. Neural networks to be controlled are composed of input, hidden, and output units with the bias, as shown in Figure 2. The jth hidden unit receives the net input from input units and simultaneously from a collective information maximizer: ujs = xj +
L X
wjk ξks ,
(3.4)
k=0
where xj is an information maximizer from the collective information maximizer to the jth hidden unit, L is the number of input units, wjk is a connection from the kth input unit to the jth hidden unit, and ξks is the kth element of the sth input pattern. The jth hidden unit produces activity or output by a sigmoidal activation function vjs = f (ujs ) =
1 . 1 + exp(−ujs )
(3.5)
A net input into the ith output unit is hsi =
M X j=0
Wij vjs ,
(3.6)
1364
Ryotaro Kamimura
Bias
Bias- Hidden Connections
Bias-Output Connections Bias
Input Unit
w j0
Hidden Unit
Wi0 W ij
s
ξk Input- Hidden Connections
s
wjk
v sj
Individual Information Minimizer
Output Unit
HiddenOutput Connections
Oi
ζ
s
i
Target
x j Information Maximizers
Collective Information Maximizer
Information Controller Figure 2: A network architecture equipped with an information controller.
where M is the number of hidden units and Wij is a connection from the jth hidden unit to the ith output unit. An output from the ith output unit is Osi = f (hsi ).
(3.7)
For measuring errors between targets and outputs, we use a cross-entropy function (Kato, Tan, & Ejima, 1990; Solla, Levin, & Fleisher, 1988; Tan, Kato, & Ejima, 1990):
G=
" S N ½ X X
ζis log
s=1
i=1
ζis 1 − ζis + (1 − ζis ) log s Oi 1 − Osi
¾# ,
(3.8)
where Osi is an output from the ith output unit and ζis is a target for the ith output unit, N is the number of output units, and S is the number of input patterns. Rules for updating are obtained by differentiating the cross-
Information Controller
1365
entropy function with respect to hidden-output connections Wij −η
S X ∂G =η (ζis − Osi )vjs , ∂Wij s=1
(3.9)
where η is a learning parameter. 3.3 Collective Information Maximizer. A collective information maximizer is used to maximize the information contained in hidden units. For this purpose, we define the collective information in terms of the mutual information. Let us approximate a probability p(bj | as ) of the occurrence of the jth hidden unit, given the sth input unit, by a normalized output pjs of the jth hidden unit computed by pjs
vjs
= PM
s m=1 vm
,
(3.10)
where the summation is over all the hidden units. It is reasonable to suppose that at the initial stage, all the hidden units are activated randomly or uniformly and all the input patterns are also randomly given to networks. Thus, the probability p(bj ) of the activation of hidden units at the initial stage is equiprobable, that is, 1/M. The probability p(as ) of input patterns is also supposed to be equiprobable, namely, 1/S. Thus, the mutual information in equation 3.1 is rewritten as I(B; A) = −
M M S X X 1 1X 1 log + pjs log pjs M M S s=1 j=1 j=1
= log M +
M S X 1X ps log pjs , S s=1 j=1 j
(3.11)
where log M is a maximum entropy. Information maximizers are updated for increasing collective information. For obtaining update rules, we should differentiate the information function with respect to information maximizers xj , ∂I(B; A) ∂xj à ! S M X X s s s =β pm log pm pjs (1 − vjs ), log pj −
1xj = βS
s=1
where β is a parameter.
m=1
(3.12)
1366
Ryotaro Kamimura
Input Unit (C) k p
jk
Hidden Unit (D) j Fires
Fires
1-p
jk
Does not fire
Figure 3: An interpretation of an input-hidden connection for defining individual information.
3.4 Individual Information Minimizer. We should formulate an individual information minimizer, one of the major components of the information controller. Individual information represents information about the interrelation between input and hidden units. For representing this individual information by mutual information, we consider an output pjk from the jth hidden unit only with a connection from the kth input unit to the jth output unit, pjk = f (wjk ),
(3.13)
which is supposed to be a probability of the firing of the jth hidden unit, given the firing of the kth input unit, as shown in Figure 3. Since this probability is considered a probability, given the firing of the kth input unit, conditional mutual information is appropriate for measuring the information. In addition, it is reasonable to suppose that the probability of the firing of the jth hidden unit is one-half at the initial stage of the learning, because we have no knowledge on input units. Thus, the conditional mutual information for a pair of the kth input unit and the jth hidden unit is formulated as ¶ µ ¢ 1 ¡ 1 Ijk (D; f ires) = −pjk log − 1 − pjk log 1 − 2 2 + pjk log pjk + (1 − pjk ) log(1 − pjk ) = log 2 + pjk log pjk + (1 − pjk ) log(1 − pjk ).
(3.14)
If input-hidden connections are close to zero, this function is close to a minimum, a minimum information, meaning that it is impossible to estimate the firing of the kth hidden unit. If connections are larger, information is larger,
Information Controller
1367
and correlation between input and hidden units is stronger. Total individual information is the sum of all the individual individual information, I(D; f ires) =
L M X X
Ijk (D; fires),
(3.15)
j=1 k=0
because each connection is treated separately or independently. The individual information minimization directly controls input-hidden connections. By differentiating the individual information function and the cross-entropy cost function with respect to input-hidden connections wjk , we have rules for updating concerning input-hidden connections, ∂I(D; fires) ∂G −η ∂wjk ∂wjk = −µ wjk pjk (1 − pjk ) "( ) # S N X X s s s s s (ζi − Oi )Wij vj (1 − vj )ξk , +η
1wjk = −µ
s=1
(3.16)
i=1
where η and µ are parameters. Thus, rules for updating with respect to inputhidden connections are closely related to the weight decay method. Clearly, the individual information minimization corresponds to diminishing the strength of input-hidden connections. 4 Detection of Maximum Onset Principle In this section, we apply an information controller to the inference of the maximum onset principle, that is, a rule for the syllabification in an artificial language. It is shown that the individual information can be increased and simultaneously the collective information can be increased significantly. By individual information minimization and collective information maximization, the number of hidden units and the number of input-hidden connections can significantly be reduced. Finally, simplified or kernel networks can be obtained, whose internal representations show explicitly a mechanism of inference of the maximum onset principle. 4.1 Data and Network Architecture. The segmentation of words into syllables is sometimes very difficult. The maximum onset principle suggests one possibility of explicit syllabification (Nema, 1995). A syllable is composed of an onset and a rhyme. The maximum onset principle states that the number of consonants in the onset should be as large as possible under the condition that a consonant cluster is permitted or well formed at the beginning of words. Table 1 shows rules for the syllabification of an artificial language in the form of V1 C1 C2 V2 , where V and C are a vowel and consonant, respec-
1368
Ryotaro Kamimura
Table 1: Rules for Syllabification of an Artificial Language
Rule Number 1 2 3 4 5 6 7 8
Beginning of Words
Accent V1
V2
Well formed V´ 1 (short) V2 Well formed V´ 1 (long) V2 Well formed V1 V´ 2 V´ 2 Well formed V´ 1 V2 Ill formed V´ 1 V´ 2 Ill formed V1 V´ 2 Ill formed V´ 1 Well and ill formed V1 V2
Segmentation Types
Examples
(V1 C1 )(C2 V2 ) p´et.rol ´ (V1 )(C1 C2 V2 ) A.pril (V1 )(C1 C2 V2 ) p´a.trol (V1 )(C1 C2 V2 ) m´i.crob ` (V1 C1 )(C2 V2 ) f´ac.tory (V1 C1 )(C2 V2 ) um.p´ire (V1 C1 )(C2 V2 ) bon.b ´ on ´ Impossible No examples
tively. The maximum onset principle states that if a consonant cluster C1 C2 is permitted at the beginning of words, a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ), that is, the number of consonants in onsets should be maximum. Rules 2 through 7 show cases where the maximum onset principle can directly be applied. For example, rule 3 shows that if a consonant cluster C1 C2 is possible at the beginning of words and the second vowel has an accent, then a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). Rule 5 shows that if a consonant cluster C1 C2 is not possible at the beginning of words and the first vowel has an accent, then a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). One possible exceptional case for the maximum onset principle is that if the first vowel is short, the segmentation has a form of (V1 C1 )(C2 V2 ). Rule 1 shows that if a consonant cluster C1 C2 is possible at the beginning of words, and the first vowel V1 has an accent, that is, V´ 1 , and the vowel is a short vowel, a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). For example, the word p´etrol should be decomposed into (p´et)(rol), because the vowel is short. Finally, rule 8 shows that at least one vowel should have an accent. In our language system, there are no words without accents. Figure 4 shows a network architecture used in experiments. There were 4 input units and 2 output units. The number of hidden units was increased from 10 to 20. In Figure 4, we can see that if the first vowel has an accent and is a long or a diphthong vowel, then V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). 4.2 Collective Information Maximization and Individual Information Minization. In this section, it is shown that individual information minimization is not contradictory to collective information maximization. This means that the individual information can be minimized and simultaneously the collective information can significantly be increased.
Information Controller
1369
(a) Bias
Bias
Accent(V1 ) Long/Short(V1 ) Consonant (C C ) 1 2 Cluster Accent(V2 )
(V1 )(C1 C2 V2 ) Output Units
Input Units
Hidden Units
(b)
Figure 4: Network architectures for rule 2 of Table 1. If the first vowel has an accent and is a long or diphthong vowel, then V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ).
1370
Ryotaro Kamimura
1.4
β=0.000 β=0.001 β=0.003 β=0.007 β=0.015
β=0.000 β=0.001 β=0.003 β=0.007 β=0.015
0.45 0.4
1
Individual Information
Collective Information
1.2
0.8 0.6 0.4
0.35 0.3 0.25 0.2 0.15
0.2
0.1 0 0
3e-05
µ
6e-05
9e-05
0
3e-05
(a)
6e-05
9e-05
(b) 0.3
β=0.00 β=0.01 β=0.02 β=0.03 β=0.04
1.4
β=0.00 β=0.01 β=0.02 β=0.03 β=0.04
0.25
Individual Information
1.2
Collective Information
µ
1 0.8 0.6 0.4
0.2
0.15
0.1 0.2 0.05
0 0
1e-05
µ
(c)
2e-05
3e-05
0
1e-05
µ
2e-05
3e-05
(d)
Figure 5: Collective (a,c) and individual (b,d) information. (a,b) Using 10 hidden units. (c,d) Using 20 hidden units. For 10 hidden units, a parameter β was increased from 0 to 0.015. A parameter µ was increased from 0 to 9 × 10−5 . For 20 hidden units, a parameter β was increased from 0 to 0.04 and a parameter µ was increased from 0 to 5 × 10−5 . Note that a parameter η was always 0.01.
Figure 5 shows individual information and collective information as a function of the parameter µ. Two types of information were normalized so as to range between zero (minimum) and one (maximum). Figures 5a and b show the individual information and collective information with 10 hidden units. In Figure 5a, the collective information is increased from 0.115 (β = 0) to 0.923 (β = 0.015). From Figure 5b, we can see that the individual information can be significantly decreased as the parameter µ is increased from 0 to 9×10−5 . For example, when the parameter β is 0.015, the individual information is decreased from 0.276 (µ = 0) to 0.094 (µ = 9 × 10−5 ). Then the number of hidden units was increased from 10 to 20. As shown in Figure 5c, the collective information is increased from 0.078 (β = 0) to 0.981 (β = 0.04). The individual information is sufficiently decreased. For example, when the parameter β is 0.03, the individual information is decreased from 0.149 (µ = 0) to 0.062 (µ = 3 × 10−5 ). These results show that collective information can be increased and simultaneously individual information is decreased.
Information Controller
1371
4.3 Interpretation of Internal Representations. The information controller can generate very explicit internal representations. Figure 6 shows the strength of connections into 10 hidden units from 4 input units (wjk ), from the bias (wj0 ) and the information maximizer (xj ). Hidden units were ordered by the magnitude of the relevance of each hidden unit (Mozer & Smolensky, 1989). All the input-hidden connections except connections into the first two hidden units are almost zero. The information maximizers xj are all strongly negative for these cases. These negative information maximizers make eight hidden units (from the third to tenth hidden unit) inactive, that is, close to zero. Figure 7 shows obtained kernel networks. One of the most important characteristics of the first network is that a connection about the well formedness of a consonant cluster C1 C2 is overwhelmingly large: 26.48. This means that if a consonant cluster is permitted or well formed at the beginning of words, that is, the third input unit is on, the hidden unit has a high possibility of being strongly activated and the first and the second output units are strongly activated and inactive, respectively, meaning that a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). If a consonant cluster is ill formed at the beginning of words, the string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ), because the hidden unit is inactive by the strong negative bias and the first and the second output units are inactive and strongly activated by the bias, representing 01. Thus, the first hidden unit justly represents the maximum onset principle. The second hidden unit in Figure 7b clearly shows an exceptional case where in our artificial language at least one vowel must have an accent. As can be seen in the figure, if both vowels have no accents, the hidden unit can be activated and output units are all inactive, that is, an impossible case. 5 Maximum Onset and Sonority Principle In this section, a problem to be solved by networks becomes much more complicated. A sonority principle regulating the well formedness of consonant clusters at the beginning of words is incorporated into data of the maximum onset principle. Thus, networks must infer the sonority principle in addition to the maximum onset principle. Experimental results confirmed that generalization performance by the information controller was much better than the performance by other methods. Finally, we note that by examining closely internal representations, explicit mechanism of the network’s behavior could be observed. 5.1 Well Formedness by the Sonority Principle. To examine whether more complicated problems can also be solved in our information controller and generalization performance can be improved, additional regularity called sonority is incorporated into the data of the maximum onset principle. In our artificial language, consonant clusters at the beginning of
1372
Ryotaro Kamimura
Strength of Input-Hidden Connections
30
Input Unit No.1 Input Unit No.2 Input Unit No.3 Input Unit No.4 Bias Information Maximizer
20 10 0 -10 -20 -30 -40 -50 -60 1
2
3
4
5 6 Hidden Unit
7
8
9
10
Figure 6: Strength of connections into 10 hidden units from four input units (wjk ), from the bias (wj0 ) and the information maximizer (xj ) by the information controller. The parameters β, µ, and η were 0.015, 0.0008, and 0.01. The hidden and individual information were 0.94 and 0.129 for the information controller.
words must strictly observe the sonority principle: that each consonant has its own sonority value, as shown in Table 2 (Ladefoged, 1993; Nema, 1994). The principle can be summarized as follows: Suppose that two consonants C1 C2 must be placed at the beginning of words. If the sonority of the precedent consonant C1 is smaller than that of the following consonant C2 , then the consonant cluster is well formed or permitted at the beginning of words. Otherwise, the consonant cluster is not permitted at the beginning of words, that is, it is ill formed. Figure 8 shows a sonority system with examples. As shown in Figure 8a, a consonant cluster /pr/ is well formed, because the sonority of /p/ (sonority = 1) is less than the sonority of /r/ (sonority = 7). On the other hand, a consonant cluster /rp/ is ill formed, because the sonority of /p/ is less than the sonority of /r/, as shown in Figure 8b. 5.2 Data Encoding and Network Architectures. Figure 9 shows a network architecture used in experiments. The number of input, hidden, and output units were 13, 20, and 2 units, respectively. The first three input units represent the existence of the accent for the first vowel V1 , the types of the vowel, that is, a short vowel or a long vowel (including a diphthong), and the existence of the accent for the second vowel V2 . The figure shows that the first vowel has an accent and is a short vowel and the second vowel has no accent. The remaining input units represent two consonants represented in a phonological representation (Plunkett, Marchman, & Knudsen, 1990). In the figure, a consonant cluster /pl/ is given to the network. The consonants /p/ and /l/ are represented in 01111 10110 in the phonological representa-
Information Controller
Accent
1373
Input Units
Output Units
(V1 ) 3.09 Long/Short (V1 ) 10.77 Consonant Cluster (C1 + C2)
Accent (V2 )
Hidden Unit
-37.83
-2.05 4.63
-60.88+22.07
26.48
Bias+ Information Maximizer
-4.18
2.13
13.82
(a)
Accent (V1 )
Input Units
-3.35 Long/Short (V1 )
Output Units Hidden Unit
0.68
-2.05 -0.69
0.11 1.63-0.95 Initial (C1 + C2)
Accent (V2 )
0.33
-5.30 Bias+ Information Maximizer
2.13
-3.08 (b)
Figure 7: Kernel networks obtained by the information controller. (a) A kernel network for the maximum onset principle. (b) A kernel network for an exceptional case in which both vowels have no accent. The parameters β, µ, and η were 0.015, 0.0008, and 0.01.
1374
Ryotaro Kamimura
Table 2: Sonority Hierarchy of an Artificial Language Order
Features
Examples
1 2 3 4 5 6 7 8
Voiceless stop Voiced stop Voiceless fricative Voiced fricative Nasal Lateral Retroflex Semivowel
[p, t, k] [b, d, g] [f, θ, s] [v, δ, z] [m, n, η] [l] [r] [y, w]
Source: Nema (1994).
Consonant Cluster ( C1 C 2 )
Well-formed
Sonority(C 1)
<
(a) /pr/
Sonority(C 2)
Ill-formed
Sonority(C 1)
> =
Sonority(C 2)
(b) /rp/
Figure 8: Well formedness of consonant clusters at the beginning of words.
tion. Since the sonority of /p/ is smaller than the sonority of /l/ according to Table 2, the consonant cluster is well formed. This situation corresponds to the rule 1 of Table 1. Since the first vowel V1 has an accent, and it is a short vowel, a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). 5.3 Improving Generalization. By using the information controller, generalization performance can significantly be improved. The improved performance is due to the suppression of the overtraining. Table 3 shows cases for 100, 200, and 300 training patterns. Generalization performance by the information controller was compared with P performance 2 ), and by the standard method (β = 0 and µ = 0), weight decay ( jk wjk P 2 2 )) (Weigend et al., 1992). Generalization weight elimination ( jk wjk /(1 + wjk errors in the table were the best errors, that is, the smallest errors in terms of
Information Controller
1375
Bias
Bias (1) Accent(V1 ) (2) Long/Short(V1 ) (3) Accent(V2 ) (4) Voicing (5)
C1 (6)
Manner /p/
(7) (8) (9)
Place Output Units
Voicing
C 2 (10) Manner
(V1 C1 )(C 2 V 2)
/l/
(11)
(12) Place (13) Input Units
Hidden Units
Information Controller
Individual Information Minimizer
Collective Information Maximizer
Figure 9: A network architecture for rule 1 of Table 1. A well-formed string /pl/ is given to a network. Thus, the network must produce the segmentation that V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ).
the root-mean-squared (RMS) errors for each trial, because all the methods except the information controller frequently show overtraining. Then averages were over 7 best trials of 10 trials with different initial conditions. A parameter η was 0.05. A parameter λ for the weight decay and weight elimination was increased from 0 to 0.001 by 10−5 . A parameter β was increased from 0 to 0.007 by 0.0005. A parameter µ was increased from 0 to 0.0001 by 5 × 10−5 . Learning was considered to be finished when the absolute value between targets and outputs was lower than 0.1, that is, |ζis − Osi | < 0.1 for all the output units and all the input patterns or the number of epochs is larger than 300 epochs (×S updates). Even if the learning was forced to
1376
Ryotaro Kamimura
Table 3: Generalization Performance Comparison Generalization Errors RMSa Methods
Averages
Error Rates
Std.
Dev.b
Averages
Std. Dev.b
100 patterns Standard Weight decay Weight elimination Information controller
0.246 0.229 0.215 0.208
0.008 0.010 0.017 0.012
0.152 0.131 0.106 0.082
0.020 0.031 0.021 0.013
200 patterns Standard Weight decay Weight elimination Information controller
0.188 0.183 0.172 0.167
0.010 0.004 0.014 0.011
0.087 0.082 0.064 0.052
0.015 0.009 0.015 0.008
300 patterns Standard Weight decay Weight elimination Information controller
0.108 0.110 0.083 0.072
0.009 0.003 0.005 0.006
0.024 0.012 0.009 0.008
0.009 0.004 0.006 0.004
a RMS
= root mean squared error. devation.
b Standard
continue beyond this point, generalization errors were not improved for all the methods except the information controller. In addition to the errors in RMS, an error rate was computed for the intuitive interpretation of the results. The error rate was the number of incorrectly estimated new patterns divided by the total number of new patterns. For computing the error rate, outputs Osi were set to be one for Osi ≥ 0.5 and the outputs are considered to be zero for Osi < 0.5. Table 3 shows generalization errors for 100 input patterns. In terms of RMS, generalization errors are decreased from 0.246 (standard method) to 0.229 (weight decay), to 0.215 (weight elimination), and finally to 0.208 (information controller). In terms of error rates, the generalization errors are decreased from 0.152 to 0.131 (weight decay), to 0.106 (weight elimination), and finally to 0.082 (information controller). Regarding the generalization errors for 200 input patterns, in terms of the RMS, the best generalization error (0.167) is obtained by the information controller. In terms of the error rates, the best generalization errors (0.052) are also obtained by the information controller. When the number of input patterns was increased from 200 to 300 input patterns, the best generalization performance, 0.072, in terms of RMS and 0.008 in terms of the error rates is obtained when the information controller is used. Thus, these experimental results confirmed that in all
Information Controller
1377
0.2
0.35 Information Controller Weight Elimination Standard Method
Information Controller Weight Elimination Standard Method
0.3
0.16
Training Errors (RMS)
Generalization Errors (RMS)
0.18
0.14 0.12 0.1
0.25 0.2 0.15 0.1
0.08 0.05 0.06 0 0
50
100 150 200 Number of Epochs
250
300
0
50
(a) 0.9
0.35
300
Information Controller Weight Elimination Standard Method
0.3
0.7
Individual Information
Collective Information
250
(b) Information Controller Weight Elimination Standard Method
0.8
100 150 200 Number of Epochs
0.6 0.5 0.4 0.3
0.25 0.2 0.15 0.1
0.2 0.05
0.1 0
0 0
50
100 150 200 Number of Epochs
(c)
250
300
0
50
100 150 200 Number of Epochs
250
300
(d)
Figure 10: (a) Generalization errors. (b) Training errors. (c) Collective information. (d) Individual information. The number of hidden units was 20. The parameters β, µ, and η were 0.002, 0.0001, and 0.05, respectively. The parameter λ for the weight elimination was 0.00065.
cases, the generalization performance of the information controller is better than the other methods. Next we examine why the information controller can improve generalization performance. As the collective information is increased and the individual information is decreased, the number of hidden units and inputhidden connections is smaller, preventing networks from plunging into overtraining. Figure 10 shows generalization errors, training errors, collective information, and individual information as a function of the number of epochs, when the number of input patterns was 200 patterns. In Figures 10a and 10b, the standard method (dotted line) and weight elimination (broken line) show typical overtraining, meaning that generalization errors are increased (a), though training errors are constantly decreased (b). On the other hand, by the information controller (solid line), generalization errors are constantly decreased as training errors are decreased. In Figure 10c, the collective information by the information controller is increased to 0.775, while by weight elimination, the collective information continues to be small
1378
Ryotaro Kamimura
(0.085). Since the individual information (d) shows the same value for the information controller (0.121) and weight elimination (0.121), the collective information maximization, that is, the reduction of the necessary number of hidden units, is effective in preventing networks controlled by the information controller from plunging into overtraining. 6 Conclusion In this article, we have attempted to unify the information maximization and minimization method to improve generalization performance and provide an explicit interpretation of internal representation created by neural networks. The unification has been much facilitated by introducing an information controller located outside neural networks and used only to control the information contained in neural networks. The seemingly contradictory operation of information maximization and minimization has been solved by assuming the collective information maximizer and individual information minimizer. We have applied the methods to the acquisition of rules of artificial languages. Experimental results have shown that generalization performance can be improved significantly by the collective information maximization and simultaneously by the individual information minimization. In addition, internal representations created by the information controller have explicitly been interpreted. One problem in applying information control, especially to complex problems, should be noted. Collective information maximization and individual information minimization are mutually contradictory in terms of the control of hidden unit activity patterns. Some useless hidden units happen to be generated, for example, hidden units functioning as a bias. Although it is easy to detect these useless units, some techniques to remove them automatically in the process of information maximization and minimization will be beneficial to practical large-scale application of this method. Much attention has been paid to information maximization and minimization methods applied to neural computing. All these methods have been concerned with the exclusive maximization and minimization. However, in biological systems, it is not enough to assume that information is either exclusively maximized or minimized. The information is certainly controlled differently at different levels in the systems. Thus, the information controller for maximizing and minimizing information will have a great potential for clarifying the mechanism of information control in living systems. Acknowledgments I am deeply indebted to Shohachiro Nakanishi (Department of Electrical Engineering, Tokai University) and reviewers for their valuable comments on this article.
Information Controller
1379
References Abramson, N. (1963). Information Theory and Coding. New York: McGraw-Hill. Akiyama, Y., & Furuya, T. (1991). An extension of the back-propagation learning rule which performs entropy maximization as well as error minimization, (IEICE Tech. Rep. NC91-6). Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network 3:213–251. Barlow, H. B. (1989). Unsupervised learning. Neural Computation 1:295–311. Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy code. Neural Computation 1:412–423. Bridle, J., MacKay, D., & Heading, A. (1992). Unsupervised classifier, mutual information and phantom targets. In Neural Information Processing Systems (pp. 1096–1101). San Mateo, CA: Morgan Kaufmann. Deco, G., Finnof, W., & Zimmermann, H. G. (1993). Elimination of overtraining by a mutual information network. In Proceeding of the International Conference on Artificial Neural Networks (pp. 744–749). New York: Springer-Verlag. Deco, G., Finnof, W., & Zimmermann, H. G. (1995). Unsupervised mutual information criterion for elimination of overtraining in supervised multilayer networks. Neural Computation 7:86–107. Ishikawa, M., & Yamamoto, Y. (1994). Discovery of rules and generalization by structural learning of neural networks. Brain and Neural Networks 1(2):57–63. Kamimura, R. (1992). Entropy minimization to increase the selectivity: Selection and competition in neural networks. In Intelligent Engineering Systems Through Artificial Neural Networks (pp. 227–232). New York: American Society of Mechanical Engineers. Kamimura, R. (1994). Generation of internal representation by α-transformation. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 733–740). San Mateo, CA: Morgan Kaufmann. Kamimura, R., & Nakanishi, S. (1995). Hidden information maximization for feature detection and rule discovery. Network: Computation in Neural Systems 6:577–602. Kamimura, R., Takagi, T., & Nakanishi, S. (1995). Improving generalization performance by information minimization. IEICE Transactions on Information and Systems E78-D(2):163–173. Kato, Y., Tan, Y., & Ejima, T. (1990). A comparative study with feedforward PDP models for alphanumeric character recognition. Transaction of the Institute of Electronics, Information and Communication Engineers J73-D-II(8):1249–1254. Kroph, A., & Hertz, J. A. (1992). A simple weight decay can improve generalization. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 950–957). San Mateo, CA: Morgan Kaufmann. Ladefoged, P. (1993). A course in phonetics. New York: Harcourt Brace. Linsker, R. (1988). Self-organization in a perceptual network. Computer 21:105– 117.
1380
Ryotaro Kamimura
Linsker, R. (1989a). An application of the principle of maximum information preservation to linear system. In Neural Information Processing Systems (pp. 186–194). San Mateo, CA: Morgan Kaufmann. Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information. Neural Computation 1:402–411. Mozer, M. C., & Smolensky, P. (1989). Using relevance to reduce network size automatically. Connection Science 1(1):3–16. Nema, H. (1994). Phonotactics and sonority. Senshu Journal of Foreign Language and Education 23:65–94. Nema, H. (1995). Phonotactis and syllabification in English. Studies in the Humanities, 56:23–61. Plunkett, K., Marchman, V., & Knudsen, S. L. (1990). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. In D. S. Touretzky, J. L. Elman, & G. E. Hinton (Eds.), Connectionist Models: Proceedings of the 1990 Summer School (pp. 201–219). San Mateo, CA: Morgan Kaufmann. Tan, Y., Kato, Y., & Ejima, T. (1990). Error functions to improve learning speed in PDP models: A comparative study with feedforward PDP models. Transaction of the Institute of Electronics, Information and Communication Engineers J73-DII(12):2022–2028. Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems 2:625–639. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1992). Generalization by weight-elimination with application to forecasting. In Neural Information Processing Systems (pp. 950–957). San Mateo, CA: Morgan Kaufmann.
Received September 8, 1995; accepted October 10, 1996.
Communicated by Steven Nowlan
Controlling Hidden Layer Capacity Through Lateral Connections Kwabena Agyepong Ravi Kothari∗ Artificial Neural Systems Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030, U.S.A.
We investigate the effects of including selected lateral interconnections in a feedforward neural network. In a network with one hidden layer consisting of m hidden neurons labeled 1, 2, . . . , m, hidden neuron j is connected fully to the inputs, the outputs, and hidden neuron j + 1. As a consequence of the lateral connections, each hidden neuron receives two error signals: one from the output layer and one through the lateral interconnection. We show that the use of these lateral interconnections among the hidden-layer neurons facilitates controlled assignment of role and specialization of the hidden-layer neurons. In particular, we show that as training progresses, hidden neurons become progressively specialized—starting from the fringes (i.e., lower and higher numbered hidden neurons, e.g., 1, 2, m − 1, m) and leaving the neurons in the center of the hidden layer (i.e., hidden-layer neurons numbered close to m/2) unspecialized or functionally identical. Consequently, the network behaves like network growing algorithms without the explicit need to add hidden units, and like soft weight sharing due to functionally identical neurons in the center of the hidden layer. Experimental results from one classification and one function approximation problems are presented to illustrate selective specialization of the hidden-layer neurons. In addition, the improved generalization that results from a decrease in the effective number of free parameters is illustrated through a simple function approximation example and with a real-world data set. Besides the reduction in the number of free parameters, the localization of weight sharing may also allow for a method that allows procedural determination for the number of hidden-layer neurons required for a given learning task. 1 Introduction The influence of the size of a neural network on its generalization performance is well established (Baum & Haussler, 1989). Given a lack of explicit
∗
Corresponding author
Neural Computation 9, 1381–1402 (1997)
c 1997 Massachusetts Institute of Technology °
1382
Kwabena Agyepong and Ravi Kothari
knowledge on the size of a network to use for a given application, several techniques have been proposed to allow for good generalization (see Bishop, 1995, for an excellent discussion). We briefly describe some of the more popular methods. Network growing and pruning algorithms offer a viable mechanism for arriving at a network with acceptable performance on the training and validation data sets. These algorithms treat the network structure as a parameter of optimization along with the weights. Pruning algorithms typically start with a large network and proceed by removing weights to which sensitivity of the error is minimal (e.g., Mozer & Smolensky, 1989; Karnin, 1990; LeCun, Denker, & Solla, 1990; Hassibi & Stork, 1993; Reed, 1993). In other pruning methods, such as weight decay, a regularization term is included in the error function to be minimized, to encourage smooth mappings (Plaut, Nowlan, & Hinton, 1986; Chauvin, 1989; Weigend, Rumelhart, & Huberman, 1991). A disadvantage of network pruning methods is that the composite effect of two or more weight removals is not computationally feasible. In addition, some form of retraining is often required after the pruning process. Network growing methods (e.g., Gallant, 1986; Nadal, 1989; Fahlman & Lebiere, 1991; Cios & Liu, 1992; Bose & Garga, 1993; Kwok & Yeung, 1995), on the other hand, typically start from a small network and add neurons and interconnections as more are needed. Typically these approaches add a neuron with full connectivity to existent neurons in the network (competition or pruning may be used to alter the connectivity of the added neuron to the existent network) when a suitably chosen measure (entropy, covariance, etc.) stops decreasing. Usually the newly added neuron is trained on the residual error using one of the following possibilities: (1) fix the rest of the network parameters (adjust the weights coming into and going out of the newly added neuron by training with the entire training data) or (2) adjust the weights coming into and going out of the newly added neuron by using the training pattern(s) giving the most error. In the first case the composite learning ability of the existent hidden neurons and the newly added one is not taken into account since the existent network weights are frozen. Some form of back-fitting (Kwok & Yeung, 1995) thus has to be used to fine-tune the existent hidden units, along with training the newly added hidden unit. In the second case, which is most often used with hidden neurons using the radial basis function, the number of hidden neurons is often influenced by the noise in the training patterns. An alternative to these methods is called soft weight sharing (Nowlan & Hinton, 1992; see also Rumelhart, McClelland, & PDP Research Group, 1986), wherein groups of weights are encouraged to have equal values, allowing for a reduction in the effective number of free parameters in the network. Such grouping of weights is achieved by considering a regularization term, formed by the negative logarithm of the likelihood of the weight values modeled as a mixture of normal distributions. Soft weight sharing has the advantage that it can train large networks with small amounts of
Controlling Hidden Layer Capacity
1383
training data. However, it requires some care in the proper initialization of the weights to ensure that good solutions are found. In this article we show that the use of (local unidirectional) lateral connections in a feedforward network also leads to soft weight sharing. In particular, we show that the use of lateral connections leads to a controlled assignment of role in the hidden-layer neurons beginning with the fringe neurons (i.e., lower- and higher-numbered hidden neurons, e.g., 1, 2, m − 1, m), leaving neurons in the center of the hidden layer (i.e., hidden-layer neurons numbered close to m/2) unspecialized. Consequently, the network behaves like network growing algorithms without the explicit need to add hidden units, and like soft weight sharing due to functionally identical neurons in the center of the hidden layer. Although a reduction in the number of free parameters is a consequence of this soft weight sharing, the localization of weight sharing that seems to be characteristic of feedforward networks with the proposed lateral connections may also allow for a method that allows procedural determination for the number of hidden-layer neurons required for a given learning task. In section 2, we introduce the network architecture and a gradient-descent-based learning procedure. In section 3, we show analytically that the use of lateral connections results in the specialization of the fringe hidden neurons with functionally identical neurons in the center of the hidden layer. In section 4, we show this effect experimentally through one classification and one function approximation problem. In addition, we also show the improved generalization that results from the proposed method. In section 5, we present the conclusions. 2 Lateral Connections in a Feedforward Network In this section, we present a feedforward network with selected lateral connections and summarize a gradient descent algorithm for training it. We first introduce the nomenclature that will be used: J {(ξ µ , ζ µ ): µ = 1, 2, . . . , p} n, m, o µ
hj
µ
zj
µ
si
µ
yi
Cost function to be minimized p Input-output training patterns Number of input, hidden, and output neurons, respectively Net input to hidden-layer neuron j on pattern µ Output of hidden-layer neuron j on pattern µ Net input to output-layer neuron i on pattern µ Output of output-layer neuron i on pattern µ
Controlling Hidden Layer Capacity
1385
The output of a hidden-layer neuron is a (nonlinear) function of its net input: µ
µ
zj = f (hj )
(2.2)
An output-layer neuron receives as net input, µ
si =
m X
µ
Wij zj ,
(2.3)
j=1
and produces an output: µ
µ
yi = f (si ).
(2.4)
We note that the output-layer neurons can also employ a linear activation function. The adjustment of the weights is done to minimize the sum of squared Pp µ µ error between the output yi and the desired output ζi , that is, J = µ=1 Jµ , where: Jµ =
o 1X µ µ 2 (ζi − yi ) . 2 i=1
(2.5)
To obtain the learning algorithm, we use gradient descent to minimize Jµ . For the hidden to the output-layer weights, we obtain, µ
µ
µ
µ
µ
1Wij = (ζi − yi ) f 0 (si )zj µ µ
= ηδi zj µ
µ
(2.6) µ
µ
where, δi = (ζi − yi ) f 0 (si ). The weight-update equations for the lateral weights are: µ
∂ Jµ ∂qj,j−1 ! "Ã o X µ 0 µ = ηq δi Wij f (hj )
1qj,j−1 = −ηq
i=1
+
à m o X X
β=j+1
µ
= ηq δj +
! µ
m X β=j+1
µ
δβ
µ
δi Wiβ f 0 (hβ )
i=1 β Y α=j+1
β Y α=j+1
µ qα,α−1 hj−1
µ
qα,α−1 hj−1 ,
(2.7)
1386
Kwabena Agyepong and Ravi Kothari
P µ µ µ where, δj = ( oi=1 δi Wij f 0 (hj )) is the normal backpropagation delta at µ the hidden-layer neurons. Thus, instead of backpropagating δi through the µ output–hidden weights, lateral backpropagation propagates δi through the output–hidden weights, as well as the lateral weights. The update equations for the input to the hidden-layer weights are similar to the above: µ
∂ Jµ ∂wjk "Ã o X
1wjk = −η =η
! µ µ δi Wij f 0 (hj )
i=1
+
à m o X X
β=j+1
µ
= η δj +
! µ
m X
µ
β=j+1
δβ
µ
δi Wiβ f 0 (hβ )
i=1 β Y α=j+1
β Y α=j+1
µ qα,α−1 ξk
µ
qα,α−1 ξk
(2.8)
These weight-update equations are identical to what one might obtain using backpropagation for the proposed architecture. For ease of reference, however, we term backpropagation for a feedforward network with lateral weights as lateral backpropagation (Kothari & Agyepong, 1996). We show next that in lateral backpropagation as training progresses, hidden neurons become progressively specialized—starting from the fringes and leaving the neurons in the center of the hidden layer unspecialized, that is, weight sharing occurs in the central portion of the hidden layer. 3 Controlled Specialization of Hidden-Layer Neurons Without loss of generality, we assume a single output unit—the network of size n×m×1 (n inputs, m hidden-layer neurons, and 1 output). Enumerating the net input received by the hidden-layer neuron (see equation 2.1), we obtain: µ
hj =
n X
µ
wjk ξk + qj,j−1
k=1
n X
n ¡ ¢X µ µ wj−1,k ξk + qj,j−1 qj−1,j−2 wj−2,k ξk
k=1
¡
+ · · · + qj,j−1 qj−1,j−2 qj−2,j−3 . . . q21
k=1 n ¢X
µ
w1k ξk
(3.1)
k=1
If all the lateral weights are initialized to be equal, that is, qj,j−1 = q∗ , and all the input to hidden weights is also initialized to be equal, that is, wjk = w∗ ,
Controlling Hidden Layer Capacity
1387
equation 3.1 simplifies to: µ
hj =
n X
µ
w∗ ξ k
k=1
à =
1 − q∗j 1 − q∗
³
1 + q∗ + q∗2 + . . . + q∗j−1
!
n X
´
µ
w∗ ξk .
(3.2)
k=1
From equation 3.2, one may observe that beyond some value t corresponding to a neuron position in the hidden layer, the net input to the hidden neurons will be the same for j > t, where 1 ≤ t ≤ m, when q∗ < 1. µ ˜ for t < j ≤ m. In other words, hj = h; µ The expression for zj is: µ
zj =
1 1+e
µ
−hj
1
= 1+e
1−q∗j − 1−q∗
Pn k=1
µ
w∗ ξk
.
(3.3)
In the forward pass, hidden neurons j ≤ t differentiate (i.e., have different outputs), while hidden neurons j > t behave similarly. We next show what happens during the weight updates. For the output µ µ µ µ to hidden weights, from equation 2.6, we have 1Wij = ηδi zj where δi = µ µ 0 µ µ µ 0 µ (ζi − yi ) f (si ). For a single output (ζi − yi ) f (si ) is a constant for all µ µ weights in each pass. Thus, 1Wij has the same profile as zj , which has the µ same profile as hj . For a single-output neuron (i = 1), enumerating the expression for the lateral weight updates (see equation 2.7), we get: h µ µ µ µ µ ∗m−j µ hj−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−j−1 hj−1 1qj,j−1 = ηq δi Wim f 0 (hµ m )q i µ µ µ µ µ µ + · · · + δi Wi,j+1 f 0 (hj+1 )q∗ hj−1 + δi Wij f 0 (hj )hj−1 . (3.4) Rewriting the above, we get: µ
µ
h
µ
µ
µ
∗m−j hj−1 + Wi,m−1 f 0 (hm−1 )q∗m−j−1 hj−1 Wim f 0 (hµ m )q i µ µ µ µ + · · · + Wi,j+1 f 0 (hj+1 )q∗ hj−1 + Wij f 0 (hj )hj−1 . (3.5)
1qj,j−1 = ηq δi
Since from equation 3.2, the output of hidden neurons j > t is the same, µ µ = zj (1−zj ) = f˜, ∀t < j ≤ m. Further, if the hidden to output weights
µ f 0 (hj )
1388
Kwabena Agyepong and Ravi Kothari
˜ we obtain, are initialized to be equal (Wij = W), ¤ µ £ µ 1qm,m−1 = ηq δi Wim f 0 (hµ m ) hm−1 µ ˜ ˜˜ fh = ηq δi W
1qm−1,m−2
µ ∗¶ µ ˜ ˜˜ 1 − q = ηq δi W fh 1 − q∗ ¤ µ £ µ µ ∗ 0 µ = ηq δi Wim f 0 (hµ m )q + δi Wi,m−1 f (hm−1 ) hm−2 i h µ ˜ ˜ ∗ µ ˜ ˜ ˜ f q + δi W f h = ηq δi W
¢ µ ˜ ˜¡ ∗ f q + 1 h˜ = ηq δi W µ ∗2 ¶ µ ˜ ˜˜ 1 − q = ηq δi W f h 1 − q∗
(3.6)
(3.7)
.. .
h µ µ µ ∗m−t−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−t−2 1qt+1,t = ηq δi Wim f 0 (hµ m )q ¤ µ µ µ µ µ + · · · + δi Wi,t+2 f 0 (ht+2 )q∗ + δi Wi,t+1 f 0 (ht+1 ) ht ³ ´ µ ˜ ˜ ∗m−t−1 µ f q = ηq δi W + q∗m−t−2 + · · · + q∗ + 1 ht µ ∗m−t ¶ µ ˜ ˜ µ 1−q = ηq δi W f ht 1 − q∗ h µ µ µ m−t + δi Wi,m−1 f 0 (hm−1 )q∗m−t−1 1qt,t−1 = ηq δi Wim f 0 (hµ m )q µ
µ
µ
(3.8)
µ
+ · · · + δi Wi,t+2 f 0 (ht+2 )q∗2 + δi Wi,t+1 f 0 (ht+1 )q∗ µ µ ¤ µ + δi Wi,t f 0 (ht ) ht−1 ³ ´ µ ˜ ˜ ∗m−t µ f q = ηq δi W + q∗m−t−1 + · · · + q∗2 + q∗ ht−1 µ
µ
µ
+ηδi Wit f 0 (ht )ht−1 µ ∗ ¶ q − q∗m−t+1 µ ˜ ˜ µ µ µ µ f ht−1 = ηq δi W + ηq δi Wit f 0 (ht )q∗ ht−1 1 − q∗ .. .
h µ µ µ ∗m−2 + δi Wi,m−1 f 0 (hm−1 )q∗m−3 1q21 = ηq δi Wim f 0 (hµ m )q µ
µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t−1 + δi Wi,t f 0 (ht )q∗t−2 µ
µ
µ
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−3 + · · · + δi Wi,3 f 0 (h3 )q∗ µ µ µ¤ + δi Wi,2 f 0 (h2 ))h1 ³ ´ µ ˜ ˜ ∗m−2 µ f q + q∗m−3 + · · · + q∗t−1 h1 = ηq δi W h µ µ µ +ηδi Wit f 0 (ht )q∗t−2 + Wi,t−1 f 0 (ht−3 )q∗t−1 + · · ·
(3.9)
Controlling Hidden Layer Capacity µ µ ¤ µ + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) h1 µ ∗t−1 ¶ − q∗m−1 µ ˜ ˜ µ q = ηq δi W f h1 1 − q∗ h µ µ µ +ηδi Wit f 0 (ht )q∗t + Wi,t−1 f 0 (ht−1 )q∗t−1 + · · · + µ µ µ µ ¤ µ + δi Wi3 f 0 (h3 )q∗ + δi Wi2 f 0 (h2 ) h1
1389
(3.10)
Summarizing, we have the approximate proportionalities, ¡ ¢ 1qm,m−1 ∝ 0 1 − q∗ ³ ´ 1qm−1,m−2 ∝ 0 1 − q∗2 .. .
³ ´ 1qT,T−1 ∝ 0 1 − q∗m−T+1 .. .
¡ ¢ 1qt+1,t ∝ 0 1 − q∗m−t ¡ ¢ 1qt,t−1 ∝ 0 1 − q∗m−t q∗ .. .
¢ ¡ 1q21 ∝ 0 1 − q∗m−t q∗t−1
(3.11)
µ ˜ ˜ µ f hm /1 − q∗, and T > t corresponds to some neuron posiwhere 0 is ηδi W tion in the hidden layer. From equation 3.11, three categories clearly emerge. From hidden neuron 1 through t, the term inside the parentheses approaches 1, and the behavior of these neurons is dominated by the multiplicative terms q∗t−1 , . . . , q∗ , and differentiation occurs. From hidden neuron T through m, the expression approaches 1 gradually, due to (1 − q∗ ), . . . , (1 − q∗m−T−1 ). Consequently, differentiation again occurs. From hidden neuron t + 1 through T − 1, the updates have approached 1, since q < 1. Consequently, these neurons behave similarly (as in a herd). The input to hidden weights update provides similar results as far as the profile (behavior of the weights) in the update is concerned. The trend is noticed for each feature of the input vector, in that the hidden input weights from the kth feature position have a similar profile as the lateral weights. In detail, then, the hidden to input weights are similarly,
£ µ ¤ µ 1wmk = η δi Wim f 0 (hµ m ) ξk µ ∗¶ µ ˜ ˜ µ 1−q = ηδi W f ξk 1 − q∗ £ µ ¤ µ µ µ 0 µ ∗ 1wm−1,k = η δi Wim f (hm )q + δi Wi,m−1 f 0 (hm−1 ) ξk
(3.12)
1390
Kwabena Agyepong and Ravi Kothari
i h µ ˜ ˜ ∗ µ ˜ ˜ µ f q + δi W f ξk = η δi W µ ∗2 ¶ µ ˜ ˜ µ 1−q f ξk = ηδi W 1 − q∗
(3.13)
.. .
h µ µ µ ∗m−t−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−t−2 1wt+1,k = η δi Wim f 0 (hµ m )q ¤ µ µ µ µ µ + · · · + δi Wi,t+2 f 0 (ht+2 )q∗ + δi Wi,t+1 f 0 (ht+1 ) ξk ´ ³ µ ˜ ˜ ∗m−t−1 µ = ηδi W + q∗m−t−2 + · · · + q∗ + 1 ξk f q µ ∗m−t ¶ µ ˜ ˜ µ 1−q f ξk = ηδi W 1 − q∗ h µ µ µ ∗m−t + δi Wi,m−1 f 0 (hm−1 )q∗m−t−1 1wtk = η δi Wim f 0 (hµ m )q µ
µ
µ
(3.14)
µ
+ · · · + δi Wi,t+2 f 0 (ht+2 )q∗2 + δi Wi,t+1 f 0 (ht+1 )q∗ µ µ ¤ µ + δi Wit f 0 (ht ) ξk ³ ´ µ ˜ ˜ ∗m−t µ f q = ηδi W + q∗m−t−1 + · · · + q∗2 + q∗ ξk µ
µ
+ηδi Wit ft0 ξk µ ∗ ∗m−t+1 ¶ µ ˜ ˜ µ q −q µ = ηδi W f ξk + ηδWit ft0 q∗ ξk 1 − q∗
(3.15)
.. .
h µ µ µ ∗m−2 + δi Wi,m−1 f 0 (hm−1 )q∗m−3 1w2k = η δi Wim f 0 (hµ m )q µ
µ
µ
µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t−1 + δi Wi,t f 0 (ht )q∗t−2 µ
1w1k
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−3 + · · · + δi Wi,3 f 0 (h3 )q∗ µ µ ¤ µ + δi Wi2 f 0 (h2 )) ξk ³ ´ µ ˜ ˜ ∗m−2 µ µ£ µ f q = ηδi W + q∗m−3 + · · · + q∗t−1 ξk + ηδi Wit f(0 ht )q∗t i µ µ µ µ + Wi,t−1 f 0 ht−1 q∗t−1 + · · · + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) ξk µ ∗t−1 ¶ − q∗m−1 µ ˜ ˜ µ q = ηδi W f ξk 1 − q∗ h µ µ µ +ηδi Wit f(0 ht )q∗t + Wi,t−1 f 0 ht−1 q∗t−1 + · · · + µ µ ¤ µ (3.16) + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) ξk h µ µ 0 µ ∗m−1 0 µ ∗m−2 = η δi Wim f (hm )q + δi Wi,m−1 f (hm−1 )q µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t + δWit f 0 (ht )q∗t−1 µ
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−2
Controlling Hidden Layer Capacity µ µ µ ¤ µ + · · · + δWi,2 f 0 (h2 )q∗ + δi Wi,1 f 0 (h1 ) ξk ³ ´ µ ˜ ˜ ∗m−1 µ f q = ηδi W + q∗m−2 + · · · + q∗t ξk h µ µ µ +ηδi Wit f 0 (ht )q∗t−1 + · · · + Wi,t−1 f 0 (ht−1 )q∗t−2 ¤ µ µ µ + Wi2 f 0 (h2 )q∗ + Wi,1 f 0 (h1 ) ξk µ ∗t ∗m ¶ µ ˜ ˜ µ q −q f ξk = ηδi W 1 − q∗ h µ µ +ηδi Wit f 0 (ht )q∗t−1 + · · · + Wi,t−1 f 0 h( t − 1µ )q∗t−2 µ µ ¤ µ +Wi2 f 0 (h2 )q∗ + Wi,1 f 0 (h1 ) ξk
1391
(3.17)
The correspondence in the form of the lateral weight updates (see equations 3.6 through 3.10) to the hidden to input weight updates (see equations 3.12 through 3.17) enables us to make an inference on the form of the profile. For the kth input, the update of all weight to the hidden neurons will preserve the activation profiles discussed above. We have shown how three distinct regions result in the profile of the lateral weights after the first update. Subsequent passes (directly driven by the output error size) serve to reinforce and expand the regions (except the herd); that is, the total number of hidden neurons added to the differentiated regions tends to increase. The forward pass after the first update produces weighted sums and activations of the hidden neurons that maintain the three regions. This is because neurons outside the herd are dependent on the input to hidden weights and the lateral weights, which have been shown to be differentiated. Thus, the parameters ensure a preservation of the regions (hidden neurons 1 through t, the herd t + 1 through T − 1, and T through m). During the weight updates, the differentiation in weights continues to be reinforced, resulting in an increase in number of the neurons that move away from the herd. This is explained by the fact that the weighted sum between adjacent hidden neurons is multiplied by a lateral weight. As these weights move away from the herd, the effect on the adjacent neuron becomes more pronounced. Thus, the neurons begin to move away from the herd. The hidden (T to m) to output neuron weights are not differentiated on the first update. After the first update, these weights start differentiating due to the differentiation of the lateral weights and also the hidden to input weights. 4 Simulations We present four simulations to illustrate various aspects of the proposed paradigm. In support of the analysis in the previous section, we present two simulations: a classification problem and a function approximation
1392
Kwabena Agyepong and Ravi Kothari
problem. The goal of these two simulations is to illustrate the selective specialization of the hidden-layer neurons from the fringes. Due to the effective reduction in the free parameters as a consequence of functionally identical hidden-layer neurons, one might anticipate better generalization. To illustrate this, we performed two additional simulations. The first of these latter two is a simple problem that clearly indicates better generalization while also allowing for easy visualization of the result. The second is based on the yearly sunspot data set. We also present comparative performance with backpropagation applied to a standard feedforward network (with no lateral connections). To make the comparisons meaningful, for the former two simulations we trained lateral backpropagation and backpropagation for the same number of epochs. Since the latter two simulations are intended to demonstrate generalization performance, we trained lateral backpropagation and backpropagation until both achieved the same error. 4.1 Simulation 1. Through this two-class classification problem, we show the selective differentiation of the hidden-layer neurons, as indicated by the analysis of the previous section. The decision boundary separating the two classes is specified by: −x + 30 f (x) = 0.2x2 − 6x + 50 x
, x ∈ [7.5, 5] , x ∈ [5, 25] , x ∈ [25, 35]
(4.1)
A network of size 2 × 15 × 1 was used (2 inputs, a hidden layer with 15 neurons, and 1 output). The hidden layer had 14 lateral weights. All initial network weights, excluding lateral weights, were initialized to 0.1. Lateral weights were initialized to 0.01. A learning rate of η = 0.1, ηq = 0.1 was used. All neurons used had the sigmoidal activation function. A total of 288 training patterns were used, and at the end of 1000 epochs, the network had learned to classify 92 percent of the patterns correctly. The desired classification is shown in Figure 2a, and the actual classification after 0, 300, 650, and 1000 epochs (see Figures 2b–e). The values of the network parameters at the end of these epochs are shown in Figure 3. In the first row, the values of the biases of the hidden neurons are shown at 0, 300, 650, and 1000 epochs. In the second row, the values of weights from the first input to the hidden neurons (wj1 ) are shown at 0, 300, 650, and 1000 epochs. In the third row, the values of weights from the second input to the hidden neurons (wj2 ) are shown at 0, 300, 650, and 1000 epochs. In the fourth row, the values of weights from the hidden to the output neurons (W1j ) are shown at 0, 300, 650, and 1000 epochs. In the fifth row, the values of lateral weights (qj,j−1 ) are shown at 0, 300, 650, and 1000 epochs. The final weights clearly show a herd located in the center of the network. Only 5 of the total available 15 hidden neurons were used differently from the herd.
Controlling Hidden Layer Capacity
1393
Figure 2: (a) Desired classification and the actual classification after (b) 0, (c) 300, (d) 650, and (e) 1000 epochs.
For comparison, we also present the results from running this same simulation with backpropagation on a purely feedforward network (with no lateral connections). We ran the network in two configurations: one with 15 hidden-layer neurons (the same as that used for lateral backpropagation) and the second with 5 hidden-layer neurons (the effective, i.e., differentiated, number of hidden neurons as determined above). We used an identical learning rate of η = 0.1, and trained the network in each case for 1000 epochs (the same as above). Since having equal weights results in backpropagation not learning at all, we randomly initialized the weights to lie between −0.5 and +0.5. Figure 4a shows the results of training backpropagation with 15 hidden-layer neurons, while Figure 4b shows the result of training with 5 hidden-layer neurons. The performance is roughly equivalent in both cases and worse than what was obtained with lateral backpropagation. 4.2 Simulation 2. In this simulation, we again show the selective differentiation of the hidden-layer neurons, as indicated by the analysis of the previous section; however, here we illustrate it using a function approxi-
1394
Kwabena Agyepong and Ravi Kothari
Figure 3: (a) Values of the bias, (b) weights from the first input, (c) weights from the second input, (d) weights from the hidden-layer neurons to the output, and (e) lateral weights. All are shown after 0, 300, 650, and 1000 epochs, respectively, for the classification problem of Figure 2a. Note the differentiation by the outermost hidden-layer neurons, while the hidden neurons in the center remain undifferentiated.
Figure 4: Actual classification for the classification problem of Figure 2a with a purely feedforward network, with (a) 15 and (b) 5 hidden-layer neurons.
Controlling Hidden Layer Capacity
1395
mation problem. The simulation is approximation of a decaying absolute valued sinusoidal function; 61 training points were used. A network of size 1 × 20 × 1 was used (1 input, a hidden layer with 20 neurons, and 1 output). The hidden layer had 19 lateral weights. All initial network weights excluding lateral weights were initialized to 0.1. Lateral weights were initialized to 0.01. A learning rate of η = 0.1, ηq = 0.1 was used. All neurons used had the sigmoidal activation function. The desired and actual approximation at 0, 6000, 9000, and 12,000 epochs are shown in Figure 5. In each of the plots, the function to be approximated is shown by solid lines, and the network approximation is shown by broken lines. The values of the network parameters at the end of these epochs are shown in Figure 6. In the first row, the values of the biases of the hidden neurons are shown at 0, 6000, 9000, and 12,000 epochs. In the second row, the values of weights from the input to the hidden neurons (wj1 ) are shown at 0, 6000, 9000, and 12,000 epochs. In the third row, the values of weights from the hidden to the output neurons (W1j ) are shown at 0, 6000, 9000, and 12,000 epochs. In the fourth row, the values of lateral weights (qj,j−1 ) are shown at 0, 6000, 9000, and 12,000 epochs. Notice again how the final weights clearly show a herd located in the center of the network. Theoretically, two sigmoids should be able to approximate each mode (hump) in this function. Only 9 of the total available 20 hidden neurons were used differently from the herd. Exact classification for the first simulation and extremely good approximation for this simulation were obtained when the network was trained beyond the epochs noted. The figures are derived from arbitrarily chosen points in the training cycle to show selective specialization of the hiddenlayer neurons. However, the observed characteristics replay constantly. For comparison, we also present the results from running the same simulation with backpropagation on a purely feedforward network (with no lateral connections). We ran the network in two configurations: one with 20 hidden-layer neurons (the same as that used for lateral backpropagation) and the second with 9 hidden-layer neurons (the effective, i.e. differentiated, number of hidden neurons as determined above). We used an identical learning rate of η = 0.1 and trained the network in each case for 12,000 epochs (the same as above). Since having equal weights results in backpropagation not learning at all, we randomly initialized the weights to lie between −0.5 and +0.5. Figure 7a shows the results of training backpropagation with 20 hidden-layer neurons, and Figure 7b shows the result of training with 9 hidden-layer neurons. The performance of lateral backpropagation was significantly better compared to each of the cases in Figure 7. Indeed, backpropagation was unable to approximate the third mode of the function at the end of 12,000 epochs. We have analyzed this behavior in some detail (Kothari & Agyepong, 1996; see also Fahlman & Lebiere, 1991), but, briefly, backpropagation has difficulty in approximating such a function because incorrect approximation of the third hump has a lower error (penalty) than incorrect approximation of the other two humps. Since the
1396
Kwabena Agyepong and Ravi Kothari
Figure 5: Approximation at (a) 0, (b) 6000, (c) 9000, and (d) 12,000 epochs (shown by broken lines). The function to be approximated is shown by solid lines.
hidden-layer neurons in a strictly feedforward network evolve independent of each other, all or most them independently evolve to reduce the larger sources of error. Only when the errors are comparable are the smaller sources of error targeted. The second hump in this case yields a smaller total error than the first hump, and the third hump yields the lowest total error. Consequently, more attention is initially paid to these humps. Such a situation also arises in the case of imbalanced classification problems (one class has many more training samples than another class). 4.3 Simulation 3. In contrast to the first two simulations, this simulation is intended to demonstrate better generalization in lateral backpropagation due to the automatic reduction in the effective number of free parameters. Toward that end, we consider a simple problem: approximating a function that is linear (see Figure 8). A network of size 1 × 12 × 1 was used (1 input, a hidden layer with 12 sigmoidal neurons, and 1 linear output neuron). Figure 8 shows that the training data points that the proposed algorithm (lateral backpropagation) was trained with. For comparison, we also ran backpropagation on a network with the same number of hidden neurons.
Controlling Hidden Layer Capacity
1397
Figure 6: (a) Values of the bias, (b) weights from the input, (c) weights from the hidden layer neurons to the output, and (d) lateral weights shown after 0, 6000, 9000, and 12,000 epochs, respectively, for the approximation problem of Figure 5. Note the differentiation by the outermost hidden-layer neurons while the hidden neurons in the center remain undifferentiated.
Figure 7: Approximation with a purely feedforward network with 20 and 9 hidden-layer neurons shown in panels (a) and (b), respectively. Compare with the performance of lateral backpropagation in Figure 5.
1398
Kwabena Agyepong and Ravi Kothari
Figure 8: Performance of lateral backpropagation (BP) and backpropagation on a network of 1 input, 12 sigmoidal hidden neurons, and 1 linear output neuron. The learning rate used was 0.1, and both networks were run to the same error (0.1). Note the considerably smoother approximation with the lateral connections, despite an abundance of free parameters.
A learning rate of 0.1 was used for both networks, and both were trained to the same sum of squared error (0.1). One may observe the considerably smoother approximation afforded by the presence of the lateral connections, despite the presence of far more parameters than are required. We also observed considerable variability in the generalization performance obtained with backpropagation with changing training parameters. The simulation nonetheless highlights the poor generalization performance that results with backpropagation, especially when there are an excessive number of free parameters. In contrast, the generalization performance of lateral backpropagation is much more robust. 4.4 Simulation 4. In this final simulation, we further illustrate the better generalization resulting from the use of the proposed lateral connections, using real-world data. We consider the yearly sunspot data from 1770–1920, as shown in Figure 9. We use the data from 1770 to 1869 for training and the data from 1870 to 1920 for ascertaining the generalization performance. Denoting the yearly sunspot data by x(t), the input to the networks consisted of x(t−1), and x(t−2), and the desired output was x(t). We considered networks
Controlling Hidden Layer Capacity
1399
Figure 9: Yearly sunspot data, 1770 to 1920. Data from 1770 to 1869 are used for training purposes, while data from 1870 to 1920 are used to ascertain the generalization.
with varying number of hidden-layer neurons and in each case compared the result to standard feedforward networks trained using backpropagation. For each comparison, we trained the network with backpropagation and with lateral backpropagation until they had the same sum-of-squared error on the training data (0.9). A learning rate of η = 0.1, ηq = 0.01 was used in all the sunspot simulations. Figure 10 summarizes the sum-of-squared error, on the generalization data, obtained using backpropagation and lateral backpropagation with different numbers of hidden-layer neurons. As expected, because of functionally identical neurons (only three neurons were used with lateral backpropagation) in the center of the hidden layer, the generalization performance of the proposed algorithm (lateral backpropagation) does not change substantially. In contrast, standard feedforward networks are extremely sensitive to the choice of the number of hidden-layer neurons. 5 Discussion and Conclusion We have shown that the use of lateral interconnections among the hiddenlayer neurons facilitates controlled assignment of role and specialization of the hidden-layer neurons. In particular, we showed that as training progresses, hidden neurons become progressively specialized—starting from the fringes and leaving the neurons in the center of the hidden layer unspe-
1400
Kwabena Agyepong and Ravi Kothari
Figure 10: Generalization performance of lateral backpropagation (BP) and backpropagation on the sunspot data, with varying number of hidden-layer neurons. Note the increasingly better performance of backpropagation with a reduced number of hidden-layer neurons. The proposed method (lateral backpropagation) is relatively insensitive to the choice of the number of hidden-layer neurons. Each network was trained to the same error on the training data.
cialized. Four simulations—two to demonstrate the selective specialization of hidden-layer neurons and two to demonstrate the better generalization resulting from functionally identical neurons in the center of the hidden layer—were presented. Comparative performance with backpropagation for standard feedforward networks was also presented. While criteria such as early stopping (stopping when the error on a validation data set starts increasing) can be used with standard backpropagation to increase the generalization performance, the proposed paradigm does not require independent validation data, which may be expensive to obtain or may not be available. Alternatively, in lateral backpropagation one can also stop the training when the herding effect in the hidden layer starts to diminish. The result of having identically behaving neurons in the central part of the hidden layer makes available a simple learning paradigm in which the effective number of free parameters is automatically reduced. In that sense, the proposed method is in the same spirit as the soft weight sharing scheme of Nowlan and Hinton (1992). However, the proposed method may also be viewed as a network growing algorithm without the explicit need to add neurons. In addition, the localization of this weight sharing at the hidden layer may allow procedural determination of the number of hidden
Controlling Hidden Layer Capacity
1401
layer neurons required. It may also be possible to replace the weight-shared portion of the network with a single neuron whose incoming and outgoing weights are set such that the network is identical in performance. Such replacements as training progresses may also provide for faster training with large networks. Acknowledgments This research was supported in part by a UC/OBR Research Challenge Grant. References Baum, E. B., and Haussler, D. (1989). What sized net gives valid generalization? Neural Computation 1(1):151–160. Bishop, C. M. (1995). Neural networks for pattern recognition. NY: Oxford University Press. Bose, N. K., & Garga, A. K. (1993). Neural network design using Voronoi diagrams. IEEE Transactions on Neural Networks 4(5):778–787. Chauvin, Y. S. (1989). A back-propagation algorithm with optimal use of hidden units. In D. S. Touretzky, (Ed.), Advances in Neural Information Processing Systems 2 (pp. 642–649). San Mateo, CA: Morgan Kaufmann. Cios, K. J., & Liu, N. (1992). A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions Neural Networks 3(2):280–291. Fahlman, S. E., & Lebiere, C. (1991). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann. Gallant, S. I. (1986). Three constructive algorithms for network learning. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (pp. 652–660). Amherst, MA. Hassibi, B., & Stork, D. G. (1993). Second order derivatives for networks pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann. Karnin, E. D. (1990). A simple procedure for pruning back propagation trained neural networks. IEEE Transactions on Neural Networks 1(2):239–242. Kothari, R., & Agyepong, K. (1996). On lateral connections in feed-forward neural networks. In Proc. IEEE International Conference on Neural Networks (pp. 13– 18). Kwok, T., & Yeung D. (1995). Constructive feedforward neural networks for regression problems: A survey (Tech. Rep. HKUST-CS95-43). Hong Kong University of Science and Technology. LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
1402
Kwabena Agyepong and Ravi Kothari
Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 107–115). San Mateo, CA: Morgan Kaufmann. Nadal, J. P. (1989). Study of a growth algorithm for neural networks. International Journal of Neural Systems 1:55–59. Nowlan, S. J., & G. E. Hinton. (1992). Simplifying neural networks by soft weight sharing. Neural Computation 4(4):473–493. Plaut, D. C., Nowlan, S. J., & Hinton, G. E. (1986). Experiments on learning by backpropagation (Tech. Rep. CMU-CS-86-126). Pittsburgh, PA: Carnegie Mellon Univerisity. Reed, R. (1993). Pruning algorithms—A survey. IEEE Transactions on Neural Networks 4(5):740–747. Rumelhart, D. E., McClelland, J. L., & PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight elimination with application to forecasting. In R. P. Lippman, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 875–882). San Mateo, CA: Morgan Kaufmann.
Received June 27, 1996; accepted October 22, 1996.
ARTICLE
Communicated by John Tsitsiklis
Mean-Field Theory for Batched TD(λ) Fernando J. Pineda Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20723, U.S.A.
A representation-independent mean-field dynamics is presented for batched TD(λ). The task is learning to predict the outcome of an indirectly observed absorbing Markov process. In the case of linear representations, the discrete-time deterministic iteration is an affine map whose fixed point can be expressed in closed form without the assumption of linearly independent observation vectors. Batched linear TD(λ) is proved to converge with probability 1 for all λ. Theory and simulation agree on a random walk example. 1 Introduction The general task of learning from delayed reinforcement is ubiquitous in nature, and it is one of the most difficult tasks in machine learning. One form of delayed reinforcement learning is the TD(λ) algorithm proposed by Sutton (1988). The algorithm attempts to learn a function f (.) that approximates the expected outcome of an absorbing Markov process. The algorithm observes the Markov process as a sequence of observation vectors, {xt }t=1,...,T , terminated by an outcome z. Loosely speaking, the algorithm tries to find a set of parameters, w, such that f (w, x) ≈ E{z|x}. TD(λ) solves the delayed reinforcement learning problem by enforcing a kind of continuity assumption: that temporally adjacent observations should result in similar predictions. A unique feature of the algorithm is a parameter λ for controlling the tradeoff between bias and variance. When λ = 1, variance is greatest, and the algorithm produces the same weight updates as the batched Widrow-Hoff algorithm. When λ = 0, bias is greatest, and in this case Sutton showed that if one assumes that the set of observation vectors is linearly independent and the learning rate is chosen small enough, then the predictions of TD(0) converge in expected value to the fully observed predictions: lim E{ f (w(p) , xi )} → E{z|xi } = E{z|i}.
p→∞
(1.1)
Dayan (1992) extended Sutton’s convergence result to general λ. He demonstrated that with no further assumptions, TD(λ) for 0 ≤ λ < 1 also converged in expected value. More recently, several investigators noted that stochastic approximation methods could be used to make stronger statements about convergence. In particular, Dayan and Sejnowski (1994) noted Neural Computation 9, 1403–1419 (1997)
c 1997 Massachusetts Institute of Technology °
1404
Fernando J. Pineda
that the conditions required to prove convergence in expected value, together with bounds on the variance of the predictions, were exactly what was required by theorems due to Kushner and Clark (1978) to prove convergence with probability one (w.p.1). Jaakkola, Jordan, and Singh (1994) proved a quite general convergence theorem and then used it to prove convergence w.p.1 of batched and on-line TD(λ). Tsitsiklis (1994) proved a similar theorem and noted that it could be applied to TD(λ). The assumption of linear independence of the observation vectors is equivalent to using a table look-up representation. Table look-up representations require memory that grows linearly with the number states in the Markov system. Compact or function-approximation representations, on the other hand, require only enough memory to store the adjustable parameters of a suitable approximation function. To use a compact representation, a learning algorithm needs to capture the underlying regularity of the task. The hope is that the learning algorithm need only observe a small fraction of the accessible states to generalize to unobserved states. In principle, the ability of compact representations to generalize enables learning machines to tackle otherwise intractable tasks. A powerful demonstration of generalization within the TD(λ) framework was provided by Tesauro (1992), who used TD(λ) with λ = 0 to train an agent on the task of learning to play backgammon. The agent handily defeated other automated backgammon players and recently achieve master-level play (Tesauro, 1995). Some theoreticians have tackled compact representations in the setting of Q learning (Singh, Jaakkola, & Jordan, 1995; Gordon, 1996), but until recently, little was known formally about TD(λ) in this setting. The principal technical contribution of this report is the elaboration of the mean-field dynamics of TD(λ) in the case of compact representations. I derive general representation-independent results and then apply these results to linear TD(λ). I derive a closed-form solution for the linear TD(λ) fixed point and prove convergence w.p.1 to this fixed point. The mean-field theory described in this article (except for the proof in section 4) was initially presented at Neural Networks for Physicists 5 (Pineda, 1995). Since that time, I have become aware of the independent work of Tsitsiklis and Van Roy (1996a, 1996b), who have developed a powerful framework for applying dynamic programming to compact representations. They recently applied their methodology to TD(λ). Some of their results are similar to those reported here except that they are derived in the context of infinite horizon Markov chains, while the results presented here are derived in the context of finite horizon Markov chains. It is also my understanding that Bertsekas and Tsitsiklas have a forthcoming treatment of the finite horizon case (Bertsekas and Tsitsiklas, 1996). In some ways, the framework presented here is more limited than the general on-line approach of Tsitsiklis and Van Roy. Nevertheless, the approach presented here is self-contained and should be useful for further investigation of TD(λ) in the absorbing Markov setting.
Mean Field Theory for Batched TD(λ)
1405
2 Preliminaries 2.1 Definitions. It is useful, at the outset, to define the notation that I shall use throughout. To this end, let the state space of a finite Markov process be the direct sum A ⊕ T of two state spaces. These are A, the set of absorbing states that has cardinality nA , and T, the set of transient states that has cardinality nT . A probability vector p is a vector with positive entries and unit norm. Assume that initially the states of the system are distributed with prior probabilities p(0) . At the conclusion of t state transitions, the states of the system are distributed with probabilities p(t) . The transition matrix P for the Markov chain transforms probability vectors at time t into probability vectors at time t + 1. The matrix element pij of P is the probability that the system transitions from state i to state j in one time step. P can be partitioned into four submatrices according to how each submatrix transforms states: µ ¶ I R P= . (2.1) 0 Q I is the identity matrix. Qij is the probability of making a transition from transient state j to transient state i. Rij is the probability of making a transition from transient state j to absorbing state i. The ith absorbing state is associated with a scalar “outcome” zi . Together, the outcomes define an nA -dimensional vector z. The observation is represented by a function Ä(.) that maps states to nw dimensional observation vectors, that is, xi = Ä(i). The observation matrix X is the (nw ×nT )-dimensional matrix whose columns consist of observation vectors. The observation function, Ä, need not be invertible. This allows for the possibility that the Markov process is partially observed. Moreover, if the function Ä is stochastic, then we are operating within a hidden Markov setting. This is an obvious generalization of the approach described in this article. I do not treat this case explicitly so as to keep the presentation simple. The Markov process induces an ensemble of random sequences. A randomly drawn sample from this ensemble is a sequence s that consists of T observations and one outcome. The length of the sequence T + 1 is a stochastic variable. The tth transient state in the sequence is denoted by s(t). It corresponds to an observation xs(t) . The last state in the sequence is s(T+1) and corresponds to an outcome zs(T+1) . All expectations that are taken in this article assume that states are sampled with frequencies determined by {Q, R, p}. Under this assumption, it is a standard exercise to show that the expected outcome of the process, given that it is known to be in state i, is E{z|i} = [zT R(1 − Q)−1 ]i .
(2.2)
In the following discussion, I will use (T ) to denote the transpose operator in the space of observations and († ) to denote the transpose operator in the space of weights.
1406
Fernando J. Pineda
2.2 Mean Field Setting. Batched-TD(λ) has the form w(n+1) = w(n) + α (n+1) F(s(n+1) , w(n) ),
(2.3)
where (n+1)
F(s
(n)
,w
¯ T t ¯ X X ¯ t−τ )= ( fs(t+1) − fs(t) ) λ ∇ fs(τ ) ¯ ¯ t=1 τ =1
,
(2.4)
w=w(n)
and where, for notational simplicity, I have defined fs(t) = f (w, xs(t) ). f (., .) is a deterministic function of the nth input sequence s(n) and weight vector w(n) . The sequences are drawn at random from a distribution determined by the absorbing Markov process. Thus, at every point w in the weight space, there exists an ensemble of weight updates. Each weight update corresponds to a particular sequence. The nth expected weight update, averaged over the ensemble of sequences, is written as h1w(n) i = α (n+1) hF(s, w(n) )i ≡ α (n+1) F(w(n) ). The manifold of averaged weight update vectors defines a mean-field F on the parameter space. This mean field is completely determined by the statistical properties of the Markov process, by the approximation function f (.) and by the observables X. F induces a continuous-time mean-field dynamical system according to dw = F(w). dt
(2.5)
A discrete-time system can be derived from the continuous-time system by taking dw/dt → 1w/1t and defining 1t(n) = α (n) . The result is w(n+1) = w(n) + α (n+1) F(w(n) ).
(2.6)
Either the continuous or the discrete-time deterministic dynamics can be used to study batched-TD(λ). 2.3 Expected Values of Partial Sums. To calculate F(w) for batchedTD(λ), it is useful to characterize the mean and variance of partial sums of the form σ =
T t X X (g(xs(t+1) ) − g(xs(t) )) λt−τ h(xs(τ ) ), t=1
(2.7)
τ =1
where g(.) and h(.) are arbitrary functions of observation vectors. The first result below is an expression for the mean of the partial sum; the second result is an upper bound for the variance of the partial sum.
Mean Field Theory for Batched TD(λ)
1407
Result 1. Consider an absorbing Markov process determined by {Q, R, p, z, X}. Let g(.) and h(.) be arbitrary functions of the observation vectors. Moreover, let λ be a constant that satisfies 0 ≤ λ ≤ 1. Then, the expected value of the partial sum (see equation 2.7) is E{σ } = (−gT K(λ) + zT J(λ))h, where h is an nT -dimensional column vector whose ith component is hi = h(xi ), where gT is an nT -dimensional row vector whose ith component is gi = g(xi ), and where K and J are defined according to K(λ) ≡ (I−Q)(I−λQ)−1 D and J(λ) ≡ R(I − λQ)−1 D. The matrix D is a diagonal matrix whose ith elements are simply the average number of times that the ith state is visited in a sequence. Proof. See the appendix. Result 2. The variance E{σ 2 } is bounded by E{σ 2 } ≤ |h|2 (a + b|g| + c|g|2 ) where a, b, and c are positive constants. P P Proof. The second moment of σ = Tt=1 (g(xs(t+1) )−g(xs(t) )) tτ =1 λt−τ h(xs(τ ) ) has the structure X Gµν hσ (µ) · σ (ν) i, hσ 2 i = µ<ν≤3
where G is a function of λ and σ (µ) are partial sums defined in the appendix. Each of the six terms in hσ 2 i will have the general structure X X X µν µν µν hσ (µ) · σ (ν) i = hj hl Ai + gi hj hk Bijk + gi hj gk hl Cijkl ij
ijk
ijkl
where A, B, and C are proportional to a geometric power series in Q and λ. Each series will converge because limn→∞ Qn = 0 and because 0 ≤ λ ≤ 1. Application of the triangle and Schwartz inequalities immediately results in the required upper bound, where a, b, and c are positive constants that can be calculated from G, A, B, and C. The fact that a, b, and c are finite is a direct consequence of the facts that A, B, and C are defined by convergent power series in Q and λ and that the state space of the Markov process is finite. 3 Mean-Field Theory 3.1 General Representations. Starting from an arbitrary compact representation f (w, x), the mean field is computed from equation 2.4 by applying
1408
Fernando J. Pineda
result 1 with [g]i ≡ fi and [hj ]i ≡ ∂ fi /∂wj to each component of F(s(n+1) , w(n) ). The resulting continuous-time mean-field dynamical equation is dw = (−fT K(λ) + zT J(λ))∇ † f, dt
(3.1)
where w is an nw -dimensional column vector and ∇ = (∂/∂w1 , . . . , ∂/∂wnw ) is an nw -dimensional row vector operator. The transpose of ∇ in the weight space is a row vector operator denoted by ∇ † . This representation-independent dynamical equation can also be expressed in terms of E{z|i}. To this end, define the vector e whose ith component is [e]i = E{z | i}.
(3.2)
In terms of e, equation 3.1 takes the somewhat simpler form (see, e.g. Dayan, 1992), dw = (e − f)T K(λ)∇ † f. dt
(3.3)
Although the two forms of the dynamical equation are equivalent, we find it more convenient to use equation 3.1. From equation 3.1 we observe that the fixed points of the mean-field equations satisfy ¯ ¯ (−fT K(λ) + zT J(λ))∇ † f¯
w=w∗
= 0.
(3.4)
The local properties of the flow are characterized by the Jacobian matrix Hij ≡
∂Fi = [∇ † F† ]ij , ∂wj
(3.5)
whose explicit form is H = −(∇ † fT )K(λ)∇f + zT J(λ)∇ † ∇f,
(3.6)
where [∇ † ∇]ij ≡ ∂ 2 /∂wi ∂wj . If the Jacobian matrix, evaluated at the fixed point, is negative definite, then all the eigenvalues of H are negative, and the fixed point is locally stable. Moreover, if the Jacobian is everywhere negative definite, the flow is globally contracting, and there is a unique stable fixed point. For arbitrary representations, little can be said analytically concerning the properties of H. On the other hand, for linear representations, it is possible to say much more. In the next section the case of linear representations is investigated in more detail.
Mean Field Theory for Batched TD(λ)
1409
3.2 Linear Representations. In the case of linear representations, we have f = XT w, ∇f = XT , and ∇ † ∇f = 0. We assume nothing about the observation function Ä. Accordingly, the Jacobian simplifies to H(lin) ≡ −(∇ † fT )K(λ)∇f.
(3.7)
This leads immediately to result 3. Result 3. Let f = XT w. Then a sufficient condition for the flow of the TD(λ) mean-field equations to be globally contracting is Rank(X) = nw . Moreover, since the flow is globally contracting, there exists a unique stable fixed point. Proof. To prove the flow is globally contracting, it suffices to prove that H(lin) is negative definite. To prove that H(lin) is negative definite, recall that if an arbitrary matrix A in Rn×n is positive definite and an arbitrary matrix 3 in Rn×m has rank m, then 3T A3 in Rm×m is also positive definite (see, e.g., theorem 4.2.1 Golub & Van Loan, 1996). Thus, if the (nw × nT )-dimensional matrix ∇f has rank nw , then H(lin) = −(∇ † fT )K(λ)∇f is negative definite because K is positive definite (Dayan, 1992; Dayan & Sejnowski, 1994). Thus the flow is globally contracting. The fact that the flow is globally contracting allows us to apply the contraction mapping theorem to prove the existence and uniqueness of a fixed point. The proof of this result relies on the claim that the matrix K is positive definite. This condition is violated by learning algorithms that sample the state space with frequencies that are different from that of the underlying Markov process (see, e.g., Tsitsiklis & Van Roy, 1996b). This is not a problem in the present setting since the visitation frequencies are determined by the transition and prior probabilities {Q, R, p} of the to-be-predicted Markov process. The assumption that ∇f has rank nw is not very stringent since it is violated only if the row rank of X is less than nw . Such a circumstance is easy to concoct by choosing a malicious representation. (For example, Rank(X) = 1 if all the rows of X are identical.) On the other hand, a randomly selected linear representation will, with probability 1, have Rank(X) = nw . It is useful for numerical studies and for the convergence proof in the next section to write down the linear dynamical equations explicitly: dw = −A(λ)w + b(λ), dt
(3.8)
where A(λ) = XKT (λ)XT
(3.9)
1410
Fernando J. Pineda
and where b(λ) = XJT (λ)z.
(3.10)
The unique stable fixed point is w∗ = A−1 (λ)b(λ),
(3.11)
where, as we have already determined, the invertibility of A(λ) is a trivial consequence of the fact that A(λ) is positive definite. From expression 3.11 and the explicit forms of K and J, we can write down a closed-form expression for the prediction vector to which mean-field TD(λ) converges: f = XT w∗ = XT (XKT (λ)XT )−1 XJT (λ)z.
(3.12)
This expression is valid provided Rank(X) = nw . Finally, we consider what can be said if we assume the Q matrix is symmetric. In this case it is possible to write down a Lyapunov function for equation 3.8 since if Q is symmetric, then K is also symmetric and its eigen√ √ T values √ are real and positive. It follows that we can write KT = K K where K is a symmetric, √ positive definite and (nT × nT )-dimensional matrix. In terms of BT ≡ X K, the fixed-point equation becomes −(BT B)w + BT y = h1wi
(3.13)
where √ y ≡ ( K)−1 JT z
(3.14)
F(w) = −∂E/∂wT = −BT (Aw − y)
(3.15)
or
where the Lyapunov function, E, is given by E = (Bw − y)T (Bw − y).
(3.16)
4 Convergence with Probability 1 Stochastic convergence of batched linear-TD(λ), is proved by making use of a standard theorem (e.g. Benveniste, M´etivier, & Priouret, 1990). The theorem follows, stated in terms of the notation used in this article. Theorem 1. Consider the random process w(n) = w(n−1) + α (n) F(n) (w(n−1) , s(n) ). Let the history of this process be denoted by 0 (n) ≡ {{w(k) }k≤n , {α (k) }k≤n ,
Mean Field Theory for Batched TD(λ)
1411
{F(k) }k≤n }. The random process converges to zero w.p.1 provided the following assumptions hold: A.1 The state space of the process is finite. P P (n) 2 = ∞, n α (n) < ∞ . A.2 nα A.3 (w − w∗ )T F(w) < 0 for w 6= w∗ and F(w∗ ) = 0. A.4 Var{F(n) | s(n) , 0 (n−1) } ≤ β(1 + kw(n−1) − w∗ k)2 where β is some positive constant. A.1 and A.2 are satisfied by construction. A.3 follows from the fact that F(w) = −Aw + b and that A is positive definite. Finally, to show that A.4 is satisfied, we recall result 2: Var{F(n) | s(n) , 0 (n−1) } ≤ |∇ f (n−1) |2 (a + b| f (n−1) | + c| f (n−1) |2 ), which, upon specializing to the linear case, yields the upper bound Var{F(n) | s(n) , 0 (n−1) } ≤ (A + Bkw(n−1) k + Ckw(n−1) k2 ) for some positive constants A, B, and C. This immediately implies that A.4 is satisfied. To summarize, we have satisfied all the assumptions of theorem 1 and have thereby proved that batched linear TD(λ) converges w.p.1. It is also amusing to observe that because a randomly selected linear representation will, w.p.1, have Rank(X) = nw it follows that a randomly chosen linear representation converges w.p.1. 5 Numerical Solutions In this section, I numerically evaluate fixed points for the task of learning to predict the outcome of a random walk, and I verify that these fixed points agree with the results of actual TD(λ) learning trials. The system consists of a one-dimensional random walk with five transient states and two absorbing barriers. Associated with each transient state is a probability, p, of stepping to the right and a probability (1 − p) of stepping to the left. The states i = 0 and i = 6 are absorbing and are associated with outcomes z = 0 and z = 1 respectively. The heavy line in Figure 1 shows the expected outcome E{z | i} for each of the five transient states under the assumption that p = 0.35. The two dotted lines show the expected errors, E{z | xi }, obtained by linearTD(λ) under the assumption that the observation vectors are scalars defined according to xi = i. The left-most states are visited more often than the right-most states, so the algorithm weights these more heavily. This effect is greatest for the λ = 1 (Widrow-Hoff) case and not as strong for the λ = 0 case. Clearly, the linear representation does not represent the exact solution very well. The effect of different values of λ on the asymptotic least mean square (LMS) error is shown in Figure 2. The solid line is the analytic solution. The
1412
Fernando J. Pineda
Figure 1: The exact solution of the random walk for p = 0.35 is shown along with the analytically calculated mean field solutions.
plotted points are each averaged over ten learning trials. In each trial, the learning rate α (n) was annealed over 105 iterations from an initial value of 5 α (1) = 0.001 to a final value of α (10 ) = 0.000005. The analytically calculated fixed points and the solutions found by batched-TD(λ) were used to calculate the expected LMS error. The results demonstrate excellent agreement between the analytically calculated asymptotic error and the results of the learning trials. Software, written in C, for reproducing these results can be downloaded from http://olympus.ece.jhu.edu/∼fernando/. 6 Discussion and Conclusions The mean-field theory presented in this article is potentially a powerful approach for gaining insight into the generalization properties of linear and nonlinear TD(λ). Although little can be said analytically for nonlinear representations, it is clear that the nonlinear deterministic mean-field equations can be solved numerically for particular representations and for particular learning tasks. The transient properties of TD(λ) can be investigated by studying the trajectories of the deterministic iterates. Moreover, the methods used to calculate the mean field can be applied to the evaluation of the fluctuations about the mean. The calculation of fluctuation terms can be expected to lead to a more detailed understanding of the generalization properties of TD(λ). Indeed, studies of this nature have been performed recently by Singh and Dayan (1996). These authors computed analytic learning curves for the linear table look-up case. Their results show how judicious choice of
Mean Field Theory for Batched TD(λ)
1413
Figure 2: Expected error versus λ for the analytic solution and the empirically obtained solution.
λ can lead to transient solutions whose LMS error is systematically lower than the Widrow-Hoff (λ = 1) solution. Appendix This appendix shows how to average a particular form of partial sum over an ensemble of sequences generated by an absorbing Markov process. The ensemble is denoted by {s}. A randomly drawn sample from this ensemble is a sequence s that is T +1 states long. The length of the sequence is random. The tth state in the sequence is denoted by s(t). Thus s(1) is the first state, and s(T + 1) is the absorbing state. The partial sums I consider have the general form σ =
T t X X (g(xs(t+1) ) − g(xs(t) )) λt−τ h(xs(τ ) ), t=1
(A.1)
τ =1
where 0 ≤ λ ≤ 1 and g(x) and h(x) are arbitrary functions of observation vectors. The observation vector xs(t) is finite dimensional and represents an observation made when the Markov system is in state s(t). For notational simplicity, it is useful to use fα in place of f (xα ) where α is any subscript. Much of the development that follows makes extensive use of the Kronecker-δ symbol. Two Kronecker-δ symbols are defined. One applies to transient states, and one applies to absorbing states. In particular, ½ 1 if j = s(t) and t ≤ T t . (A.2) δj = 0 otherwise
1414
Fernando J. Pineda
This Kronecker-δ is simply a representation of the trajectory of a Markov chain through state space. It is equal to 1 if the system is in transient state s(t) at time t and zero otherwise. It has the following useful properties: T X
δjt = nj = the number of times state j occurs in s.
(A.3)
t=1 T X
δjt =
∞ X
t=1
X
j
δjt (since δt is zero if j 6∈ T)
(A.4)
t=1
δjt = 1
(A.5)
j∈T
(since the Markov process is in a unique state at each instant) X f (xs(t) )δjt = f (xj ).
(A.6)
j∈T j
The second Kronecker-δ symbol is denoted δ t and is defined by ½ 1 if absorbs in state s(j) at time t t δj = 0 otherwise.
(A.7)
It is zero except when the system is in the absorbing state at time T + 1. It has properties similar to the first Kronecker-δ. The Q matrix is the propagator for multistep transition probabilities. In particular Qjk is the probability of making a one-step transition from state k to state j, while [Qn ]jk is the probability of making a transition from state k to state j in exactly n steps. It is a standard textbook exercise to show that [(1 − Q)−1 ]jk is the probability of making a transition from state k to state j in any number of transitions. Armed with these preliminary notions, we begin the calculation by reorganizing the partial sum (see equation A.1) into the sum of three terms as follows: σ = σ (1) + (1 − λ)σ (2) + σ (3)
(A.8)
where σ (1) = −
T X
gs(t) hs(t)
(A.9)
t=1
σ (2) =
T−1 X τ =1
λτ −1
T X t=τ +1
gs(t) hs(t−τ )
(A.10)
Mean Field Theory for Batched TD(λ)
σ (3) = gs(T+1)
T X
1415
λT−τ hs(τ ) .
(A.11)
τ =1
We now proceed to calculate individually the average value of each of these partial sums. Calculation of hσ (1) i. One starts by rewriting equation A.9 in terms of the Kronecker-δ. The result is σ (1) = −
∞ X X
gj hj δjt = −
t=1 j∈T
X
gj hj
∞ X
δjt .
t=1
j
P j The expression ∞ t=1 δt is the number of times that the jth transient state appears in the sequence. Thus, to average this expression over the ensemble of finite Markov chains, we need to calculate the expected number of times that a transient state appears in sequence. This textbook calculation proceeds as follows: ∞ ∞ XX X hδjt i = {1 · [Qt−1 ]jk + 0 · (1 − [Qt−1 ]jk )}pk t=1
k∈T t=1
=
∞ XX
t−1
[Q
]jk pk =
k∈T t=1
X· k∈T
1 1−Q
¸ pk
(A.12)
jk
≡ dj . Accordingly, we obtain hσ (1) i = −
X
gj hj dj
(A.13)
j
Calculation of hσ (2) i. One first rearranges the order of the terms in equation A.10 so as to obtain σ (2) =
t−1 T X X
λτ −1 gs(t) hs(t−τ ) .
(A.14)
t=2 τ =1
This can be expressed in terms of the Kronecker-δ as σ (2) =
t−1 T X XX ij∈T t=2 τ =1
λτ −1 gi hj δit δjt−τ .
(A.15)
1416
Fernando J. Pineda
Now observe that the upper limit to the sum over t can be continued to ∞ because the Kronecker-δ is zero for t > T. Thus hσ (2) i is hσ (2) i =
t−1 ∞ X XX ij∈T t=2 τ =1
λτ −1 g(xi )h(xj )hδit δjt−τ i.
(A.16)
The expected value of the product of Kronecker-δ’s is X [Qτ ]ij [Qt−τ −1 ]jk pk . hδit δj1−τ i =
(A.17)
k∈T
If we substitute equation A.17 back into equation A.16, we obtain " # t−1 ∞ X X 1 X (2) τ t−τ −1 [(λQ) ]ij [Q ]jk . gi hj pk hσ i = λ ijk∈T t=2 τ =1
(A.18)
Notice that the summations over time are simply a geometric series of the form j−1 ∞ X X
ak b j−k−1 =
j=2 k=1
a . (1 − a)(1 − b)
(A.19)
Upon taking equation A.19 into account, we find that equation A.18 becomes " # ¸ X· ¸ · X 1 1 (2) Q pk gi hj hσ i = 1 − λQ ij k∈T 1 − Q jk ij∈T " ¸ # · X 1 Q dj . gi hj (A.20) = 1 − λQ ij ij∈T Calculation of hσ (3) i. Note that equation A.11 can be expressed in terms of Kronecker-δ’s as σ (3) = gs(T+1)
T X
λT−τ hs(τ ) =
X
τ =1
gi hj
i∈A i∈T
T X
T+1 t δj .
λT−τ δ i
(A.21)
t=1
Now, if we reorder the temporal summation in equation A.21 and continue the upper limit of the first sum to ∞, we find that hσ (3) i has the form hσ (3) i =
T ∞ X XX i∈A i∈T
T=1 t=1
T+1 t δj i.
gi hj λT−t hδ i
(A.22)
Mean Field Theory for Batched TD(λ)
1417
By applying the same arguments as used to calculate equation A.17, we find that T T X ∞ X ∞ X X X T+1 T+1 hλT−t δ i δjt i = hλT−t δ i δjt ipk T=1 t=1
T=1 t=1 k∈T T ∞ X X X = [R(λQ)T−1 ]ij [Qt−1 ]jk pk . T=1 t=1
(A.23)
k∈T
As before, the double temporal sums are a geometric series. This time the form is j ∞ X X
a j−k bk−1 = (1 − a)−1 (a − b)−1 .
(A.24)
j=1 k=1
Substituting equation A.24 into equation A.23, we obtain T ∞ X X
T+1 t δj }
E{λT−t δ i
· = R
T=1 t=1
·
1 1 − λQ
1 = R 1 − λQ
¸ X· ¸
ij k∈T
1 1−Q
¸ pk jk
dj . ij
So hσ (3) i becomes hσ (3) i =
XX
· gi hj R
i∈A j∈T
1 1 − λQ
¸ dj .
(A.25)
ij
The final result is obtained by substituting equations A.13, A.20, and A.25 into A.8. The result is hσ i = −
X
gi K(λ)ij hj +
XX
zi J(λ)ij hj ,
(A.26)
dj
(A.27)
i∈A j∈T
ij∈T
where ¸
·
1 Q K(λ)ij = I − (1 − λ) 1−λ and where
· J(λ)ij ≡ R
1 1 − λQ
¸ dj . ij
ij
1418
Fernando J. Pineda
Acknowledgments I thank Philippe de Forcrand and Sebastian Seung for insightful comments and suggestions at Neural Networks for Physicists 5. I thank Tommi Jaakkola for educational correspondence. Peter Dayan also educated me and brought to my attention the recent work of Tsitsiklis and Van Roy. Finally, I thank Ben Van Roy and an anonymous reviewer for pointing out an error in the stochastic convergence proof in section 4. References Benveniste, A., M´etivier, M., & Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Berlin: Springer-Verlag. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific. Dayan, P. (1992). The convergence of TD(λ) for general lambda. Machine Learning, 8, 341–362. Dayan, P., & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press. Gordon G. J. (1996). Stable fitted reinforcement learning. In G. Tesauro, D. Touretzky, & T. Lean (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference (pp. 1052–1058). Cambridge, MA: MIT Press. Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185– 1201. Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer-Verlag. Pineda, F. J. (1995, August 23–26). Generalization in TD(λ). Theoretical Physics Institute, University of Minnesota. Singh, S. P., & Dayan, P. (1996). Analytical mean squared error curves in temporal difference learning. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in Neural Information Processing Systems, 9. Cambridge, MA: MIT Press. Singh, S. P., Jaakkola, T., & Jordan, M. (1995). Reinforcement learning with soft state aggregation. In G. Tesauro, D. Touretzky, & T. Lean (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference (pp. 361–368). Cambridge, MA: MIT Press. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44. Tesauro, G. (1992). Practial issues in temporal difference learning. Machine Learning, 8, 257–277. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38, 58–68. Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Mean Field Theory for Batched TD(λ)
1419
Tsitsiklis, J. N., & Van Roy, B. (1996a). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59–94. Tsitsiklis, J. N., & Van Roy, B. (1996b). An analysis of temporal-difference learning with function approximation. (Tech. Rep. No. LIDS-P-2322). Cambridge, MA: MIT Laboratory for Information and Decision Systems.
Received May 6, 1996; accepted November 7, 1996.
LETTERS
Communicated by Shun-ichi Amari
Redundancy Reduction and Independent Component Analysis: Conditions on Cumulants and Adaptive Approaches Jean-Pierre Nadal Laboratoire de Physique Statistique, Ecole Normale Sup´erieure, Paris, France
Nestor Parga Departamento de F´ısica Te´orica, Universidad Aut´onoma de Madrid, Canto Blanco, Madrid, Spain
In the context of both sensory coding and signal processing, building factorized codes has been shown to be an efficient strategy. In a wide variety of situations, the signal to be processed is a linear mixture of statistically independent sources. Building a factorized code is then equivalent to performing blind source separation. Thanks to the linear structure of the data, this can be done, in the language of signal processing, by finding an appropriate linear filter, or equivalently, in the language of neural modeling, by using a simple feedforward neural network. In this article, we discuss several aspects of the source separation problem. We give simple conditions on the network output that, if satisfied, guarantee that source separation has been obtained. Then we study adaptive approaches, in particular those based on redundancy reduction and maximization of mutual information. We show how the resulting updating rules are related to the BCM theory of synaptic plasticity. Eventually we briefly discuss extensions to the case of nonlinear mixtures. Throughout this article, we take care to put into perspective our work with other studies on source separation and redundancy reduction. In particular we review algebraic solutions, pointing out their simplicity but also their drawbacks. 1 Introduction Recently many studies have been devoted to the study of sensory coding following a general framework initiated by H. Barlow more than thirty years ago (Barlow, 1961). The general idea is to define a cost function based on the properties one thinks a neural code should satisfy. Then, given a neural architecture with a simple enough neuron model, one derives the parameters of the network (synaptic efficacies, transfer functions, etc.) that minimize this cost function. A great deal of work has been done on cost functions based on information-theoretic criteria (Barlow, Kaushal, & Mitchison, 1989; Linsker, 1988; Atick, 1992). The result for the receptive fields will crucially Neural Computation 9, 1421–1456 (1997)
c 1997 Massachusetts Institute of Technology °
1422
Jean-Pierre Nadal and Nestor Parga
depend on the statistical properties of the signal to be processed (visual scenes, olfactory stimuli, etc.). Several cost functions have been proposed. One is based on Barlow’s original idea (Barlow, 1961), which is that redundancy reduction should be performed. This leads to the notion of factorial code (Barlow et al., 1989; Redlich, 1993): in the pool of neurons defining the neural code (the output neurons), each neuron should code for features statistically independent of the features encoded by the other neurons. Redundancy reduction has been explored in detail in the case of visual processing, with a linear neural network model (Atick, 1992). A different proposal has been promoted by Linsker (1988) under the name of the infomax principle: one asks the network to maximize the mutual information between the input (the signal received onto the receptors) and the output (the neural code). Again, most studies have been performed for linear networks in the context of visual processing (Linsker, 1988; van Hateren, 1992; Del Giudice, Campa, Parga, & Nadal, 1995). The predictions of these two strategies, redundancy reduction and maximization of mutual information, appeared to be very similar. In fact, we have shown (Nadal & Parga, 1994) that in the low synaptic noise limit with nonlinear outputs, infomax implies redundancy reduction, when optimization is done over both the synaptic efficacies and the nonlinear transfer functions. As a result, the optimal neural representation is a factorial code, for both cost functions in the limit of low processing noise. When noise is present, it can easily be shown (Atick, 1992) that pure redundancy reduction is not a meaningful strategy. Essentially, an efficient code is one that gives the independent features but add some redundancy to compensate for the noise. This can be studied in detail for a linear neural network (Del Giudice et al., 1995), where one can find, for instance, the number of meaningful principal components at a given noise level. In the case of signal processing and data analysis, however, the processing noise can indeed be neglected. It is then interesting to search for algorithms able to perform redundancy reduction and to determine the neural architectures best adapted to a given type of signal. Of particular interest is the case where a factorial code can be found by a feedforward network with no hidden layer. This implies that there exists a linear combination of the input that produces a factorial code. This arises when the input (the signal to be processed) itself is a linear mixture of statistically independent sources. Then finding the synaptic efficacies leading to a factorial code is equivalent, in the signal processing language, to finding the linear filter performing blind source separation (BSS). Since the pioneering work of Bar-Ness (1982) and Jutten and Herault (1991), interest in source separation has increased considerably (Comon, 1994; Cardoso, 1989; Molgedey & Schuster, 1994). The link between signal processing and sensory coding has been present in this field from the beginning. Jutten and Herault proposed a now wellknown heuristic for performing BSS using a neuromimetic architecture for analyzing a signal coming from muscles. Experience gained with the study of this algorithm led to new approaches to the study of the olfactory system
Redundancy Reduction and Independent Component Analysis
1423
(Hopfield, 1991; Rospars & Fort, 1994). On the neuropsychological side, it is involved in the “cocktail party effect” (being able to focus on a single speaker, separating his or her speech from all the ambient noise). In signal processing, BSS can be used for making cancellers, that is, systems for subtracting the noise from the signal (Bar-Ness, 1982). In fact, BSS has a wide variety of applications in data processing (see, e.g., Deville & Andry, 1995). In speech and sound processing, there is an additional complication: the signal is not simply a linear superposition of sources but also a convolution (Jutten & Herault, 1991). In this article, however, we will not consider the case of blind source deconvolution. While working on linear mixtures of sources, researchers soon realized that finding independent components might be a useful strategy for data processing. The term independent component analysis (ICA) was then promoted as a generalization of BSS to nongaussian data and as an alternative to the standard principal component analysis (PCA) (Jutten & Herault, 1991; Comon, 1994). To one of his papers on BSS, Comon (1994) has given the title: “Independent Component Analysis: A New Concept?” The answer to this question is no, since this concept was proposed by Barlow at the beginning of the 1960s, under the name of factorial coding. In fact, even earlier, Attneave (1954) had addressed it, although in a less precise way, under the name of economy of coding. Still, many years after Barlow suggested that factorial coding could be a major strategy used by nature for sensory coding, engineers arrived at the same conclusion in the context of signal analysis. In the same vein, it has been shown (Comon, 1994) that there is a particularly well-chosen cost function for BSS, and this special cost function is nothing but the redundancy reduction cost function mentioned above in the context of sensory coding. This convergence in the definition of relevant concepts may not seem surprising, but this was not always the case. In particular, Barlow (1961) put forth some stimulating discussions opposing nature’s and engineers’ approaches to coding. In the case of the BSS framework, the hypothesis is that linear processing is able to produce a factorial code. The main goal in BSS is then to find efficient algorithms in order to compute the filter (or synaptic efficacies) that leads to source separation. Besides heuristic algorithms such as the HeraultJutten (HJ) algorithm (Jutten & Herault, 1991), there are adaptive and gradient descent algorithms based on appropriate cost functions (Bar-Ness, 1982; Burel, 1992; Comon, 1994; Laheld & Cardoso, 1994; Bell & Sejnowski, 1995; Delfosse & Loubaton, 1995; Amari et al., 1996), and algebraic solutions (F´ety, 1988; Cardoso, 1989; Tong, Soo, Liu, & Huang, 1990; Molgedey & Schuster, 1994; Shraiman, 1993) in which the filter is algebraically derived from a particular set of measurements on the data. In this article we present new results on BSS. We deal with different mathematical aspects of BSS, keeping in mind the possible application to sensory systems modeling and the general framework of factorial coding and redundancy reduction. Because there is a widespread literature on this
1424
Jean-Pierre Nadal and Nestor Parga
subject that belongs to both sensory coding and signal processing, we found it necessary to put our contribution into perspective with previous works on source separation and redundancy reduction. In fact we will use a general framework—the reduction to the search for an orthogonal matrix (explained in section 2)—for both deriving the new results and presenting related work, sometimes in a simpler way as compared to the original formulation. In section 3, we give necessary and sufficient conditions for the network to be a solution of the ICA. These conditions are that a given small set of cumulants have to be set to zero. The number of cumulants that enter in any of these conditions is smaller than what is usually found in the literature. In section 4, we present a short review of known algebraic solutions, making use of the same framework. In the next two sections we deal with adaptive approaches. In section 5, after a reminder of the relationship between maximizing mutual information and minimizing redundancy (hence, building a factorial code), we discuss several possible approaches based on these information-theoretic concepts. We consider one of them in more detail and show how the resulting updating rules are related to the BCM theory of synaptic plasticity (Bienenstock, Cooper, & Munro, 1982). In section 6, we briefly discuss adaptive algorithms from cost functions based on cumulants, these costs being built from the conditions derived in section 3. In section 7 we also briefly discuss possible extensions to nonlinear processing. Perspectives are given in section 8. 2 Linear Mixtures of Independent Sources 2.1 Statement of the Problem. Here and in the rest of the article as well, we consider the case where the factorization problem has a priori an exact solution using a linear filter or a feedforward neural network with no hidden units (we will comment, whenever appropriate, on the possible extensions to cases where a multilayer network would be needed). This means that we are assuming the input data to be a linear mixture of independent sources. More precisely, we assume that at each time t, the observed data S(t) are an N-dimensional vector given by Sj (t) =
N X
Mj,a σa (t), j = 1, . . . , N,
(2.1)
a=1
(in vector form S = Mσ ) where the σa are N-independent random variables, of unknown probability distributions, and M is an unknown, constant, N×N matrix, called the mixture matrix. By hypothesis, all the source cumulants are diagonal, in particular the two-point correlation at equal time K0 : 0 ≡ hσa (t)σb (t)ic = δa,b Ka0 , Ka,b
(2.2)
where δa,b is the Kronecker symbol. Here and throughout this article hz1 z2 , . . . , zk ic denotes the cumulant of the k random variables z1 , . . . , zk .
Redundancy Reduction and Independent Component Analysis
1425
Without loss of generality, one can always assume that the sources have zero average: hσa i = 0, a = 1, . . . , N
(2.3)
(otherwise one has to estimate the average of each input and subtract it from that input). Performing source separation (or, equivalently, factorizing the output distribution) means finding the network couplings or synaptic efficacies J (the linear filter J in the language of signal processing), such that the Ndimensional (filter, or network) output h, hi (t) =
N X
Ji,j Sj (t), i = 1, . . . , N,
(2.4)
j=1
gives a reconstruction of the sources. Ideally, one would like to have J = M−1 . However, as it is well known and clear from the above equations, one can recover the sources only up to an arbitrary permutation and up to a signscaling factor for each source. Nothing tells us which output should be equal to which source. One cannot distinguish Mσ from M0 σ 0 with M0 = MΛ and σ 0 = Λ−1 σ , where Λ is any diagonal matrix having nonzero diagonal elements. In particular, one can fix the absolute value of the arbitrary diagonal 1 matrix by asking for J to be the inverse of MK0 2 , which is equivalent to stating that all we can expect is to estimate normalized cumulants of the sources, such as the k-order normalized cumulants ζka : ζka ≡
hσak ic k/2
hσa2 ic
, a = 1, . . . , N.
(2.5)
Any solution J that one may find will thus be equal to the inverse of 1 MK0 2 up to what we will call a sign permutation, that is, the product of a permutation by a diagonal matrix with only ±1 diagonal elements. As it is usually done in the study of source separation, we have assumed that the number of sources is known (we have N observations—e.g., N captors, for N independent sources), and we assume M to be invertible. The difficulty comes from the fact that the statistics of the sources are not known; the mixture matrix is not known and is not necessarily (and in general it is not) an orthogonal matrix. 2.2 Reducing the Problem to Finding an Orthogonal Matrix. An elementary but extremely useful remark, first made by Comon (1994) and Cardoso (1989), is that the search for a solution J can be reduced to the search for an orthogonal matrix. Indeed, the J we are looking for has to
1426
Jean-Pierre Nadal and Nestor Parga
diagonalize the two-point connected correlation of the inputs. As is well known, the diagonalization of a symmetric positive matrix defines a matrix only up to an arbitrary orthogonal matrix. Hence, performing first whitening will leave us with an orthogonal matrix, which has to be determined from higher cumulants. Because we will make extensive use of this fact, and in order to introduce our notation, we give a detailed derivation. Let us say that J sets the variance output to the N × N identity matrix 1N : hhhT ic = 1N .
(2.6)
This reads, in term of the correlation C0 between the input data,
C0 ≡ hSST ic = MK0 MT
(2.7)
JC0 JT = 1N .
(2.8)
as
Keep in mind that C0 is what one can measure, and the right-hand side of equation 2.7 is its expression in function of the unknowns. Every solution of equation 2.8 can be written as J = Ω C0 −1/2 ,
(2.9)
where Ω is an orthogonal matrix:
Ω ΩT = 1N .
(2.10)
The inverse of the square root of C0 is defined from the PCA of the real positive matrix C0 . If we denote by J 0 the matrix whose rows are the orthonormal eigenvectors of C0 , and by Λ0 the diagonal matrix of the associated eigenvalues, one has
J 0 C0 J 0T = Λ0 J 0 J 0T = 1N
(2.11)
and −1
C0 −1/2 = J 0T Λ0 2 J 0 .
(2.12)
Now suppose that one first projects the input data onto the normalized principal components, that is, one computes h0 defined by −1
h0 = Λ0 2 J 0 S.
(2.13)
Redundancy Reduction and Independent Component Analysis
1427
Then, taking J as in equation 2.9 and replacing C0 −1/2 by its expression (see equation 2.12), h can be written as −1
h = ΩJ 0T Λ0 2 J 0 S = ΩJ 0T h0 .
(2.14)
Since O ≡ ΩJ 0T is again an orthogonal matrix, finding J means finding an orthogonal matrix O such that the output h = O h0
(2.15)
is a vector of N mutually independent components. This means equivalently that going from S to h0 leads to the problem of separating an orthogonal mixture. Let us check that h0 is indeed an orthogonal mixture of the sources. First we write, in terms of the unknowns, that the rows of J 0 are the eigenvectors of C0 :
J 0 MK0 MT J 0T = Λ0 . −1
(2.16) 1
This implies that Λ0 2 J 0 MK0 2 is an (unknown) orthogonal matrix, which, for later convenience, we define as the transposed of some orthogonal matrix O0 : −1
1
Λ0 2 J 0 MK0 2 = O0T O O 0
0T
(2.17)
= 1N .
Hence, the projected data h0 as defined in equation 2.13 can be expressed, in function of the unknown sources, as an orthogonal mixture: h0 (t) = O0T K 0 −1/2 σ (t).
(2.18)
Eventually, from equations 2.15 and 2.18, one can also write h in terms of the sources as h = X K 0 −1/2 σ ,
(2.19)
where X is the orthogonal matrix OO0T . Clearly, the desired solution is, up to a sign permutation, O0 for O or, equivalently, 1N for X. In reducing the source separation problem to the search for an orthogonal matrix, we have seen that one can choose different, although equivalent, formulations. Each may be specifically useful depending on the particular approach. In section 3, where we will determine the family of couplings J
1428
Jean-Pierre Nadal and Nestor Parga
solution of a given set of equations, it will be convenient to use equation 2.19, taking X as unknown. For the algebraic approach in section 4, we will use only the quantities related to the principal components, that is, h0 and O0 with the basic equation 2.18. And in section 6 we will consider adaptive algorithms for computing either J itself or O as defined in equation 2.15. Throughout this article, we will mainly consider the ideal situation where one would have access to the exact values of the cumulants. Nevertheless, we can make some general remarks about the practical cases, where cumulants are computed empirically (e.g., as time averages performed over some large enough time window T). In practice, all that is required implicitly is to have for the sources small enough cross-cumulants when these cumulants are defined from the chosen averaging procedure. This has to be true only for the cumulants needed to compute in a given approach. We will not address the problem of the accuracy of the solution as a function of the quality of the estimation of cumulants. In many sensory coding and data analysis problems, PCA is a natural thing to do (clearly for close to gaussian data, but also for much more complex cases, as in the analysis of time series). In all cases, one can go beyond the strict PCA, using the freedom in the choice of the orthogonal matrix Ω. An important example is in the application of redundancy reduction to the modeling of the first stages in the visual system (Linsker, 1988; Atick, 1992; Li & Atick, 1993). There, after whitening (in fact, an extension of whitening that takes noise processing into account), this freedom is used to satisfy as much as possible some additional constraints, such as locality of receptive fields and invariance by dilatation. As a result, one finds (Li & Atick, 1994) a block-diagonal orthogonal matrix, which leads to a multiscale representation of the visual scenes.
3 Criteria Based on Cross-Cumulants of Output Activities This section presents two new necessary and sufficient conditions for J to be a solution of the ICA. These conditions are that a given set of cumulants, associated with correlations at equal time, is set to zero. For comparison we will also give other conditions based on correlations at equal time and on time correlations related to known algebraic solutions. Eventually we will comment on the relationship and differences with other criteria proposed in the literature.
3.1 Condition on a Set of Nonsymmetric Higher Cumulants. The claim is that for J to be a solution, it is sufficient (and necessary) that J performs whitening and sets altogether to zero a given set of cross-cumulants of some given order k, the number of which being only of order N2 . More precisely, we have the following theorem:
Redundancy Reduction and Independent Component Analysis
1429
Theorem 1. Let k be an odd integer at least equal to 3 for which the k-cumulants of the sources, ζka , a = 1, . . . , N, are not identically null; then: (i) If at most one of these k-cumulants is null, J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , ½
hhi hi0 ic = δi,i0 hh(k−1) hi0 ic = 0 for i 6= i0 . i
(3.1)
where h is the output vector as defined in equation 2.4. (ii) If only 1 ≤ L ≤ N − 2 k-order cumulants are nonzero, then any solution J of equation 3.1 is the product of a sign permutation by a matrix that separates the L sources having nonzero k-cumulants, and such that the restriction 1 of J MK0 2 to the space of the N − L other sources is still an arbitrary (N − L) × (N − L) orthogonal matrix. hi0 ic Note that the second line in equation 3.1 means that the matrix hh(k−1) i is diagonal but the values of the diagonal terms, to be named below 1i , need not be known in advance. The detailed proof is given in appendix A. Here we give only a sketch of it, in the case where all the k-cumulants are nonzero. First, solving for the diagonalization of the second-order cumulant, one uses the representation in terms of orthogonal matrices as in section 2. As a result, we have seen that one can write h as h = X K 0 −1/2 σ , where X is the unknown orthogonal matrix. Replacing h by this expression in the higherorder cumulants, one then uses several times the fact that X is orthogonal. In particular, because we are considering for each pair i, i0 a cumulant that is linear in hi0 , one can obtain an equation for each Xi,a separately. As a result, for each pair i, a, either Xi,a is zero or factorizes into the product of two terms, one depending only on i and one depending only on a (namely, the inverse of the cumulant ζka ). From the condition that XXT has zero offdiagonal elements, one then gets that X has on each row and each column a unique nonzero element, and this implies that X is a sign permutation. In the case of N − L ≤ N null cumulants, for a typical solution L outputs will then appear as still correlated with one another, yet uncorrelated with the other N − L outputs. One then has to try another value of k, but only for the subspace of yet unseparated sources. As expected, the conditions of application of the theorem are not satisfied, for any k, in the case of gaussian sources—for which all the cumulants of order higher than 2 are zero, and nothing more than whitening can be done. For k even, one can easily find an example showing that the conditions (see equation 3.1) are not sufficient (see appendix A).
1430
Jean-Pierre Nadal and Nestor Parga
An interesting application of this theorem concerns the adaptive algorithm in Jutten and Herault (1991). In its simplest version, this algorithm aims at setting to zero the two-point correlation and the cross-cumulants hh2i hi0 ic for i 6= i0 . If the algorithm does reach that particular fixed point, theorem 1 asserts that full source separation has been obtained. It is not difficult to find other families of cumulants of a given order k for which a similar theorem will hold. We illustrate this by giving an analogous result for a set of cumulants involving more indices. Theorem 2. Let k and m be two integers with m at least equal to 2 and k strictly greater than m, for which the k-cumulants of the sources, ζka , a = 1, . . . , N, are not identically null; then (i) If at most one of these k-cumulants is null, J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , i00 , ½
hhi hi0 ic = δi,i0 hh(k−m) hm−1 hi00 ic = 0 for at least two nonidentical indices, i i0
(3.2)
where h is the output vector as defined in equation 2.4. (ii) If only 1 ≤ L ≤ N − 2 k-order cumulants are nonzero, then any solution J of equation 3.2 is the product of a sign permutation by a matrix that separates the L sources having nonzero k-cumulants and such that the restriction 1 of J MK0 2 to the space of the N − L other sources is still an arbitrary (N − L) × (N − L) orthogonal matrix. The proof is given in appendix B. In theorem 2, the cases k = m and m = 1 are excluded so that equation 3.2 never reduces to the conditions in equation 3.1 of theorem 1. In fact, these conditions are a subset of the conditions in equation 3.2, corresponding to the cases i = i0 . However, theorem 2 is not exhausted by theorem 1: no condition on the parity of k (the order of the cumulants) is required for the second theorem. Theorem 2 shows that only a subset of all the k-order cumulants with three indices is enough in order to obtain separation. The case m = 2 is related to the joint diagonalization approach of Cardoso and Souloumiac (1993), which we will discuss in section 4, where algebraic solutions are presented. 3.2 Conditions Related to Algebraic Solutions. In the above approach, one has N2 +N unknowns (the mixture matrix elements and the k-cumulants of the sources), and we use more equations, namely, N(N + 1) + N2 2
Redundancy Reduction and Independent Component Analysis
1431
equations in the case of Theorem 1. Coming back to our result, the counting argument suggests that one should be able to separate the sources with fewer conditions, that is exactly N2 + N. One possibility would have been that diagonalizing the second-order cumulants and the symmetric part of the k-order cumulants in equation 3.1 would do. This is not the case: taking N = 2, one can easily check that in doing so, unwanted solutions appear. However, looking at the derivation of Theorems 1 and 2, one sees that it results from the fact that we are using cumulants linear in at least one of the hi . One has thus to try cumulants that are symmetric in i, i0 and linear in both hi and hi0 . This appears to be easy, and one gets another family of sufficient conditions with exactly N2 + N conditions for N2 + N unknowns. An additional advantage is that because of the linearity in hi , hi0 , one can solve the equations algebraically. In fact, these conditions appear to be related to (or to the extension of) known algebraic solutions, which we review in section 4. There is a price to pay for working with the minimal number of equations: the restriction that we had in Theorems 1 and 2, namely, that of having nonzero k-order statistics for the sources is replaced here by the much more restricting condition of having different cumulants, as stated below. 3.2.1 Conditions on Symmetric Higher Cumulants. ditions based on correlations at equal time.
We first consider con-
Theorem 3. Let all the cumulants ζ4a , as defined in equation 2.5 for k = 4 be different from one another; then J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , ½ hhi hi0 ic = δi,i0 PN (3.3) 2 0 hhi i00 =1 (hi00 ) hi0 ic = 0 for i 6= i . where h = {hi , i = 1, . . . , N} is the output vector as defined in equation 2.4. If one or several sets of sources have a comon value for ζ4a (but different from one set to the other), then any J solution of equation 3.3 is the product of a sign permutation by a matrix that separates the sources having a distinct 4-cumulant, 1 and separate the sets globally. The restriction of J MK0 2 to the subspace of a given set is still an arbitrary orthogonal matrix. The elementary proof is short enough to be given here. As before, we first solve the diagonalization of the two-point correlation and express h in term of the unknown orthogonal matrix X. The higher-order cumulants in equation 3.3 can then be expressed as + ( ) * N X X X 2 2 (hi00 ) hi0 = Xi,a (Xi”,a ) (3.4) ζ4a Xi0 ,a . hi i00 =1
c
a
i”
1432
Jean-Pierre Nadal and Nestor Parga
But since X is orthogonal, * hi
N X
+ 2
(hi00 ) hi0
=
i00 =1
P
2 i” (Xi”,a )
X
= 1, and equation 3.4 reduces to
Xi,a ζ4a Xi0 ,a .
(3.5)
a
c
What we want is to impose * hi
N X
+ 2
(hi00 ) hi0
i00 =1
= 1i δi,i0
(3.6)
c
where the 1i are yet indeterminate constants. Comparing equations 3.5 and 3.6, one sees that X gives the eigendecomposition of a matrix, which is in fact already diagonal. This implies that if all the eigenvalues, that is the ζ4a , are distinct, X is a sign permutation. Otherwise X is, up to a sign permutation, the identity matrix on the subspace corresponding to the distinct eigenvalues, and an arbitrary `×` orthogonal matrix on each subspace associated with an eigenvalue of degeneracy `. The algebraic solution associated with Theorem 3 was proposed by Cardoso (1989) (see section 4.3). P Finally, we note that in the above statement, one may replace i h2i by any Q(h) being a scalar depending polynomially on h (and by extension any analytical function of h), such that for any orthogonal matrix X, denoting its rows by Xa , a = 1, . . . , N, the N numbers hQ(σa Xa )σa2 ic , a = 1, . . . , N are distinct (here every σa should be understood as the normalized source √ σa 2 ). In practice, it may not be easy to prove that such a condition holds, hσa ic P even for the simplest cases, say, Q(h) = i hki with k even at least equal to 4. 3.2.2 Conditions on Time Correlations. We consider now conditions related to algebraic solutions based on time correlations (F´ety, 1988; Tong et al., 1990; Belouchrani, 1993; Molgedey & Schuster, 1994). These are extremely simple. If the sources present time correlations, namely, the two-point correlation matrix K(τ ) for some time delay τ > 0, K(τ )a,b ≡ hσa (t) σb (t − τ )ic
(3.7)
has nonzero diagonal elements: K(τ )a,b = δa,b Ka (τ ),
(3.8)
then one can state: Theorem 4. Let all the cumulants Ka (τ ), as defined in equation 3.8 be different from one another; then J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above) if and only if one has:
Redundancy Reduction and Independent Component Analysis
1433
for every i, i0 , ½
hhi (t)hi0 (t)ic = δi,i0 hhi (t) hi0 (t − τ )ic = 0 for i 6= i0
(3.9)
where h = {hi , i = 1, . . . , N} is the output vector as defined in equation 2.4. If one or several sets of sources have a comon value for Ka (τ ) (but different from one set to the other), then any J solution of equation 3.9 is the product of a sign permutation by a matrix that separates the sources having a distinct Ka (τ )1 cumulant, and separate the sets globally. The restriction of J MK0 2 to the subspace of a given set is still an arbitrary orthogonal matrix. The proof is essentially the same as the one of Theorem 3. One writes hhi (t) hi0 (t − τ )ic =
X
Xi,a Ka (τ ) Xi0 ,a .
(3.10)
a
What we want is to impose hhi (t) hi0 (t − τ )ic = 1i δi,i0
(3.11)
where the 1i are yet indeterminate constants. Equations 3.10 and 3.11 play the same role as equations 3.5 and 3.6 in the case of Theorem 3, and the rest of the proof follows in the same way. Averages are conveniently computed as time averages over a time window of some size T. In general, it could happen that the two-point correlation Ka (τ ) is a function of the particular time window [t − T, t] considered. However, if, on that time window, the sources are sufficiently independent (Ka (τ ) is diagonal), since the mixture matrix does not depend on time, the solution J obtained from the data collected on that single time window is an exact, time-independent solution. 3.3 Comparison with Other Criteria. To conclude this section, we comment briefly on other criteria, namely, those proposed by Comon (1994). This author has shown that ICA is obtained from maximizing the sum of the square of k-order cumulants, for any chosen k > 2. That is, it is sufficient to find an (absolute) minimum of
C≡−
N X h(hi )k i2c
(3.12)
i=1
for any given k > 2, after whitening has been performed. This is a nice result since it involves only N cumulants. However, defining a cost function is not exactly the same as giving explicit conditions to the cumulants (although
1434
Jean-Pierre Nadal and Nestor Parga
one can easily derive a cost function from these conditions, as we will do in section 6). In fact, the minimization of a cost function such as equation 3.12 does involve implicitly the computation of order N2 cumulants, as it should (see the counting argument at the begining of this subsection). This can be seen in two ways. First, since the value of the k-cumulants at the minimum is not known, the minimization of C given in equation 3.12 is not equivalent to a set of equations for these cumulants; then, if one wants to perform a gradient descent, one has to take the derivative of the cost function with respect to the couplings J, and this will generate cross cumulants. It is the resulting fixed-point equations that then matter in the counting argument (we will come back to the algorithmic aspects in section 6). Second, Comon showed also that the minimization of equation 3.12 is equivalent to setting to zero all the nondiagonal cumulants of the same order k (still in addition to whitening): hhi1 hi2 , . . . , hik ic = 0, for every set of k nonidentical indices.
(3.13)
Hence, what we have obtained is that only a small subset of these cumulants has to be considered. 4 Algebraic Solutions In this section we present four families of algebraic solutions. We point out their advantages and drawbacks insisting on their simplicity when formulated within the framework described in section 2 and relating them to the results of section 3. 4.1 Using Time-Delayed Correlations. In the case where each source σa shows time correlations, it has been shown (F´ety, 1988; Tong et al., 1990; Belouchrani, 1993; Molgedey & Schuster, 1994) that there is a simple algebraic solution using only second-order cumulants. More precisely, let us assume that the two-point correlation matrix K(τ ) for some time delay τ > 0, K(τ )a,b ≡ hσa (t) σb (t − τ )ic
(4.1)
has nonzero diagonal elements: K(τ )a,b = δa,b Ka (τ ).
(4.2)
It follows (Molgedey & Schuster, 1994) that source separation is obtained by asking for the rows of J to be the left eigenvectors of the following nonsymmetric matrix C(τ ) C0 −1 ,
(4.3)
Redundancy Reduction and Independent Component Analysis
1435
where C(τ ) is the second-order cumulant of the inputs at time delay τ : C(τ ) = hS(t) ST (t − τ )ic .
(4.4)
The equivalent, but more natural, aproach of F´ety (1988) and Tong et al. (1990) is to work with a symmetric matrix, making use of the reduction to the search for an orthogonal matrix presented in section 2. Indeed, let us first perform whitening (note that this implies the resolution of the eigenproblem for C0 , a task of the same complexity as the computation of its inverse, which is needed in equation 4.3). We then compute the correlations at time delay τ of the projected inputs h0 (as given by equation 2.13), that is the matrix hh0 (t) h0T (t − τ )ic . Using the expression equation 2.18 of h0 as a function of the sources, one sees that this matrix is in fact symmetric and given by: hh0 (t) h0T (t − τ )ic = O0T K 0 −1/2 K (τ ) K 0 −1/2 O0 .
(4.5)
This shows that the desired orthogonal matrix is obtained from solving the eigendecomposition of this correlation matrix (see equation 4.5). Remark. In thisRsection, averages might be conveniently time averages, such as hA(t)i = dt0 A(t − t0 ) exp(−t0 /T). In that case, τ has to be small compared to T in such a way that, for example, K0 is the same when averaging on t0 < t and on t0 < t − τ . 4.2 Using Correlations at Equal Time. Shraiman (1993) has shown how to reduce the problem of source separation to the diagonalization of a certain symmetric matrix D built on third-order cumulants. We refer readers to Shraiman (1993) for the elegant derivation of this result. Here we will show how to derive this matrix from the approach introduced in sections 2.2 and 3. We will do so by working with the k-order cumulants, giving a generalization of Shraiman’s work to any k at least equal to 3. We start with the expression of the data projected onto the principal components as in equation 2.18: h0 (t) = O0T K 0 −1/2 σ (t).
(4.6)
One computes the k-order statistics hh0i1 h0i2 , . . . , h0ik ic . From equation 4.6, these cumulants have the following expression in term of source cumulants: hh0i1 h0i2 , . . . , h0ik ic =
N X
0 0 Oa,i , . . . , Oa,i ζ a, 1 k k
(4.7)
a=1
where the ζka are the normalized cumulants as defined in equation 2.5. Now, one multiplies two such cumulants having k − 1 identical indices and sums
1436
Jean-Pierre Nadal and Nestor Parga
over all the possible values of these indices. This will produce the contraction of k−1 matrices O0 with their transposed, leading to 1N , and only two terms O0 will remain. Explicitly, we consider the symmetric matrix, to be referred to as the k-Shraiman matrix,
Di,i0 =
X i1 ,i2 ,...,ik−1
hh0i h0i1 , . . . , h0ik−1 ic hh0i0 h0i1 , . . . , h0ik−1 ic ,
(4.8)
which, from equation 4.7, is equal to
Di,i0 =
N X
0 0 a 2 Oa,i Oa,i 0 (ζk ) .
(4.9)
a=1
The above formula is nothing but an eigendecomposition of the matrix D. This shows that the rows of O0 are the eigenvectors of the k-Shraiman matrix, the eigenvalues being (ζka )2 for a = 1, . . . , N. A solution of the source separation problem is thus given by the diagonalization of one k-Shraiman matrix (e.g., taking k = 3 or k = 4). 4.3 A Simple Solution Using Fourth-Order Cumulants. We now consider the solution based on fourth-order cumulants (Cardoso, 1989), which is directly related to the results obtained in section 3.2.1. Let us consider the following cumulants of the input data projected onto the principal components, h0 : * C4 i,i0 ≡
h0i
N X i00 =1
+ 0 h02 i00 hi0
.
(4.10)
c
In term of the orthogonal matrix O0 and of the cumulants ζ4a , it reads (C4 )i,i0 =
N X
0 a Oa,i ζ4
a=1
N X
0 0 (Oa,i” )2 Oa,i 0.
(4.11)
i00 =1
Since O0 is orthogonal, this reduces to the equation
C4 = O0 K4 O0T ,
(4.12)
where K4 is the diagonal matrix of the fourth-order cumulants, (K4 )a,b = δa,b ζ4a .
(4.13)
This shows that O0 can be found by solving for the eigendecomposition of the cumulant C4 .
Redundancy Reduction and Independent Component Analysis
1437
All of the algebraic solutions considered thus far in this section are based on the same two facts: (1) the diagonalization of a real positive symmetric matrix leaves an arbitrary orthogonal matrix, which can be used to diagonalize another symmetric matrix; (2) but because one is dealing with a linear mixture, applying this to two well-chosen correlation matrices is precisely enough to solve the BSS problem (in particular, we have seen that working with two symmetric matrices provides at least as many equations as unknowns). 4.4 The Joint Diagonalization Approach. All of the algebraic solutions discussed so far suffer from the same drawback, which is that sources having the same statistics at the orders under consideration will not be separated. Moreover, numerical instability may occur if these statistics are different but very close to one another. One may wonder whether it would be possible to work with an algebraic solution involving more equations than unknowns, in such a way that indetermination cannot occur. A positive answer is given by the joint diagonalization method of Cardoso and Souloumiac (1993) and Belouchrani (1993). Here we give a different (and slightly more general) presentation from the one in Cardoso and Souloumiac (1993). In particular, we will make use of the theorems of section 3.1. We will consider only correlations at equal time. The case of time correlations is discussed in Belouchrani (1993). The basic idea is to joint diagonalize a family of matrices Γr r 0α,β = hh0α h0β Qr (h0 )ic ,
(4.14)
where h0 is the principal component vector as defined in equation 2.13 and the Qr are well-chosen scalar functions of it, the index r labeling the function (hence the matrix) in the family. One possible example is the family defined by taking for r all possible choices of k − 2 indices with k ≥ 3, r ≡ (α1 , . . . , αk−2 ), 1 ≤ α1 ≤ α2 , . . . , ≤ αk−2 ≤ N
(4.15)
Qr=(α1 ,...,αk−2 ) (h0 ) = h0α1 , . . . , h0αk−2 .
(4.16)
and
The case considered in Cardoso and Souloumiac (1993) is k = 4. Using the expression of h0 as a function of the normalized sources, h0 (t) = O0T K 0 −1/2 σ (t),
(4.17)
where O0 is the orthogonal matrix that we want to compute (see section 2.2), one can write
Γr = O0T Λr O0 ,
(4.18)
1438
Jean-Pierre Nadal and Nestor Parga
where Λr is a diagonal matrix with components 0 0 3ra = ζka Oa,α , . . . , Oa,α . 1 k−2
(4.19)
As it is obvious in equation 4.18, the matrices Γr are jointly diagonalizable by the orthogonal matrix O0 . However, if for at least a pair a, b one has 3ra = 3rb for every r, O0 is not the only solution (up to a sign permutation). Actually this never happens, as shown in Cardoso and Souloumiac (1993) for k = 4. Let us give a direct proof valid for any k. We consider one particular orthogonal matrix O, which jointly diagonalizes all the matrices of the family, and let h = Oh0 . By hypothesis, the matrix OΓr OT is diagonal, that is hhi hi0 h0α1 , . . . , h0αk−2 ic = 0 for i 6= i0 ,
(4.20)
and this for every choice of the k − 2 indices. Multiplying the left-hand side by Oi1 ,α1 , . . . , Oik−2 ,αk−2 and summing over the Greek indices, one gets that for any choice of i1 , . . . , ik−2 , hhi hi0 hi1 , . . . , hik−2 ic = 0 for i 6= i0 .
(4.21)
We can now make use of Theorem 2: one can write equation 4.21 for the particular choice i1 = i2 = . . . = ik−2 = i0 , which gives exactly the conditions (see equation 3.2) for k and m = 2, and we can thus apply Theorem 2 (equivalently, one can deduce from equation 4.21 that these cumulants are zero whenever any two indices are different, and then use equation 3.13, that is, Comon’s 1994 result). For the simplest case k = 3, these conditions are exactly those of Theorem 2: joint diagonalizing the N matrices Γα = hh0 h0T h0α ic is strictly equivalent to imposing the conditions (see equation 3.2) for k = 3 (and m = 2) (note, however, that the number of conditions is larger than the minimum required according to Theorem 1). For k > 3, the number of conditions in equation 4.21 is larger that the number of conditions in equation 3.2. To conclude, one sees that at the price of having a number of conditions larger than the minimum required (in order to guarantee that no indetermination will occur), source separation can be done with an algebraic method, even when there are identical source cumulants. Remark. In practical applications, cumulants are empirically computed, and thus the matrices under consideration are not jointly diagonalizable. For this reason, a criterion is considered in Cardoso and Souloumiac (1993) that, if maximized, provides the best possible approximation to joint diagonalization. In this article, we do not consider this aspect of the problem.
Redundancy Reduction and Independent Component Analysis
1439
5 Cost Functions Derived from Information Theory We switch now to the study of adaptive algorithms. To do so, one first has to define proper cost functions. Whereas in the next section we will consider cost functions based on cumulants, here we consider the particular costs derived from information theory. In both cases we will take advantage of the results obtained in section 3. We will see that an important outcome is the derivation of updating rules for the synaptic efficacies closely related to the Bienenstock, Cooper, and Munro (BCM) theory of cortical plasticity (Bienenstock, Cooper, & Munro, 1982). 5.1 From Infomax to Redundancy Reduction. Our starting point is the main result obtained in Nadal and Parga (1994), namely, that maximization of the mutual information between the input data and the output (neural code) leads to redundancy reduction, hence to source separation for both linear and nonlinear mixtures. To be more specific, we first give a short derivation of that fact (for more details see Nadal & Parga, 1994). We consider a network with N inputs and p outputs, and nonlinear transfer functions fi , i = 1, . . . , p. Hence the output V is given by a gain control after some (linear or nonlinear) processing: Vi (t) = fi (hi (t)), i = 1, . . . , p.
(5.1)
In the simplest case (in particular, in the context of BSS), h is given by the linear combination of the inputs: hi (t) =
N X
Ji,j Sj (t), i = 1, . . . , p.
(5.2)
j=1
However, here the hi (t) can be as well any deterministic (hence not necessarily linear) functions of the inputs S(t). In particular, we will make use of this fact in section 7: there, hi will be the local field at the output layer of a one-hidden-layer network with nonlinear transfer functions. The mutual information I between the input and the output is given by (see Blahut, 1988): Z P(V, S) . (5.3) I ≡ dp V dN SP(V, S) log Q(V) P(S) This quantity is well defined only if noise processing is taken into account (e.g., resolution noise). In the limit of vanishing additive noise, one gets that maximizing the mutual information is equivalent to maximizing the (differential) output entropy H(Q) of the output distribution Q = Q(V), Z (5.4) H(Q) = − dp V Q(V) log Q(V).
1440
Jean-Pierre Nadal and Nestor Parga
On the right-hand side of equation 5.4, one can make the change of variable V → h, using p Y
dVi Q(V) =
p Y
dhi 9(h)
(5.5)
i = 1, . . . , p.
(5.6)
9(h) . dh9(h) ln Qp 0 i=1 fi (hi )
(5.7)
i=1
i=1
and dVi = fi0 (hi )dhi , This gives Z H(Q) = −
This implies that H(Q), hence I , is maximal when 9(h) factorizes, 9(h) =
p Y
9i (hi ),
(5.8)
i=1
and at the same time for each output neuron, the transfer function fi has its derivative equal to the corresponding marginal probability distribution: fi0 (hi ) = 9i (hi ),
i = 1, . . . , p.
(5.9)
As a result, infomax implies redundancy reduction. The optimal neural representation is a factorial code—provided it exists. 5.2 The Specific Case of BSS. Let us now return to the BSS problem for which the h are taken as linear combinations of the inputs. By hypothesis, the N-dimensional input is a linear mixture of N-independent sources. In the following we consider only p = N. Note that the factorial code is obtained by the network processing before applying the nonlinear function at each output neuron. From the algorithmic aspect, as suggested in Nadal and Parga (1994), this gives us two possible approaches. One is to optimize globally, that is, to maximize the mutual information over both the synaptic efficacies and the transfer functions. In that case, infomax is used in order to perform ICA—the nonlinear transfer functions being there just to enforce factorization. Another possibility is first to find the synaptic efficacies leading to a factorial code, and then compute the optimal transfer functions (which depend on the statistical properties of the stimuli). In that case, one may say that it is ICA that is used in order to build the network that maximizes information transfer. Still, if one considers that the transfer functions are chosen at each
Redundancy Reduction and Independent Component Analysis
1441
instant of time according to equation 5.9, the mutual information becomes precisely equal to minus the redundancy cost function R: Z
R≡
dh9(h) ln Qp
9(h)
i=1 9i (hi )
.
(5.10)
In the context of blind source separation the relevance of the redundancy cost function has been recognized by Comon (1994) (we also note a work by Burel [1992], where a different but related cost function is considered). Remark on the terminology. The quantity (see equation 5.10), called here and in the literature related to sensory coding the redundancy, is called in the signal processing literature (in particular in Comon, 1994) the mutual information, short for the mutual information between the random variables hi (the outputs). But this mutual information, that is the redundancy (see equation 5.10), should not be mistaken for the mutual information (see equation 5.3) we introduced above, which is defined in the usual way—that is, between the input and the output of a processing channel (Blahut, 1988). To avoid confusion, we will consistently use redundancy for equation 5.10, and mutual information for equation 5.3. Although it is appealing to work with either the mutual information or the redundancy, doing so may not be easy. It is convenient to rewrite the output entropy, changing the variable h → S in equation 5.7, as done in Bell and Sejnowski (1995). Since the input entropy H(P) is a constant (it does not depend on the couplings J), the quantity that has to be maximized is
E = ln |J| +
X hlog ci (hi )i,
(5.11)
i
where |J| is the absolute value of the determinant of the coupling matrix J and h.i is the average over the output activity hi . The function ci can be given two interpretations: it is equal to either fi0 , if one considers the mutual information, or to 9i , if one considers the redundancy (the mutual information for the optimal transfer function at a given J). In the first case, one has to find an algorithm for searching for the optimal transfer functions; in the second case, one has to estimate the marginal distributions. The cost (see equation 5.11) can be given another interpretation. In fact, it was first derived in a maximum likelihood approach (Gaeta & Lacoume, 1990; Pham, Garrat, & Jutten, 1992): it is easy to see that equation 5.11, with ci = 9i , is equal to the (average of) the log likelihood of the observed data (the inputs S), given that they have been generated as a linear combination of independent sources with the 9i as marginal distributions. In the two following subsections we consider practical approaches.
1442
Jean-Pierre Nadal and Nestor Parga
5.3 Working with a Given Family of Transfer Functions. We have seen that at the end of the optimization process, the transfer functions will be related to the probability distributions of the independent sources. Since these distributions are not known—and cannot be estimated without first performing source separation—the choice of a proper parameterized family may be a problem. Still, any prior knowledge on the sources and any reasonable assumption may be used to limit the search to a family of functions controlled by a small number of parameters. A practical way to search for the best fi0 is to restrict the search to an a priori chosen family of transfer functions. In Pham et al. (1992), a practical algorithm is proposed, based on a particular choice combined with an expansion of the cost close to a solution. Another, and very simple, strategy has been tried in Bell and Sejnowski (1995), where very promising results have been obtained on some specific applications. Their numerical simulations suggest that one can take transfer functions with a simple behavior (that is, for example, with one peak in the derivative when the data show only one peak in their distribution), and to optimize just the gain and the threshold in each transfer function, which means fitting the location of the peak and its height.
5.4 Cumulant Expansion of the Marginal Distributions. When working with the redundancy, one would like to avoid having to estimate the marginal distributions from histograms, since this would take a lot of time. One may parameterize the marginal probability distributions and adapt the parameters at the same time one is adapting the couplings. This is exactly the same as working with the mutual information with a parameterized family of transfer functions. Another possibility, considered in Gaeta and Lacoume (1990) and Comon (1994), is to replace each marginal by a simple probability distribution with the same first cumulants as the ones of the actual distribution. Recently this approach has been used in Amari et al. (1996). We consider this expansion here with a slightly different point of view in order to relate this approach to the results of section 3. We know that if we had gaussian distributions, every required computation would be easy. Now, if N is large, each field hi is a sum of a large number of random variables, so that before adaptation (that is, with arbitrary synaptic efficacies), the marginal distribution for hi is a gaussian. However, through adaptation, each hi becomes proportional to one source σα —whose distribution is in general not a gaussian, and not necessarily close to gaussian. Still, there is another, and stronger, motivation for considering such an approximation. Indeed, the result of section 3, which is that conditions on a limited set of cumulants are sufficient in order to obtain factorization, strongly suggests replacing the unknown distribution with a simple distribution having the same first cumulants up to some given order. Let us consider the systematic close-to-gaussian cumulant expansion of
Redundancy Reduction and Independent Component Analysis
1443
9i (hi ) (Abramowitz & Stegun, 1972). At first nontrivial order, it is given by ¸ · hi (h2i − 3) 9i (hi ) ≈ 9i1 (hi ) ≡ 9 0 (hi ) 1 + λ(3) , i 6
(5.12)
where 9 0 (hi ) is the normal distribution µ 2¶ h 1 9 0 (hi ) ≡ √ exp − i , 2 2π
(5.13)
and λ(3) i is the third (true) cumulant of hi : 3 λ(3) i ≡ hhi ic .
(5.14)
In expression 5.12, we have taken into account that, as explained in section 2, one can always take hhi i = 0.
(5.15)
hh2i ic = 1.
(5.16)
and
In the cost function (see equation 5.11), that is,
E = ln |J| +
XZ
dhi 9i (hi ) log 9i (hi ),
(5.17)
i
we replace 9i (hi ) by 9i1 (hi ), and expand the logarithm · ¸ hi (h2i − 3) ln 1 + λ(3) . i 6 Then the quantity to be maximized is, up to a constant,
E = ln |J| +
N 1X [λ(3) ]2 . 6 i=1 i
(5.18)
Since optimization has to be done under the constraints in equation 5.16, we add Lagrange multipliers:
E (ρ) = E −
N X 1 i=1
2
ρi (hh2i ic − 1).
(5.19)
1444
Jean-Pierre Nadal and Nestor Parga
Taking into account equation 5.15, one then obtains the updating equation for a given synaptic efficacy Jij : 1Jij ∝ −
dE (ρ) dJij
dE (ρ) = −JijT −1 − hh3i ic h(h2i − 1)Sj i + ρi hhi Sj i. dJij
(5.20)
We now consider the fixed-point equation, that is, 1Jij = 0. Multiplying by Ji0 j and summing over j, it reads: δii0 = ρi hhi hi0 ic − hh3i ic hh2i hi0 ic
(5.21)
together with hh2i ic = 1 for every i. The parameters ρi are obtained by writing the fixed-point equation (5.21) at i = i0 , that is, ρi = 1 + hh3i i2c .
(5.22)
Note that, in particular, ρi > 0 for all i. It follows from the result (see equation 3.1) of section 3 that the exact, desired solutions are particular solutions of the fixed-point equation (5.21), giving a particular absolute minimum of the cost function with the closeto-gaussian approximation. However, there is no guarantee that no other local minimum exists: there could be solutions for which equation 5.21 is satisfied with nondiagonal matrices hhi hi0 ic and hh2i hi0 ic . Remark. One may wonder what happens if one first performs whitening, computing the h0 , and then uses the mutual information between h0 and h. This is what is studied in Comon (1994), where at lowest order, the cost function (see equation 5.18) is found to be the sum of the square of the third cumulants. This can be readily seen from equation 5.18, where J is now the orthogonal matrix that takes h0 to h, and thus ln |J| is a constant. 5.5 Link with the BCM Theory of Synaptic Plasticity. Let us now consider a possible stochastic implementation of the gradient descent (see equation 5.20). Since there are products of averages, it is not possible to have a simple stochastic version, where the variation of Jij would depend on the instanteneous activities only. Still, by removing one of the averages in equation 5.20 one gets the following updating rule: 1Jij = ²{JijT −1 − ρi hi Sj + hh3i ic h2i Sj },
(5.23)
where ² is a parameter controlling the rate of variation of the synaptic efficacies. The parameters ρi can be taken at each time according to the fixed-point
Redundancy Reduction and Independent Component Analysis
1445
equation (5.22). It is quite interesting to compare the updating equation (5.23) with the BCM theory of synaptic plasticity (Bienenstock et al., 1982). In the latter, a qualitative synaptic modification rule was proposed in order to account for experimental data on neural cell development in the early visual system. This BCM rule can be seen as a nonlinear variant of the Hebbian covariance rule. One of its possible implementations reads, in our notation: 1Jij = ² γi {−hi Sj + 2i h2i Sj },
(5.24)
where γi and 2i are parameters possibly depending on the current statistics of the cell activities. The particular choices 2i = hh2i i, γi = 1 or 2−1 i have been studied with some detail (Intrator & Cooper, 1992; Law & Cooper, 1994). The two main features of the BCM rule are: (1) there is a synaptic increase or decrease depending on the postsynaptic activity relative to some threshold 2i , which itself varies according to the cell mean activity level; (2) at low activity, there is no synaptic modification. Since ρi is positive, the rule we derived above is quite similar to equation 5.24, with a threshold 2i equal to hh3i ic . ρi The main difference is in the constant (that is, activity-independent) term JijT −1 . This term plays a crucial role: it couples the N neural cells. Note in fact that the BCM rule has been mostly studied for a single cell, and only crude studies of its possible extension to several cells have been performed (Scofield & Cooper, 1985). Note also that in our formulas we have always assumed zero mean activity (hSj i = 0, hence also hhi i = 0). If this were not the case, the corresponding averages have to be subtracted (Sj → Sj − hSj i for every j, hi → hi − hhi i for every i). Finally we note that if the third cumulants are zero, one has to make the expansion up to the fourth order. The corresponding derivation and the conclusions are similar: apart from numerical factors, essentially the square of the third cumulant in the cost is replaced by the square of the fourth cumulant, and one gets again a plasticity rule similar to equation 5.23, that is with the same qualitative behavior as the BCM rule. 6 Adaptive Algorithms from Cost Functions Based on Cumulants 6.1 A Gradient Descent Based on Theorem 1. Among the algorithms using correlations at equal time, only algebraic solutions (discussed in section 4) and the recently proposed deflation algorithm (Delfosse & Loubaton, 1995), which extracts the independent components one by one, guarantee to find them in a rather simple and efficient way. All other approaches suffer from the same problem: empirical updating rules based on high moments, like the Herault-Jutten algorithm, and gradient methods based on some cost
1446
Jean-Pierre Nadal and Nestor Parga
function (most often a combination of cumulants), may have unwanted fixed points (see Comon, 1994; Delfosse & Loubaton, 1995). Whatever the algorithm one is working with, the conditions derived in section 3 can be used in order to check whether a correct solution has been found. Clearly one can also define cost functions from these conditions by taking the sum of the square of every cumulant that has to be set to zero. We thus have a cost function for which, in addition to having only good solutions as absolute minima, the value of the cost at an absolute minimum is known: it is zero. Of course, many other families of cumulants could be used for the same purpose. The possible interest of the one we are dealing with is that it involves a small number of terms. However, this does not imply a priori any particular advantage as far as efficiency is concerned. For illustrative purposes, we consider with more detail a gradient descent for a particular choice of cost based on Theorem 1 in section 3. Specifically, we ask for the diagonalization of the two-point correlation and the thirdorder cumulants hhi h2i0 ic . Here again we use the reduction to the search for an orthogonal transformation, as explained in section 2. We thus consider the optimization of the orthogonal matrix. The cost is then defined by
E=
1 X 1 h hi h2i0 i2c − Tr[ρ(OOT − 1N )] 2 i6=i0 2
(6.1)
where ρ is a symmetric matrix of Lagrange multipliers, and hi has to be written in term of O (see equation 2.15): hi =
N X α=1
Oi,α h0α ,
(6.2)
the h0i being the projections of the inputs onto the principal components, as given by equation 2.13. The simplest gradient descent scheme is given by dE Oi,α = −ε , dt dOi,α
(6.3)
where ε is some small parameter. From equation 6.1, one derives the derivative of E with respect to Oi,α : dE dOi,α
=
Xh
hhi h2i0 ic h h0α h2i0 ic + 2hhi0 h2i ic hhi0 hi h0α ic
i0 (6=i)
−
X i0
ρi,i0 Oi0 ,α .
i
(6.4)
Redundancy Reduction and Independent Component Analysis
1447
One can either adapt ρ according to dρ dE = −ε , dt dρ or choose ρ imposing at each time OOT = 1N , which we do here. The equation for ρ is obtained by writing the orthonogonality condition for O, (O + dO)(OT + dOT ) = 1N , that is:
O dOT + dO OT = 0.
(6.5)
Paying attention to the fact that ρ is symmetric, one gets ρi,i0 =
i 1 Xh hhi h2k ic hhi0 h2k ic + 2hhk h2i ic hhk hi hi0 ic 2 k(6=i0 ) i 1 Xh hhi h2k ic hhi0 h2k ic + 2hhk h2i0 ic hhk hi hi0 ic . + 2 k(6=i)
(6.6)
Replacing in equation 6.4 ρ by its expression (6.6), multiplying both sides of equation 6.4 by Oi0 ,α for some i0 and summing over α, and using equation 6.2, one gets the rather simple equations for the projections of the variations of O onto the N vectors Oi : X α
Oi0 ,α
X dOi,α dE = −ε Oi0 ,α ≡ −ε ηi,i0 , dt d O i,α α
(6.7)
where: ηi,i0 =
i 3h 3 hhi0 ic hhi h2i0 ic − hh3i ic hhi0 h2i ic 2 h i X hhk hi hi0 ic hhk h2i ic − hhk h2i0 ic . +
(6.8)
k
From the above expression, one can easily write the (less simple) updating equations for either O (multiplying by Oi0 ,α and summing over i0 ), or J −1
(multiplying by [OΛ0 2 O0 ]i0 ,j and suming over i0 ). Note that the Lagrange multiplier ρ ensures that, starting from an arbitrary orthogonal matrix O(0) at time 0, O(t) remains orthogonal. In practice, since this orthogonality is enforced only at first order in ε, an explicit normalization will have to be done from time to time This is an efficient method used in statistical mechanics and field theory (Aldazabal, Gonzalez-Arroyo, & Parga, 1985). Although we derived the updating equation from a global cost function, one may also derive an adaptative version. One possibility is to use the
1448
Jean-Pierre Nadal and Nestor Parga
approach considered in the preceding section: in each term containing a product of averages, one average h.i is replaced by the instantaneous value. The remaining average is computed as a time average on a moving time window. An alternative approach is to replace each average by a time average, taking different time constants in order to obtain an estimate of the product of averages—and not the average of the product. 6.2 From Feedforward to Lateral Connections. We conclude with a general remark concerning the choice of the architecture. In all the above derivations, we worked with a feedforward network, with no lateral connections. As it is well known, one may prefer to work with adaptable lateral connections, as it is the case in the Herault-Jutten algorithm (Jutten & Herault, 1991). One can in fact perform any given linear processing with either one or the other architecture; it is only the algorithmic implementation that might be simpler with a given architecture. Let us consider here this equivalence. A standard way to use a network with lateral connections is the one in Jutten and Herault (1991). One has a unique link from each input Si to the output unit i and lateral connections L between output units. The dynamics of the postsynaptic potentials ui of the output cells is given by X dui =− Li,i0 ui0 + Si . dt i0
(6.9)
If L has positive eigenvalues, then the dynamics converge to a fixed point h given by L h = S. As a result, after convergence, the network gives the same linear processing as the one with feedforward connections J given by
L = J−1 .
(6.10)
In the particular case considered above, one gets the updating rule for Lj,i0 = 1
−1 2 0 Jj,i 0 by multiplying equation 6.7 by [OΛ0 O ]i,j and summing over i. For a given adaptive algorithm derived from the minimization of a cost function, one can thus work with either the feedforward or the lateral connections. It is clear that, in general, updating rules will look different whether they are for the lateral or feedforward connections. However, it is worth mentioning that if we consider the updating rule after whitening (which is to assume that a first network is performing PCA, providing the h0α as input to the next layer), then the updating rule for the feedforward and the lateral connections is essentially the same. Indeed, the feedforward coupling matrix that allows going from h0 to h is an orthogonal transformation O; hence the associated lateral network has as couplings the inverse of that orthogonal transformation, that is, its transposed, OT .
Redundancy Reduction and Independent Component Analysis
1449
7 Possible Extensions to Nonlinear Processing Some work has already proposed redundancy reduction criteria for nonlinear data processing (Nadal & Parga, 1994; Haft, Schlang, & Deco, 1995; Parra, 1996), and for defining unsupervised algorithms in the context of automatic data clustering (Hinton, Dayan, Frey, & Neal, 1995). Here we just point out that all the criteria and cost functions discussed in this article may be applied to the output layer of a multilayer network for performing independent component analysis on nonlinear data. Indeed, if a multilayer network is able to perform ICA, this implies that in the layer preceding the output, the data representation is a linear mixture. The main questions are, then, How many layers are required in order to find such a linear representation? and Is it always possible to do so? Assuming that there exists at least one (possibly nonlinear) transformation of the data leading to a set of independent features, we suggest two lines of research. The first is based on general results on function approximation. It is known that a network with one hidden layer with sufficiently many units is able to approximate a given function with any desired accuracy (provided sufficiently many examples are available) (Cybenko, 1989; Hornik, Stinchcombe, & White, 1991; Barron, 1993). Then there exists a network with one hidden layer and N outputs such that the ith output unit gives an approximation of the particular function that extracts the ith independent component from the data. Hence, we know that it should be enough to take a network with one hidden layer. A possible approach is to perform gradient descent onto a cost function defined for the output layer, which, if minimized, means that separation has been achieved (we know that the redundancy will do). If the algorithm does not give good results, then one may increase the number of hidden units. Another approach is suggested by the study of the infomax-redundancy reduction criteria (Nadal & Parga, 1994). It is easy to see that the cost function (see equation 5.11) has a straightforward generalization to a multilayer network where every layer has the same number N of units. Indeed, if one calls Jk the couplings in the kth layer, and cki (hki ) the derivative of the transfer function (or the marginal distribution; see section 5.2) of the ith neuron in the kth layer, the mutual information between the input and the output of the multilayer network can be written as
EL =
X k
ln |Jk | +
X X k
hlog cki (hki )i.
(7.1)
i
Hence the cost EL is a sum of terms, each tending to impose factorization in a given layer. This allows an easy implementation of a gradient descent algorithm. Moreover, this additive structure of the cost suggests a constructive approach. One may start with one layer; if factorization is not obtained,
1450
Jean-Pierre Nadal and Nestor Parga
one can add a second layer, and so on (note, however, that the couplings of a given layer have to be readapted each time a new layer is added). 8 Conclusion In this article, we have presented several new results on blind source separation. Focusing on the mathematical aspects of the problem, we obtained several necessary and sufficient conditions that, if fulfilled, guarantee that separation has been performed. These conditions are on a limited set of cross-cumulants and can be used for defining an appropriate cost function or just in order to check, when using any BSS algorithm, that a correct solution has been reached. Next, we showed how algebraic solutions can be easily understood, and for some of them generalized, within the framework of the reduction to the search for an orthogonal matrix. We then discussed adaptive approaches, the main focus being on cost functions based on information-theoretic criteria. In particular, we have shown that the resulting updating rule appears to be, in a loose sense, Hebbian and more precisely quite similar to the type proposed by Bienenstock, Cooper, and Munro in order to account for experimental data on the development of the early visual system (Bienenstock et al., 1982). We also showed how some cost functions could be conveniently used for nonlinear processing, that is, for, say, a multilayer network. In all cases, we paid attention to relate our work to other similar approaches. We showed how the reduction to the search for an orthogonal transformation is a convenient tool for analyzing the BSS problem and finding new solutions. This, of course, does not mean that one cannot perform BSS without whitening, and indeed there are interesting approaches to BSS in which whitening is not required (Laheld & Cardoso, 1994). Appendix A: Proof of Theorem 1 Let us consider a matrix J for which equation 3.1 is true. First we use the fact that J diagonalize the two-point correlation. Hence, with the notation and results of section 2, we have to determine the family of orthogonal matrices X such that, when h = XK 0 −1/2 σ
(A.1)
the k-order cumulants in equation 3.1 are zero. Using the above expression of h, we have for any k-order cumulant in equation 3.1, hi0 ic = hh(k−1) i
N X (Xi,a )(k−1) Xi0 ,a , ζka , a=1
(A.2)
Redundancy Reduction and Independent Component Analysis
1451
where the ζka are the normalized k-cumulants ζka =
hσak ic k/2
hσa2 ic
,
a = 1, . . . , N.
(A.3)
A.1 The Case of N Nonzero Source Cumulants. We first consider the case when for every a, ζka is not zero. Since we want the k-order cumulants to be zero whenever i 6= i0 , we can write 1i δi,i0 =
N X (Xi,a )(k−1) Xi0 ,a ζka
(A.4)
a=1
for some yet indeterminate constants 1i . Using the fact that X is orthogonal, T we multiply both sides of this equation by Xi0 ,a = Xa,i 0 for some a and sum over i0 . This gives 1i Xi,a = (Xi,a )(k−1) ζka .
(A.5)
There are now two possibilities for each pair (i, a): either Xi,a = 0, or Xi,a is nonzero (and then 1i as well), and we can write (k−2) Xi,a = εi,a
1i , ζka
(A.6)
where εi,a is 1 or 0. For k odd, one then has · Xi,a = εi,a
1i ζka
¸
1 k−2
.
(A.7)
We now use the fact that X is orthogonal; first, for each i, the sum over a of 2 is one, hence for at least one a ε is nonzero—and it follows also the Xi,a i,a P that for every i 1i is nonzero. Second, for every pair i 6= i0 , a Xi,a Xi0 ,a = 0. Then, from equation A.7, we have X a
2
εi,a εi0 ,a (ζka ) k−2 = 0.
(A.8)
The left-hand side is a sum of positive terms; hence each has to be zero. It follows that for every a, either εi,a or εi0 ,a is zero (or both). The argument can be repeated exchanging the roles of the indices i and a, so that it is also true that for each pair a 6= a0 , for each i either εi,a or εi,a0 is zero (or both). Hence X is a matrix with, on each row and on each column, a single nonzero element, which, necessarily, is then ±1: X is what we called a sign permutation, and this completes the proof of part (i) of the theorem for the case where the k-order cumulants are nonzero for every source.
1452
Jean-Pierre Nadal and Nestor Parga
Remark. For k even, there is a sign indetermination when going from P (k−2) Xi,a to Xi,a . Hence one cannot write that a Xi,a Xi0 ,a is a sum of positive numbers. In fact, taking k = 4 and N even, one can easily build an example where the conditions in equation 3.1 are fulfilled with at least one solution X that is not a sign permutation. For instance, let N = 4 and ζ4a = z for every a. Then the equations 3.1 are fulfilled for X defined by:
1 1 1 X= 1 2 1
1 1 −1 −1
1 −1 1 −1
−1 1 −1 1
(A.9)
A.2 The Case of L < N Nonzero Source Cumulants. Now we consider the case where only L < N k-order cumulants are nonzero (and this will include the case of only one source with zero cumulant). Without loss of generality, we will assume that they are the first L sources (a = 1, . . . , L). We have to reconsider the preceding argument. It started with equation A.4 in which the right-hand side can be considered as a sum over a = 1, . . . , L. It follows that a particular family of solutions is the set of block-diagonal matrices X, with an L × L block being a sign permutation matrix, followed by an (N − L) × (N − L) block being an arbitrary (N − L) × (N − L) orthogonal matrix. We show now that these are, up to a global permutation, the only solutions. Since the k-cumulants are zero for a > L, we have the following possibilities for each i: 1. 1i = 0 and Xi,a = 0, a = 1, . . . , L. 2. 1i 6= 0, Xi,a = 0, a > L, and for a = 1, . . . , L equation A.6 is valid with at least one εi,a nonzero. By applying an appropriate permutation, we can assume that it is the first ` indices, i = 1, . . . , `, for which 1i 6= 0. Hence X has a nonzero upperleft ` × L block, X1 , which satisfies all the equations derived previously (in particular, X1 X1T = 1N ), but with i = 1, . . . , ` and a = 1, . . . , L; and a nonzero lower-right (N−`)×(N−L) block, X2 , for which the only constraint is X2 X2T = X2T X2 = 1N . It follows from the discussion of the case with nonzero cumulants that X1 has a single nonzero element per line and per column, which implies that this is a square matrix: ` = N, and this completes the proof. In addition, if only one source has zero k-order cumulant, then the X2 matrix is just a 1 × 1 matrix, whose unique element is thus ±1. Hence, in that case, it is in fact all the N sources that are separated.
Redundancy Reduction and Independent Component Analysis
1453
Appendix B: Proof of Theorem 2 In the first part of the proof, we proceed exactly as in appendix A. We reduce the problem to the search of an orthogonal matrix X such that, when h = XK 0 −1/2 σ
(B.1)
the k-order cumulants in equation 3.2 are zero. Using the above expression of h, we have for any k-order cumulant in equation 3.2, hm−1 hi00 ic = hh(k−m) i i0
N X a (Xi,a )(k−m) Xim−1 0 ,a Xi00 ,a ζk
(B.2)
a=1
where the ζka are the normalized k-cumulants as in equation A.3. Now we write that these quantities are zero whenever at least two of the indices i, i0 , i00 are different: 1i δi,i0 δi0 ,i00 =
N X a (Xi,a )(k−m) Xim−1 0 ,a Xi00 ,a ζk
(B.3)
a=1
for some yet indeterminate constants 1i . Using the fact that X is orthogonal, we multiply both sides of this equation by Xi00 ,a for some a for which ζka is nonzero, and sum over i00 . This gives 1i Xi0 ,a δi,i0 = (Xi,a )(k−m) (Xi0 ,a )(m−1) ζka .
(B.4)
Now, either Xi0 ,a = 0, or Xi0 ,a 6= 0 and then 1i δi,i0 = (Xi,a )(k−m) (Xi0 ,a )(m−2) ζka .
(B.5)
At this point the proof differs from the one of Theorem 1 and is in fact simpler. For i 6= i0 the right-hand side of equation B.5 is 0. Because Xi0 ,a 6= 0 and k is strictly greater than m, we obtain Xi,a = 0. Hence, for every a such that ζka 6= 0, there is at most one i for which Xi,a 6= 0. Since X is orthogonal, this implies that its restriction to the subspace of nonzero ζka is a sign permutation. Acknowledgments This work was partly supported by the French-Spanish program Picasso, the E.U. grant CHRX-CT92-0063, the Universidad Autonoma ´ de Madrid, the Universit´e Paris VI, and Ecole Normale Sup´erieure. NP and JPN thank, respectively, the Laboratoire de Physique Statistique (ENS) and the Departamento de F´ısica Teorica ´ (UAM) for their hospitality. We thank B. Shraiman for communicating his results prior to publication and P. Del Giudice and A. Campa for useful discussions. We thank an anonymous referee for useful comments, which led us to elaborate on the joint diagonalization method.
1454
Jean-Pierre Nadal and Nestor Parga
References Abramowitz, M., & Stegun, I. A. (1972). Handbook of mathematical functions. New York: Dover. Aldazabal, G., Gonzalez-Arroyo, A., & Parga, N. (1985). The stochastic quantization of U(N) and SU(N) lattice gauge theory and Langevin equations for the Wilson loops. Journal of Physics, A18, 2975. Amari, S. I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? NETWORK, 3, 213–251. Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193. Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (ed.), Sensory communication (p. 217). Cambridge, MA: MIT Press. Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Comp., 1, 412–423. Bar-Ness, Y. (1982, November). Bootstrapping adaptive interference cancelers: Some practical limitations. In The Globecom Conf. (pp. 1251–1255). Paper F3.7, Miami, Nov. 1982. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. I. T., 39. Bell, A., & Sejnowski, T. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159. Belouchrani, A., & M. (1993). S´eparation aveugle au second ordre de sources corr´el´ees. In GRETSI’93, Juan-Les-Pins, 1993. Bienenstock, E., Cooper, L., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal Neurosciences, 2, 32–48. Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley. Burel, G. (1992). Blind separation of sources: A nonlinear neural algorithm. Neural Networks, 5, 937–947. Cardoso, J.-F. (1989) Source separation using higher-order moments. In Proc. Internat. Conf. Acoust. Speech Signal Process.-89 (pp. 2109–2112). Glasgow. Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEE Proceedings-F, 140(6), 362–370. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314. Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Math. Contr. Signals Syst., 2, 303–314. Delfosse, N., & Loubaton, Ph. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Redundancy Reduction and Independent Component Analysis
1455
Del Giudice, P., Campa, A., Parga, N., & Nadal, J.-P. (1995). Maximization of mutual information in a linear noisy network: A detailed study. NETWORK, 6, 449–468. Deville, Y., & Andry, L. (1995). Application of blind source separation techniques to multi-tag contactless identification systems. Proceedings of NOLTA’95 (pp. 73-78). Las Vegas. F´ety, L. (1988). M´ethodes de traitement d’antenne adapt´ee aux radio-communications. Unpublished doctoral dissertation, ENST, Paris. Gaeta, M., & Lacoume, J. L. (1990). Source separation without apriori knowledge: The maximum likelihood approach. In L. Tores, E. MasGrau, & M. A. Lagunas (eds.), Signal Processing V, proceedings of EUSIPCO 90 (pp. 621-624). Haft, M., Schlang, M., & Deco, G. (1995). Information theory and local learning rules in a self-organizing network of ising spins. Phys. Rev. E, 52, 2860–2871. Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1160. Hopfield, J. J. (1991). Olfactory computation and object perception. Proc. Natl. Acad. Sci. USA, 88, 6462–6466. Hornik, K., Stinchcombe, M., & White, H. (1991). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. Intrator, N., & Cooper, L. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5, 3–17. Jutten, C., & Herault, J. (1991). Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. Laheld, B., & Cardoso, J.-F. (1994). Adaptive source separation without prewhitening. In Proc. EUSIPCO (pp. 183–186). Edinburgh. Law and Cooper, L. (1994). Formation of receptive fields in realistic visual environments according to the BCM theory. Proc. Natl. Acad. Sci. USA, 91, 7797– 7801. Li, Z., & Atick, J. J. (1994). Efficient stereo coding in the multiscale representation. Network: Computation in Neural Systems, 5, 1–18. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105– 117. Molgedey, L., & Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72, 3634–3637. Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, 565–581. Parra, L. C. (1996). Symplectic nonlinear component analysis. Preprint. Pham, D.-T., Garrat, Ph. & Jutten, Ch. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO (pp. 771–774). Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Comp., 5, 289–304. Rospars, J.-P., & Fort, J.-C. (1994). Coding of odor quality: Roles of convergence and inhibition. NETWORK, 5, 121–145. Scofield, C. L., & Cooper, L. (1985). Development and properties of neural networks. Contemporary Physics, 26, 125–145.
1456
Jean-Pierre Nadal and Nestor Parga
Shraiman, B. (1993). Technical memomorandum (AT&T Bell Labs TM-11111930811-36). Tong, L., Soo, V., Liu, R., & Huang, Y. (1990). Amuse: A new blind identification algorithm. In Proc. ISCAS. New Orleans. van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly lmcs, and experimental validation. J. Comp. Physiol. A, 171, 157– 170.
Received June 26, 1996; accepted February 19, 1997.
Communicated by Erkki Oja
Adaptive Online Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information Howard Hua Yang Shun-ichi Amari Laboratory for Information Representation, FRP, RIKEN Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan
There are two major approaches for blind separation: maximum entropy (ME) and minimum mutual information (MMI). Both can be implemented by the stochastic gradient descent method for obtaining the demixing matrix. The MI is the contrast function for blind separation; the entropy is not. To justify the ME, the relation between ME and MMI is first elucidated by calculating the first derivative of the entropy and proving that the mean subtraction is necessary in applying the ME and at the solution points determined by the MI, the ME will not update the demixing matrix in the directions of increasing the cross-talking. Second, the natural gradient instead of the ordinary gradient is introduced to obtain efficient algorithms, because the parameter space is a Riemannian space consisting of matrices. The mutual information is calculated by applying the Gram-Charlier expansion to approximate probability density functions of the outputs. Finally, we propose an efficient learning algorithm that incorporates with an adaptive method of estimating the unknown cumulants. It is shown by computer simulation that the convergence of the stochastic descent algorithms is improved by using the natural gradient and the adaptively estimated cumulants. 1 Introduction Let us consider the case where a number of observed signals are mixtures of the same number of stochastically independent source signals. It is desirable to recover the original source signals from the observed mixtures. The blind separation problem is to find a transform that recovers original signals without knowing how the sources are mixed. A learning algorithm for blind source separation belongs to the category of unsupervised factorial learning (Deco & Brauer, 1996). It is related to the redundancy reduction principle (Barlow & Foldi´ ¨ ak, 1989; Nadal & Parga, 1994), one of the principles possibly employed by the nervous system to extract the statistically independent features for a better representation of the input without losing information. When the mixing model is arbitrarily nonlinear, it is generally Neural Computation 9, 1457–1482 (1997)
c 1997 Massachusetts Institute of Technology °
1458
Howard Hua Yang and Shun-ichi Amari
impossible to separate the sources from the mixture unless further knowledge about the sources and the mixing model is assumed. In this article, we discuss only the linear mixing model for which the inverse transform (the demixing transform) is also linear. Maximum entropy (ME) (Bell & Sejnowski, 1995a) and minimum mutual information (MMI) or independent component analysis (ICA) (Amari, Cichocki, & Yang, 1996; Comon, 1994) are two major approaches to derive blind separation algorithms. By the MMI, the mutual information (MI) of the outputs is minimized to find the demixing matrix. This approach is based on established (ICA) theory (Comon, 1994). The MI is one of the best contrast functions since it is invariant under transforms such as scaling, permutation, and componentwise nonlinear transforms. Since scaling and permutation are indeterminacy in the blind separation problem, it is desirable to choose a contrast function that is invariant to these transforms. All the global minima (zero points) of the MI are possible solutions to the blind separation problem. By the ME approach, the outputs of the demixing system are first componentwise transformed by sigmoid functions, and then the joint entropy of the transformed outputs is maximized to find a demixing matrix. The online algorithm derived by the ME is concise and often effective in practice, but the ME has not yet been rigorously justified except for the case when the sigmoid functions happen to be the cumulative density functions of the unknown sources (Bell & Sejnowski, 1995b). We study the relation between the ME and MMI, and give a justification to the ME approach. In order to realize the MMI by online algorithms, we need to estimate the MI. To this end, we apply the Gram-Charlier expansion and the Edgeworth expansion to approximate the probability density functions of the outputs. We then have the stochastic gradient type online algorithm similar to the one obtained from the ME. In order to speed up the algorithm, we show that the natural gradient descent method should be used to minimize the estimated MI rather than the conventional gradient. This is because the stochastic gradient descent takes place in the space of matrices. This space has a natural Riemannian structure from which the natural gradient is obtained. We also apply this idea to improve the online algorithms based on the ME. Although the background philosophies of the ME and MMI are different, both approaches result in algorithms of a similar form with different nonlinearity. Both include unknown factors to be fixed adequately or to be estimated. They are activation functions in the case of ME and cumulants κ3 and κ4 in the case of MMI. Their optimal values are determined by the unknown probability distributions of the sources. The point is that the algorithms work even if the specification of the activation functions or cumulants is not accurate. However, efficiency of the algorithms becomes worse by misspecification. In order to obtain efficient algorithms, we propose an adaptive method
Adaptive Online Learning Algorithms
1459
together with online learning algorithms in which the unknown factors are adequately estimated. The performances of the adaptive online algorithms are compared with the fixed online algorithms based on simulation results. It is shown by simulations that the adaptive algorithms work much better in various examples. The blind separation problem is described in section 2. The relation between the ME and the MMI is discussed in section 3. The gradient descent algorithms based on ME and MMI are derived in section 4. The natural gradient is also shown, but its derivation is given in appendix B. The GramCharlier expansion and the Edgeworth expansion are applied in this section to estimate the MI. The adaptive online algorithms based on the MMI, together with an adaptive method of estimating the unknown cumulants, are proposed in section 5. The performances of the adaptive algorithms are compared with the fixed ones by the simulations in section 6. Finally, the conclusions are made in section 7. 2 Problem Let us consider n unknown source signals Si (t), i = 1, . . . , n, which are mutually independent at any fixed time t. We denote random variables by uppercase letters and their specific values by the same letters in lowercase. The bold uppercase letters denote random vectors or matrices. We assume that the sources Si (t) are stationary processes, each source has moments of any order, and at most one source is gaussian. We also treat visual signals where t should be replaced by the spatial coordinates (x, y) with Si (x, y) representing the brightness of the pixel at (x, y). The model for the sensor outputs is
X (t) = AS (t), where A ∈ Rn×n is an unknown nonsingular mixing matrix, S (t) = [S1 (t), . . . , Sn (t)]T and X (t) = [X1 (t), . . . , Xn (t)]T and T denote the transposition. Without knowing the source signals and the mixing matrix, we want to recover the original signals from the observations X (t) by the following linear transform,
Y (t) = W X (t), where Y (t) = [Y1 (t), . . . , Yn (t)]T and W ∈ Rn×n is a matrix. When W is equal to A−1 we have Y (t) = S (t). It is impossible to obtain the original sources Si (t) in exact order and amplitude because of the indeterminacy of permutation of {Si } and scaling due to the product of two unknowns: the mixing matrix A and the source vector S (t). Nevertheless, subject to a permutation of indices, it is possible to obtain the scaled sources ci Si (t) where the constants ci are nonzero scalar factors. The source signals are identifiable in this sense. Our goal is to find a demixing
1460
Howard Hua Yang and Shun-ichi Amari
matrix W adaptively so that [Y1 , . . . , Yn ] coincides with a permutation of scaled [S1 , . . . , Sn ]. In this case, this demixing matrix W is written as
W = ΛP A−1 , where Λ is a nonsingular diagonal matrix and P a permutation matrix. 3 Maximizing Entropy versus Minimizing Mutual Information There are two well-known methods for blind separation: (1) maximizing the entropy (ME) of the transformed outputs and (2) minimizing the mutual information (MMI) of Y so that its components become independent. We discuss the relation between ME and MMI. 3.1 Some Properties of Entropy and Mutual Information.PThe idea of ME originated from neural networks. Let us transform ya = j waj xj by a sigmoid function ga (y) to za = ga (ya ), which is regarded as the output from an analog neuron. Let Z = (g1 (Y1 ), . . . , gn (Yn )) be the componentwise transformed output vector by sigmoid functions ga (y), a = 1, . . . , n. It is expected that the entropy of the output Z is maximized when the components Za of Z are mutually independent. The blind separation algorithm based on ME (Bell & Sejnowski, 1995a) was derived by maximizing the joint entropy H(Z ; W ) with respect to W by using the stochastic gradient descent method. The joint entropy of Z is defined by Z H(Z ; W ) = − p(z ; W ) log p(z ; W )dz , where p(z ; W ) is the joint probability density function (pdf) of Z determined by W and {ga }. The nonlinear transformations ga (y) are necessary for bounding the entropy in a finite range. Indeed, when g(y) is bounded c ≤ g(y) ≤ d, for any random variable Y the entropy of Z = g(Y) has an upper bound: H(Z) ≤ log(d − c). Therefore, the entropy of the transformed output vector is upper bounded: H(Z ; W ) ≤
n X
H(Za ) ≤ n log(d − c).
(3.1)
a=1
In fact, the above inequality holds for any bounded transforms, so the global maximum of the entropy H(Z ; W ) exists. H(Z ; W ) may also have many local maxima determined by the functions {ga } used to transform Y . By studying the relation between ME and MMI, we shall prove that some of these maxima are the demixing matrices with the form ΛP A−1 .
Adaptive Online Learning Algorithms
1461
The basic idea of MMI is to choose W that minimizes the dependence among the components of Y . The dependence is measured by the KullbackLeibler divergence I(W ) between the joint pdf p(y ; W ) of Y and its factor˜ y ; W ), which is the product of the marginal pdf’s of Y : ized version p( Z ˜ y ; W )] = I(W ) = D[p(y ; W ) k p( ˜ y; W ) = where p( pa (ya ; W ) =
Qn Z
a=1 pa (ya ; W )
p(y ; W ) log
p(y ; W ) dy ˜ y, W ) p(
(3.2)
and pa (ya ; W ) is the marginal pdf of y ,
ˇ , . . . , dyn , p(y ; W )dy1 , . . . , dy a
ˇ denoting that dya is missing from dy = dy1 , . . . , dyn . dy a The Kullback-Leibler divergence (see equation 3.2) gives the MI of Y and is written in terms of entropies, I(W ) = −H(Y; W ) +
n X
H(Ya ; W ),
(3.3)
a=1
where
R H(Y; W ) = − p(y ; W ) log p(y ; W )dy ,
and
R H(Ya ; W ) = − pa (ya ; W ) log pa (ya ; W )dya is the marginal entropy.
It is easy to show that I(W ) ≥ 0 and is equal to zero if and only if Ya are independent. As it is proved in Comon (1994), I(W ) is a contrast function for the ICA, meaning, I(W ) = 0 iff W = ΛP A−1 . In contrast to the entropy criterion, the mutual information I(W ) is invariant under componentwise monotonic transformations, scaling, and permutation of Y . Let {gi (y)} be differentiable monotonic functions and Z = (g1 (Y1 ), . . . , gn (Yn )) be the transformed outputs. It is easy to prove that the mutual information I(W ) is the same for Y and Z . This implies that the componentwise nonlinear transformation is not necessary as a preprocessing if we use the MI as a criterion. 3.2 Relation Between ME and MMI. The blind separation algorithm derived by the ME approach is concise and often very effective in practice. However, the ME is not rigorously justified except when the sigmoid transforms happen to be the cumulative distribution functions (cdf’s) of the unknown sources.
1462
Howard Hua Yang and Shun-ichi Amari
It is discussed in Nadal and Parga (1994) and Bell and Sejnowski (1995a) that the ME does not necessarily lead to a statistically independent representation. To justify the ME approach, we study how ME is related to MMI. We prove that when all sources have a zero mean, ME gives locally correct solutions; otherwise it is not true in general. Let us first consider the relation between the entropy and the mutual information. Since Z is a componentwise nonlinear transform of Y , it is easy to obtain H(Z ; W ) = H(Y ; W ) +
n Z X
dya p(ya ; W ) log g0a (ya )
a=1
= −I(W ) +
n X
H(Ya , W ) +
a=1
= −I(W ) −
n X
n Z X
dya p(ya ; W ) log g0a (ya )
a=1
D[p(ya ; W ) k g0 (ya )].
a=1
Hence, we have the following equation: ˜ y ; W ) k g0 (y )] −H(Z ; W ) = I(W ) + D[p(
(3.4)
where g 0 (y ) =
n Y
g0a (ya )
a=1
is regarded as a pdf of an independent random vector provided, for each a, ga (−∞) = 0, ga (∞) = 1, and g0a (y) is the derivative of ga (y). Let {ra (sa )} be the pdf’s of the independent sources Sa (t) at any t. The joint Q pdf of the sources is r(s) = na=1 ra (sa ). When W = A−1 , we have Y = S so that p(ya ; A−1 ) = ra (ya ). Hence, from equation 3.4, it is straightforward that if {ga (ya )} are the cdf’s of the sources, r(y ) = g0 (y ) so that D[r(y ) k g0 (y )] = 0. In this case, H(Z ; A−1 ) = −I(A−1 ) = 0. Since H(Z ; W ) ≤ 0 from equation 3.1, the entropy H(Z ; W ) achieves the global maximum at W = A−1 . Let r˜(y ) be the joint pdf of the scaled and permutated sources y = ΛP s. It is easy to prove that H(Z ; W ) achieves the global maximum at W = ΛP A−1 too. This was also discussed in Nadal and Parga (1994) and Bell and Sejnowski (1995b), from different perspectives. The above formula (see equation 3.4) can also be derived from the formula 24 in Nadal and Parga (1994). We now study the general case where g0 (y ) is different from r(y ). We decompose H(Z ; W ) as −H(Z ; W ) = I(W ) + D(W ) + C(W ),
(3.5)
Adaptive Online Learning Algorithms
1463
where I(W ) = I(y ; W ), ˜ y , W ) k r(y )], D(W ) = D[p( n Z n Z X X C(W ) = dya p(ya ; W ) log ka (ya ) = dy p(y ; W ) log ka (ya ), a=1
a=1
ra (ya ) ka (ya ) = 0 . ga (ya ) Since p(ya , A−1 ) = ra (ya ), I(W ) and D(W ) take the minimum value zero at W = A−1 . To understand the behavior of C(W ), we need to compute its gradient. To facilitate the process for calculating the gradient, we reparameterize W as
W = (I + B )A−1 around W = A−1 so that Y = (I + B )S . We call the diagonal elements {Bbb } the scaling coordinates and the off-diagonal elements {Bbc , b 6= c} the cross-talking coordinates. If ∂C/∂Bbc 6= 0 for some b 6= c at W = A−1 , then the gradient descent algorithm based on the ME would increase crosstalking around this point. If ∂C/∂Bbc = 0 for all b 6= c, even if ∂C/∂Bbb 6= 0 for some b, the ME method still gives a correct solution for any nonlinear functions ga (y). We have the following lemma: Lemma 1.
At B = 0 or W = A−1 , for b 6= c, the gradient of C(W ) is
∂C =− ∂Bbc
µZ
¶ µZ dyc yc r(yc )
¶ dyb r0b (yb )rb (yb ) log kb (yb ) .
(3.6)
The proof is given in appendix A. The diagonal terms ∂C/∂Bbb do not vanish in general. It is easy to see from equation 3.6 that at W = A−1 the derivatives {∂C/∂Bbc , b 6= c} vanish when all the sources have a zero mean, Z dsa sa r(sa ) = 0, or all the sigmoid functions ga are equal to the cdf’s of the sources, that is, g0 (y ) = r(y ). Similarly, reparameterizing the function C(W ) around around W = ΛP A−1 using
W = (I + B )ΛP A−1 , we calculate the gradient ∂C/∂ B at B = 0. It turns out that equation 3.6 still holds after permutating indexes, and the derivatives {∂C/∂Bbc , b 6= c}
1464
Howard Hua Yang and Shun-ichi Amari
vanish when all sources have a zero mean or g0 (y ) = r˜(y ), which is the joint pdf of scaled and permutated sources. Hence, we have the following theorem regarding the relation between the ME and the MMI: Theorem 1. When all sources are zero mean signals, at all solution points W = ΛP A−1 determined by the contrast function MI, the ME algorithms will not update the demixing matrix in the directions of increasing the cross-talking. When one of sources has a nonzero mean, the entropy is not maximized in general at W = ΛP A−1 . Note the entropy is maximized at W = ΛP A−1 when the sigmoid functions are equal to the cdf’s of the scaled and permutated sources. But usually we cannot choose the unknown cdf’s as the sigmoid functions. In some blind separation problems such as the separation of visual signals, the means of mixture are positive. In such cases, we need a preprocessing step to centralize the mixture:
xt − xt = A(st − st ),
(3.7)
P where xt = 1t tu=1 xu and st is similarly defined. The mean subtraction can also be implemented by online thresholds for the sigmoid functions applied to the outputs. Applying a blind separation algorithm, we find a matrix W close to one of solution points ΛP A−1 such that W (xt − xt ) ≈ ΛP (st − st ), and then obtain the sources by
W xt = W (xt − xt ) + W xt ≈ ΛP st . Since every linear mixture model can be reformulated as equation 3.7, without losing generality, we assume that all sources have a zero mean in the rest of this article. 4 Gradient Descent Algorithms Based on ME and MMI The gradient descent algorithms based on ME and MMI are the following: dW ∂H(Z ; W ) =η , dt ∂W
(4.1)
∂I(W ) dW = −η . dt ∂W
(4.2)
However, the algorithms work well if we put a positive-definite operator G
Adaptive Online Learning Algorithms
1465
on matrices: ∂H(Z ; W ) dW = ηG ? dt ∂W
(4.3)
∂I(W ) dW = −ηG ? . dt ∂W
(4.4)
The gradients themselves are not obtainable in general. In practice, they are replaced by the stochastic ones whose expectations give the true gradients. This method, called the stochastic gradient descent method, has been known in the neural network community for a long time (e.g., Amari, 1967). Some remarks on the the stochastic gradient and backpropagation are given in Amari (1993). 4.1 Stochastic Gradient Descent Method Based on ME. The entropy H(Z ; W ) can be written as H(Z ; W ) = H(X ) + log |W | +
n X
E[log g0a (Ya )],
a=1
where |W | = | det(W )|. It is easy to show ∂ log |W | = W −T ∂W where W −T = (W −1 )T . Moreover, P ∂ a=1 log g0a (ya ) = −Φ(y )xT , ∂W where ¶T µ 00 g (y1 ) g00 (yn ) · · · − n0 Φ(y ) = − 10 . g1 (y1 ) gn (yn )
(4.5)
Hence, the expectation of the instantaneous values of the stochastic gradient c ∂H = W −T − Φ(y )xT ∂W
(4.6)
gives the gradient ∂H/∂ W . Based on this, the following algorithm proposed by Bell & Sejnowski (1995a), is obtained from equations 4.3 and 4.6: dW = η(W −T − Φ(y )xT ). dt
(4.7)
Here, the learning equation is in the form of equation 4.3 with the operator G equal to the identity matrix. We show a more adequate G later. When the
1466
Howard Hua Yang and Shun-ichi Amari
Rx transforms are tanh(x) and −∞ exp(− 14 u4 )du, the corresponding activation functions ga are 2tanh(x) and x3 , respectively. We show in appendix B that the natural choice of the operator G should be
G?
∂H ∂H W TW . = ∂W ∂W
This is given by the Riemannian structure of the parameter space of matrices W (Amari, 1997). We then obtain the following algorithm from equation 4.7, dW = η(I − Φ(y )y T )W , dt
(A)
which is not only computationally easy but also very efficient. The learning rule of form (A) was proposed in Cichocki, Unbehaven, Moszczynski, ´ and Rummert (1994) and later justified in Amari, Cichocki, & Yang, (1996). The learning rule (A) has two important properties: the equivariant property (Cardoso & Laheld, 1996) and the property of keeping W (t) from becoming singular, whereas the learning rule (see equation 4.7) does not have these properties. To prove the second property, we define hX , Y i = tr(X T Y ) and calculate ¿ À D E ∂|W | dW d|W | = = |W |W −T , η(I − 8(y )y T )W , dt ∂W dt à ! n X T (1 − φi (yi )yi ) |W |. = ηtr(I − 8(y )y )|W | = η i=1
Then, we obtain an expression for |W (t)| ( Z ) n tX (1 − φi (yi (τ ))yi (τ ))dτ , |W (t)| = |W (0)| exp η
(4.8)
0 i=1
from which we know that |W (t)| 6= 0 if |W (0)| 6= 0. This means that {X ∈ Rn×n : |X | 6= 0} is an invariant set of the flow W (t) driven by the learning rule (A). It is worth pointing out that algorithm 4.7 is equivalent to the sequential maximum likelihood algorithm when ga are taken to be equal to ra . For r(s) = r1 (s1 ) . . . rn (sn ), the likelihood function is log p(x; W ) = log(r(W x)|W |) =
n X a=1
log ra (ya ) + log |W |,
Adaptive Online Learning Algorithms
1467
from which we have ∂ log p(x; W ) r0 (ya ) xk + (W −T )ak , = a ∂wak ra (ya ) and in the matrix form, ∂ log p(x; W ) = W −T − 9(y )xT , ∂W where
¶T µ 0 r (y1 ) r0 (yn ) ,...,− n 9(y ) = − 1 r1 (y1 ) rn (yn )
and wak is the (a, k) elements of W . Using the natural (Riemannian) gradient descent method to maximize log p(x; W ), we have the sequential maximum likelihood algorithm of the type in equation (A): dW = µ(I − 9(y )y T )W . dt
(4.9)
From the point of view of asymptotic statistics, this gives the Fisher-efficient estimator. However, we do not know ra or 9(y ), so we need to use the adaptive method of estimating 9(y ) in order to implement equation 4.9. 4.2 Approximation of MI. To implement the MMI, we need some formulas to approximate the MI. Generally it is difficult to obtain the explicit form of the MI since the pdf’s of the outputs are unknown. The main difficulty is calculating the marginal entropy H(Ya ; W ) explicitly. In order to estimate the marginal entropy, the Edgeworth expansion and the GramCharlier expansion were used in Comon (1994) and Amari et al. (1996), respectively, to approximate the pdf’s of the outputs. The two expansions are formally identical except for the order of summation, but the truncated expansions are different. We show later by computer simulation that the Gram-Charlier expansion is superior to the Edgeworth expansion for blind separation. In order to calculate each H(Ya ; W ) in equation 3.3, we shall apply the Gram-Charlier expansion to approximate the pdf pa (ya ). Since we assume E[s] = 0, we have E[y] = E[W As] = 0 and E[ya ] = 0. To simplify the calculations for the entropy H(ya ; W ), to be carried out later, we assume ma2 = E[y2a ] = 1 for all a. Under this assumption, we approximate each marginal entropy first and then obtain a formula for the MI. The zero mean and unit variance assumption makes it easier to calculate the MI. However, the formula obtained for the MI can be used in more general cases. When the components of Y have nonzero means and different variances, we can shift and scale Y to obtain
Y˜ = Σ−1 (Y − m),
1468
Howard Hua Yang and Shun-ichi Amari
where Σ = diag(σ1 , . . . , σn ), σa is the variance of Ya , and m = E[Y ]. The components of Y˜ have the zero mean and the unit variance. Due to the invariant property of the MI, the formula for computing the MI of Y˜ is applicable to compute the MI of Y . We use the following truncated Gram-Charlier expansion (Stuart & Ord, 1994) to approximate the pdf pa (ya ): ½ ¾ κa κa pa (ya ) ≈ α(ya ) 1 + 3 H3 (ya ) + 4 H4 (ya ) , 3! 4!
(4.10)
where κ3a = ma3 and κ4a = ma4 − 3 are the third- and fourth-order cumulants of Ya , respectively, and mak = E[yka ] is the kth order moment of Ya , α(y) = √1 e− 2π
y2 2
, and Hk (y), k = 1, 2, . . . , are the Chebyshev-Hermite polynomials defined by the identity (−1)k
dk α(y) = Hk (y)α(y). dyk
The Gram-Charlier expansion clearly shows how κ3a and κ4a affect the approximation of the pdf. The last two terms in equation 4.10 characterize the deviations from the gaussian distributions. To apply equation 4.10 to calculate H(Ya ), we need the following integrals: Z α(y)(H3 (y))2 H4 (y)dy = 3!3
(4.11)
α(y)(H4 (y))3 dy = 123 .
(4.12)
Z
These integrals can be obtained easily from the following results for the moments of a gaussian random variable N(0, 1): Z
Z y2k+1 α(y)dy = 0,
y2k α(y)dy = 1 · 3 · · · (2k − 1).
(4.13)
By using the expansion log(1 + y) ≈ y −
y2 + O(y3 ) 2
and taking account of the orthogonality relations of the Chebyshev-Hermite polynomials and equations 4.11 and 4.12, the entropy H(Ya ; W ) is approximated by H(Ya ; W ) ≈
(κ a )2 (κ a )2 3 1 1 log(2πe) − 3 − 4 + (κ3a )2 κ4a + (κ4a )3 . (4.14) 2 2 · 3! 2 · 4! 8 16
Adaptive Online Learning Algorithms
1469
Let F(κ3a , κ4a ) denote the right-hand side of the approximation in equation 4.14. From Y = W X , we have H(Y ) = H(X ) + log |W |. Applying this expression and equation 4.14 to equation 3.3, we have I(Y ; W) ≈ −H(X ) − log |W | +
n X n log(2π e) + F(κ3a , κ4a ). 2 a=1
(4.15)
On the other hand, the following approximation of the marginal entropy (Comon, 1994) is obtained by using the Edgeworth expansion of pa (ya ): H(Ya ; W ) ≈
1 1 1 log(2πe) − (κ3a )2 − (κ a )2 2 2 · 3! 2 · 4! 4 −
7 a 4 1 a 2 a (κ ) + (κ3 ) κ4 . 48 3 8
(4.16)
The terms in the Edgeworth expansion are arranged in a decreasing order by assuming that κia is of order n(2−i)/2 . This is true when Ya is a sum of n independent random variables and the number n is large. To obtain the formula in equation 4.16, the truncated Edgeworth expansion is used by neglecting those high-order terms higher than 1/n2 . This is a standard method of asymptotic statistics but is not valid in the present context because of the fixed n. Moreover, at around W = A−1 , Ya is close to Sa , which is not a sum of independent random variables, so that the Edgeworth expansion is not justified in this case. The learning algorithm derived from equation 4.16 does not work well in our simulations. The reason is that the cubic of the fourth-order cumulant (κ4a )3 plays an important role but is omitted in the equation. It is a small term from the point of view of the Edgeworth expansion but not a small term in the present context. In our simulations, we deal with nearly symmetric source signals, so (κ3a )4 is small. In this case, we should use the following entropy approximation instead of equation 4.16: H(Ya ; W ) ≈
1 1 1 log(2πe) − (κ3a )2 − (κ a )2 2 2 · 3! 2 · 4! 4 1 1 + (κ3a )2 κ4a + (κ4a )3 . 8 48
(4.17)
The learning algorithm derived from the above formula works almost as well as equation 4.14. Note that the expansion formula can be more general. Instead of the gaussian kernel, we can use another standard distribution function as the kernel to expand pa (ya ) as: ( ) X µi Ki (y) , pa (y) = β(y) 1 + i
1470
Howard Hua Yang and Shun-ichi Amari
where Ki (y) are the orthogonal polynomials with respect to the kernel β(y). For example, the expansion corresponding to the kernel ( e−y , y ≥ 0 β(y) = 0, y>0 will be better than the Gram-Charlier expansion in approximating the pdf of a positive random variable. The orthogonal polynomials corresponding to this kernel are Laguerre polynomials. 4.3 Stochastic Gradient Method Based on MMI. To obtain the stochastic gradient descent algorithm to update W recursively, we need to calculate the gradient of I(W ) with respect to W . Since the exact function form of I(W ) is unknown, we calculate the derivative using the approximated MI. Since ∂κ3a = 3E[y2a xk ] ∂wak and ∂κ4a = 4E[y3a xk ], ∂wak we obtain the following from equation 4.15: ∂I(W ) ≈ −(W −T )ak + f (κ3a , κ4a )E[y2a xk ] + g(κ3a , κ4a )E[y3a xk ], ∂wak
(4.18)
where 9 1 f (y, z) = − y + yz, 2 4
1 3 3 g(y, z) = − z + y2 + z2 . 6 2 4
(4.19)
Removing the expectation symbol E(·) in equation 4.18 and writing it in a matrix form, we obtain the stochastic gradient: b ∂I ∂W
= −W −T + (f (κ3 , κ4 ) ◦ y 2 )xT + (g (κ3 , κ4 ) ◦ y 3 )xT ,
(4.20)
whose expectation gives ∂I/∂ W , where ◦ denotes the Hadamard product of two vectors
f ◦ y = ( f1 y1 , . . . , fn yn )T , and y k = [(y1 )k , . . . , (yn )k ]T for k = 2, 3, f (κ3 , κ4 ) = [ f (κ31 , κ41 ), . . . , f (κ3n , κ4n )]T , g (κ3 , κ4 ) = [g(κ31 , κ41 ), . . . , g(κ3n , κ4n )]T .
Adaptive Online Learning Algorithms
1471
We need to evaluate κ3 and κ4 for implementing this idea. If κ3 and κ4 are known, using the natural gradient descent method to minimize I(W ), from equation 4.20 we obtain the algorithm based on MMI: dW = η(t){I − Φκ (y )y T }W , dt
(A0 )
where
Φκ (y ) = h(y ; κ3 , κ4 ) = f (κ3 , κ4 ) ◦ y 2 + g (κ3 , κ4 ) ◦ y 3 .
(4.21)
Note that each component of Φκ (y ) is a third-order polynomial. In partic11 3 y . ular, if κ3i = 0 and κ4i = −1, the nonlinearity Φκ (y ) becomes 12 0 The algorithm (A ) is based on the entropy formula (see equation 4.14) derived from the Gram-Charlier expansion. Using the entropy formula (see equation 4.17) based on the Edgeworth expansion rather than the formula in equation 4.14, we obtain an algorithm, to be referred to as (A00 ). This algorithm has the same form as the algorithm (A0 ) except for the definition of the functions f and g. For the Edgeworth expansion-based algorithm (A00 ), instead of equation 4.19 we use the following definition for f and g: 7 3 1 f (y, z) = − y + y3 + yz, 2 4 4
1 1 g(y, z) = − z + y2 . 6 2
(4.22)
So the algorithm of type (A) is the common form for both the ME algorithm and the MMI algorithm. In the ME approach, the function Φ(y ) in type (A) is determined by the sigmoid transforms {ga (ya )}. If we use the cdf’s of the source distributions, we need to evaluate the unknown ra (sa ). In the MMI approach, the function Φκ (y ) = h(y ; κ3 , κ4 ) depends on the cumulants κ3a and κ4a , and the approximation procedures should be taken for computing the entropy. It is better to choose g0a equal to unknown ra or to choose κ3a and κ4a equal to the true values. However, it is verified by numerous simulations that even if the g0a or κ3a and κ4a are misspecified, the algorithms (A) and (A0 ) may still converge to the true separation matrix in many cases. 5 Adaptive Online Learning Algorithms In order to implement an efficient learning algorithm, we propose the following adaptive algorithm to trace κ3 and κ4 together with (A0 ): dκ3a = −µ(t)(κ3a − (ya )3 ), dt dκ4a = −µ(t)(κ4a − (ya )4 + 3), a = 1, . . . , n, dt where µ(t) is a learning rate function.
(5.1)
1472
Howard Hua Yang and Shun-ichi Amari
It is possible to use the same adaptive scheme for implementing the efficient ME algorithm. In this case, we use the Gram-Charlier expansion (see equation 4.10) to evaluate ra (ya ) or 9(y ). The performances of the algorithms (A0 ) with equation 5.1 will be compared with that of the algorithms (A) without adaptation in the next section by simulation. The simulation results demonstrate that the adaptive algorithm (A0 ) needs fewer samples to reach the separation than the fixed algorithm (A). 6 Simulation In order to show the effect of the online learning algorithms (A0 ) with equation 5.1 and compare its performance with that of the fixed algorithm (A), we apply these algorithms to separate sources from the mixture of modulating signals or the mixture of visual signals. For the algorithm A, we can use various fixed Φ(y ). For example, we can use one of the following functions 1. f (y) = y3 . 2. f (y) = 2 tanh(y). 3. f (y) = 34 y11 +
15 9 4 y
−
14 7 3 y
−
29 5 4 y
+
29 3 4 y
to define Φ(y ) = ( f (y1 ), . . . , f (yn ))T . Function 3 was used in Amari et al. (1996) by replacing κ3a and κ4a by their instantaneous values: κ3a ∼ (ya )3 ,
κ4a ∼ (ya )4 − 3.
We also compare the adaptive algorithm obtained from the Gram-Charlier expansion (A0 ) with the adaptive algorithm obtained from the Edgeworth expansion (A00 ). 6.1 Modulating Source Signals. Assume that the following five unknown sources are mixed by a mixing matrix A randomly chosen:
s(t) = [sign(cos(2π155t)), sin(2π 800t), sin(2π 300t + 6 cos(2π 60t)), sin(2π 90t), n(t)]T ,
(6.1)
where four components of s(t) are modulating data signals, and one component n(t) is a noise source uniformly distributed in [−1, +1]. The elements of the mixing matrix A are randomly chosen subject to the uniform distribution in [−1, +1] such that A is nonsingular. We use the cross-talking error defined below to measure the performance of the algorithms: ! Ã n n n n X X X X |pij | |pij | − 1 + −1 E= maxk |pik | maxk |pkj | i=1 i=1 j=1 j=1
Adaptive Online Learning Algorithms
1473
where P = (pij ) = WA. The mixed signals are sampled at the sampling rate of 10K Hz. Taking 2000 samples, we simulate the algorithm (A) with fixed Φ defined by the functions 1 and 2, and the algorithms (A0 ) and (A00 ) with the adaptive Φκ defined by equation 4.21 for which the functions f and g are defined by equations 4.19 and 4.22, respectively. The learning rate is a constant. For each algorithm, we select a nearly optimal learning rate. The initial matrix is chosen as W (0) = 0.5I in all simulations. The sources, mixtures, and outputs obtained by using the different algorithms are displayed in Figure 1. In each part, four out of five components are shifted upward from the zero level for a better illustration. The sources, mixtures, and all the outputs shown are within the same time window [0.1, 0.2]. It is shown in Figure 1 that the algorithms with adaptive Φκ need fewer samples than those with fixed Φ to achieve separation. The cross-talking errors are plotted in Figure 2, which shows that the adaptive MMI algorithms (A0 ) and (A00 ) outperform the ME algorithm (A) with either fixed function 1 or 2. The cross-talking error for each algorithm can be decreased further if more observations are taken and an exponentially decreasing learning rate is chosen. 6.2 Image Data. In this article, the two basic algorithms (A) and (A0 ) are derived by ME and MMI. The assumption that sources are independent is needed as a sufficient condition for identifiability. To check whether the performance of these algorithms is sensitive to the independence assumption, we test these algorithms on correlated data such as images. Images are usually correlated and nonstationary in the spatial coordinate. In the top row in Figure 3, we have six image sources {si (x, y), i = 1, . . . , 6} consisting of five natural images and one uniformly distributed noise. All images have the same size with N pixels in each of them. From each image, a pixel sequence is taken by scanning the image in a certain order, for example, row by row or column by column. These sources have the following correlation coefficient matrix:
Rs = (cij ) 1.0000 0.0819 −0.1027 −0.0616 0.2154 0.0106 0.0819 1.0000 0.3158 0.0899 −0.0536 −0.0041 −0.1027 0.3158 1.0000 0.3194 −0.4410 −0.0102 = 0.0899 0.3194 1.0000 −0.0692 −0.0058 −0.0616 0.2154 −0.0536 −0.4410 −0.0692 1.0000 0.0215 0.0106 −0.0041 −0.0102 −0.0058 0.0215 1.0000 , where the correlation coefficients cij are defined by P x,y (si (x, y) − si )(sj (x, y) − sj ) cij = Nσi σj
1474
0.1
Howard Hua Yang and Shun-ichi Amari
0.12
0.14
0.16
0.18
0.2
0.1
0.12
(a)
0.1
0.12
0.14
0.12
0.16
0.14
0.18
0.2
0.18
0.2
0.1
0.12
0.14
0.16
0.18
0.2
0.16
0.18
0.2
(d)
0.16
(e)
0.16
(b)
(c)
0.1
0.14
0.18
0.2
0.1
0.12
0.14
(f)
Figure 1: The comparison of the separation by the algorithms (A), (A0 ), and (A00 ). (a) The sources. (b) The mixtures. (c) The separation by (A0 ) using learning rate µ = 60. (d) The separation by (A00 ) using learning rate µ = 60. (e) The separation by (A) using x3 and µ = 65. (f) The separation by (A) using 2tanh(x) and µ = 30.
Adaptive Online Learning Algorithms
1475
30 (A) with 2tanh(x) 25
(A) with x^3 (A’’) (A’)
error
20
15
10
5
0 0
200
400
600
800
1000 1200 iteration
1400
1600
1800
2000
Figure 2: Comparison of the performances of (A) (using x3 or 2tanh(x)), and (A0 ) and (A00 ).
si =
1 X si (x, y). N x,y
σi2 =
1 X (si (x, y) − si )2 . N x,y
It is shown by the above source correlation matrix that each natural image is correlated with one or more other natural images. Mixing the image sources by the following matrix, 1.0439 1.0943 0.9927 1.0955 1.0373 0.9722 1.0656 0.9655 0.9406 1.0478 1.0349 1.0965 1.0913 0.9555 0.9498 0.9268 0.9377 0.9059 A= 1.0232 1.0143 0.9684 1.0440 1.0926 1.0569 0.9745 0.9533 1.0165 1.0891 1.0751 1.0321 0.9114 0.9026 1.0283 0.9928 1.0711 1.0380 , we obtain six mixed pixel sequences; their images are shown in the second
1476
Howard Hua Yang and Shun-ichi Amari
Figure 3: Separation of mixed images. The images in the first row are sources; those in the second row are the mixed images; those in the third row are separated images.
row in Figure 3. The mixed images are highly correlated with the correlation coefficient matrix, Rm =
1.0000 0.9966 0.9988 0.9982 0.9985 0.9968
0.9966 1.0000 0.9963 0.9995 0.9989 0.9989
0.9988 0.9963 1.0000 0.9972 0.9968 0.9950
0.9982 0.9995 0.9972 1.0000 0.9996 0.9993
0.9985 0.9989 0.9968 0.9996 1.0000 0.9996
0.9968 0.9989 0.9950 0.9993 0.9996 1.0000
.
To compute the demixing matrix, we apply the algorithm (A0 ) and use only 20 percent of the mixed data. The separation result is shown in the third row in Figure 3. If the outputs are arranged in the same order as the sources, the
Adaptive Online Learning Algorithms
correlation coefficient matrix of the output is 1.0000 −0.0041 −0.1124 −0.1774 −0.0041 1.0000 0.1469 0.0344 −0.1124 0.1469 1.0000 0.1160 Ro = −0.1774 0.0344 0.1160 1.0000 0.0663 −0.1446 −0.1683 −0.1258 0.1785 −0.0271 −0.0523 0.0946
1477
0.0663 −0.1446 −0.1683 −0.1258 1.0000 −0.1274
0.1785 −0.0271 −0.0523 0.0946 −0.1274 1.0000 .
Applying algorithm (A00 ), we obtain a separation result similar to that in Figure 3. It is not unusual that dependent sources can sometimes be separated by the algorithms (A0 ) and (A00 ) since the MI of the mixtures is usually much higher than the MI of the sources. Applying algorithms (A0 ) or (A00 ), the MI of the transformed mixture is decreased. When it reaches a level similar to that of the sources, the dependent sources can be extracted. In this case, some prior knowledge about sources (e.g., human face and speech) is needed to select the separation results. 7 Conclusion The blind separation algorithm based on the ME approach is concise and often very effective in practice, but it is not yet fully justified. The MMI approach, on the other hand, is based on the contrast function MI, which is well justified. We have studied the relation between the ME and MMI and prove that the ME will not update the demixing matrix in the directions of increasing the cross-talking at the solution points when all sources are zero mean signals. When one of sources has a nonzero mean, the entropy is not maximized in general at the solution points. It is suggested by this result that the mixture model should be reformulated such that all sources have a zero mean in order to use the ME approach. The two basic MMI algorithms (A0 ) and (A00 ) have been derived based on the minimization of the MI of the outputs. The contrast function MI is impractical unless we approximate it. The difficulty is evaluating the marginal entropy of the outputs. The Gram-Charlier expansion and the Edgeworth expansion are applied to estimate the MI. The natural gradient method is used to minimize the estimated MI to obtain the basic algorithm. These algorithms have been tested for separating unknown source signals mixed by a mixing matrix randomly chosen, and the validity of the algorithms has been verified. Because of the approximation error, the MMI algorithms have limitations; they cannot replace the ME. There are cases such as the experiment in Bell and Sejnowski (1995a) using speech signals in which the ME algorithm performs better. Although the ME and the MMI result in very similar algorithms, it is unknown whether the two approaches are equivalent globally. The activation functions in algorithm A and the cumulants in algorithm A0 should be de-
determined adequately. We proposed an adaptive algorithm to estimate these cumulants. It is observed from many simulations that the adaptive online algorithms (A′) and (A″) need fewer samples than the online algorithm (A) (with nonlinear functions such as 2 tanh(x) and x³) to reach separation. The test of algorithms (A′) and (A″) using the image data suggests that even dependent sources can sometimes be separated if some prior knowledge is used to select the separation results. The reason is that the MI of the mixtures is usually higher than that of the sources, and the algorithms (A′) and (A″) can decrease the MI from a high level to a lower one.

Appendix A: Proof of Lemma 1

Let ‖B‖ ≤ ε < 1. Then |I + B|⁻¹ = 1 − tr(B) + O(ε²) and (I + B)⁻¹ = I − B + O(ε²). So
$$\begin{aligned}
p(y;W) &= |I+B|^{-1}\, r\big((I+B)^{-1}y\big) \\
&= \big(1-\operatorname{tr}(B)\big) \prod_{a=1}^{n} r_a\Big(y_a - \sum_c B_{ac} y_c + O(\varepsilon^2)\Big) \\
&= \big(1-\operatorname{tr}(B)\big) \Big[\prod_{a=1}^{n} r_a(y_a) - \sum_a r_a'(y_a) \prod_{b\neq a} r_b(y_b) \sum_c B_{ac} y_c\Big] + O(\varepsilon^2) \\
&= r(y) - \Big[\operatorname{tr}(B) + \sum_{a,c} l_a'(y_a) B_{ac} y_c\Big] r(y) + O(\varepsilon^2),
\end{aligned} \tag{A.1}$$
where $l_a'(y_a) = r_a'(y_a)/r_a(y_a)$.
From equation A.1, we have the linear expansion of C(W) = C((I + B)A⁻¹) at B = 0,
$$C\big((I+B)A^{-1}\big) \approx -\sum_a \int dy\, \Big[\operatorname{tr}(B) + \sum_{b,c} l_b'(y_b) B_{bc} y_c\Big] r(y) \log k_a(y_a),$$
from which we calculate ∂C/∂B at B = 0. When b ≠ c,
$$\begin{aligned}
\frac{\partial C}{\partial B_{bc}} &= -\sum_a \int dy\, y_c\, r(y)\, l_b'(y_b) \log k_a(y_a) \\
&= -\sum_{a\neq c,\, a\neq b} \Big(\int dy_b\, r_b'(y_b)\Big) \Big(\int dy_c\, y_c\, r(y_c)\Big)\, D[r_a(y_a)\,\|\,g_a'(y_a)] \\
&\quad - \int dy\, y_c\, r_c(y_c) \Big(\prod_{i\neq c} r_i(y_i)\Big) l_b'(y_b) \log k_c(y_c) \\
&\quad - \Big(\int dy_c\, y_c\, r(y_c)\Big) \Big(\int dy_b\, r_b'(y_b)\, r_b(y_b) \log k_b(y_b)\Big) \\
&= -\Big(\int dy_c\, y_c\, r(y_c)\Big) \Big(\int dy_b\, r_b'(y_b)\, r_b(y_b) \log k_b(y_b)\Big),
\end{aligned} \tag{A.2}$$
since $\int dy\, r_b'(y) = 0$ and $\big(\prod_{i\neq c} r_i(y_i)\big)\, l_b'(y_b) = \big(\prod_{i\neq c,\, i\neq b} r_i(y_i)\big)\, r_b'(y_b)$.
Appendix B: Natural Gradient

Denote Gl(n) = {X ∈ R^{n×n} : det(X) ≠ 0}. Let W ∈ Gl(n) and let φ(W) be a scalar function. Before we define the natural gradient of φ(W), we recapitulate the gradient in a Riemannian manifold S in tensorial form. Let x = (xⁱ), i = 1, …, m, be its (local) coordinates. The gradient of φ(x) is written as
$$\nabla\varphi = \Big(\frac{\partial\varphi}{\partial x^i}\Big) = (a_i),$$
which is a linear operator (or a covariant vector) mapping a vector (contravariant vector) X = (Xⁱ) to real values in R,
$$\nabla\varphi \circ X = \sum_i a_i X^i.$$
The manifold S has the Riemannian metric g_{ij}(x), by which the inner product of X and Y is written as
$$\langle X, Y\rangle = \sum_{i,j} g_{ij} X^i Y^j.$$
Let ∇̃φ = ã be the contravariant version of a = ∇φ, such that
$$\nabla\varphi \circ X = \langle\tilde\nabla\varphi, X\rangle, \qquad\text{or}\qquad \sum_i a_i X^i = \sum_{i,j} g_{ij}\,\tilde a^i X^j.$$
So we have
$$\tilde a^i = \sum_j g^{ij} a_j,$$
where (g^{ij}) is the inverse matrix of (g_{ij}).

It is possible to give a more direct meaning to ãⁱ. We search for the steepest direction of φ(x) at point x. Let ε > 0 be a small constant, and we study the direction d such that
$$\tilde d = \operatorname{argmax}_d\, \varphi(x + \varepsilon d),$$
under the constraint that d is a unit vector,
$$\sum_{i,j} g_{ij}\, d^i d^j = 1.$$
Then it is easy to calculate that d̃ⁱ = ãⁱ. It is natural to define the gradient flow in S by
$$\dot x = -\eta\,\tilde\nabla\varphi = -\eta \sum_j g^{ij} \frac{\partial}{\partial x^j}\varphi(x).$$
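To make the recapitulation concrete, here is a minimal NumPy sketch of one steepest-descent step under a metric G = (g_ij); the function names and the step size are our own assumptions, not part of the paper. The second function anticipates the Gl(n) result derived below, dW/dt = −η(∇φ)WᵀW:

```python
import numpy as np

def riemannian_gradient_step(x, grad_phi, metric, eta=0.1):
    """One natural-gradient step: x <- x - eta * G^{-1} a, where
    a = grad_phi(x) is the ordinary (covariant) gradient and
    G = metric(x) is the Riemannian metric matrix at x."""
    a = grad_phi(x)
    G = metric(x)
    return x - eta * np.linalg.solve(G, a)

def natural_gradient_step_gl(W, grad_phi, eta=0.1):
    """Natural-gradient step on Gl(n) with the Lie-group-invariant
    metric (derived below): W <- W - eta * grad_phi(W) @ W.T @ W."""
    return W - eta * grad_phi(W) @ W.T @ W
```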
Now we return to our manifold Gl(n) of matrices. We define the natural gradient of φ(W) by imposing a Riemannian structure on Gl(n). The manifold Gl(n) has the Lie group structure: any A ∈ Gl(n) maps Gl(n) to Gl(n) by W → WA, and A = E (the unit matrix) is the unit. We impose that the Riemannian structure should be invariant under this operation by A. More definitely, at W, we consider a small deviation of W, W + εZ, where ε is infinitesimally small. Here, Z is the tangent vector at W (or an element of the Lie algebra). We introduce an inner product ⟨Z, Z⟩_W at W. Now, W is mapped to the unit matrix E by the operation of A = W⁻¹. Then W and W + εZ are mapped to WA = WW⁻¹ = E and (W + εZ)W⁻¹ = E + εZW⁻¹, respectively. So the tangent vector Z at W corresponds to the tangent vector ZW⁻¹ at E. They should have the same length (the Lie group invariance: the same as the invariance under the basis change of vectors on which a matrix acts). So
$$\langle Z, Z\rangle_W = \langle ZW^{-1}, ZW^{-1}\rangle_E.$$
It is very natural to define the inner product at E by
$$\langle Y, Y\rangle_E = \operatorname{tr}(Y^T Y) = \sum_{i,j} (Y_{ij})^2,$$
because we have no specific preference among the components at E. Then we have
$$\langle Z, Z\rangle_W = \langle ZW^{-1}, ZW^{-1}\rangle_E = \operatorname{tr}(W^{-T} Z^T Z W^{-1}).$$
On the other hand, the gradient operator ∂φ is defined by
$$\varphi(W + \varepsilon Z) = \varphi(W) + \varepsilon\,\partial\varphi \circ Z, \qquad \partial\varphi \circ Z = \operatorname{tr}(\partial\varphi^T Z) = \sum_{i,j} \frac{\partial\varphi}{\partial W_{ij}} Z_{ij}.$$
Hence, the contravariant version ∂̃φ satisfies
$$\partial\varphi \circ Z = \langle\tilde\partial\varphi, Z\rangle_W = \operatorname{tr}(W^{-T}\,\tilde\partial\varphi^T Z\, W^{-1}) = \operatorname{tr}(W^{-1} W^{-T}\,\tilde\partial\varphi^T Z),$$
giving
$$\partial\varphi^T = W^{-1} W^{-T}\,\tilde\partial\varphi^T, \qquad\text{or}\qquad \partial\varphi = \tilde\partial\varphi\, W^{-1} W^{-T},$$
that is,
$$\tilde\partial\varphi = \partial\varphi\, W^T W.$$
Hence, the natural gradient descent algorithm should be
$$\frac{dW}{dt} = -\eta\,(\nabla\varphi)\, W^T W.$$
The natural gradient (∇φ)WᵀW is equivalent to the relative gradient introduced in Cardoso and Laheld (1996). Using the natural gradient or the relative gradient leads to blind separation algorithms with the equivariant property; that is, the performance of the algorithms is independent of the scaling of the sources.

Acknowledgments

We thank the reviewers for their constructive comments on the manuscript and acknowledge the useful discussions with J.-F. Cardoso and A. Cichocki.

References

Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. on Electronic Computers, EC-16(3), 299–307.
Amari, S. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing, 5, 185–196.
Amari, S. (1997). Neural learning in structured parameter spaces—natural Riemannian gradient. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barlow, H. B., & Földiák, P. (1989). Adaptation and decorrelation in the cortex. In C. Miall, R. M. Durbin, & G. J. Mitchison (Eds.), The computing neuron (pp. 54–72). Reading, MA: Addison-Wesley.
Bell, A. J., & Sejnowski, T. J. (1995a). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1995b). Fast blind separation based on information theory. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications (Vol. 1, pp. 43–47). Las Vegas.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Cichocki, A., Unbehauen, R., Moszczyński, L., & Rummert, E. (1994). A new online adaptive learning algorithm for blind separation of source signals. In ISANN94 (pp. 406–411). Taiwan.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Deco, G., & Brauer, W. (1996). Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8(4), 525–535.
Nadal, J. P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 561–581.
Stuart, A., & Ord, J. K. (1994). Kendall's advanced theory of statistics. London: Edward Arnold.
Received August 19, 1996, accepted February 21, 1997.
Communicated by Anthony Bell
A Fast Fixed-Point Algorithm for Independent Component Analysis

Aapo Hyvärinen
Erkki Oja
Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland
We introduce a novel fast algorithm for independent component analysis, which can be used for blind source separation and feature extraction. We show how a neural network learning rule can be transformed into a fixed-point iteration, which provides an algorithm that is very simple, does not depend on any user-defined parameters, and is fast to converge to the most accurate solution allowed by the data. The algorithm finds, one at a time, all nongaussian independent components, regardless of their probability distributions. The computations can be performed in either batch mode or a semiadaptive manner. The convergence of the algorithm is rigorously proved, and the convergence speed is shown to be cubic. Some comparisons to gradient-based algorithms are made, showing that the new algorithm is usually 10 to 100 times faster, sometimes giving the solution in just a few iterations.

1 Introduction

Independent component analysis (ICA) (Comon, 1994; Jutten & Herault, 1991) is a signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. Two interesting applications of ICA are blind source separation and feature extraction. In the simplest form of ICA (Comon, 1994), we observe m scalar random variables v₁, v₂, …, v_m, which are assumed to be linear combinations of n unknown independent components s₁, s₂, …, s_n that are mutually statistically independent and zero mean. In addition, we must assume n ≤ m. Let us arrange the observed variables v_i into a vector v = (v₁, v₂, …, v_m)ᵀ and the component variables s_i into a vector s, respectively; then the linear relationship is given by
$$v = As. \tag{1.1}$$
Here, A is an unknown m × n matrix of full rank, called the mixing matrix. The basic problem of ICA is then to estimate the original components s_i from the mixtures v_j or, equivalently, to estimate the mixing matrix A.
The fundamental restriction of the model is that we can estimate only nongaussian independent components (except if just one of the independent components is gaussian). Moreover, neither the energies nor the signs of the independent components can be estimated, because any constant multiplying an independent component in equation 1.1 could be cancelled by dividing the corresponding column of the mixing matrix A by the same constant. For mathematical convenience, we define here that the independent components s_i have unit variance. This makes the (nongaussian) independent components unique, up to their signs. Note that no order is defined between the independent components. In blind source separation (Cardoso, 1990; Jutten & Herault, 1991), the observed values of v correspond to a realization of an m-dimensional discrete-time signal v(t), t = 1, 2, …. Then the components s_i(t) are called source signals, which are usually original, uncorrupted signals or noise sources. Another possible application of ICA is feature extraction (Bell & Sejnowski, 1996, in press; Hurri, Hyvärinen, Karhunen, & Oja, 1996). Then the columns of A represent features, and s_i signals the presence and the amplitude of the ith feature in the observed data v. The problem of estimating the matrix A in equation 1.1 can be somewhat simplified by performing a preliminary sphering or prewhitening of the data v (Cardoso, 1990; Comon, 1994; Oja & Karhunen, 1995). The observed vector v is linearly transformed to a vector x = Mv such that its elements x_i are mutually uncorrelated, and all have unit variance. Thus the correlation matrix of x equals unity: E{xxᵀ} = I. This transformation is always possible and can be accomplished by classical principal component analysis. At the same time, the dimensionality of the data should be reduced so that the dimension of the transformed data vector x equals n, the number of independent components. This also has the effect of reducing noise. After the transformation we have x = Mv = MAs = Bs,
(1.2)
where B = MA is an orthogonal matrix due to our assumptions on the components si : it holds E{xxT } = BE{ssT }BT = BBT = I. Thus we have reduced the problem of finding an arbitrary full-rank matrix A to the simpler problem of finding an orthogonal matrix B, which then gives s = BT x. If the ith column of B is denoted bi , then the ith independent component can be computed from the observed x as si = (bi )T x. The current algorithms for ICA can be roughly divided into two categories. The algorithms in the first category (Cardoso, 1992; Comon, 1994) rely on batch computations minimizing or maximizing some relevant criterion functions. The problem with these algorithms is that they require very complex matrix or tensorial operations. The second category contains adaptive algorithms often based on stochastic gradient methods, which may have implementations in neural networks (Amari, Cichocki, & Yang, 1996;
Bell & Sejnowski, 1995; Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1996; Jutten & Herault, 1991; Moreau & Macchi, 1993; Oja & Karhunen, 1995). The main problem with this category is the slow convergence and the fact that the convergence depends crucially on the correct choice of the learning rate parameters.

In this article we introduce a novel approach for performing the computations needed in ICA.¹ We introduce an algorithm using a very simple yet highly efficient fixed-point iteration scheme for finding the local extrema of the kurtosis of a linear combination of the observed variables. It is well known (Delfosse & Loubaton, 1995) that finding the local extrema of kurtosis is equivalent to estimating the nongaussian independent components. However, the convergence of our algorithm will be proved independent of these well-known results. The computations can be performed either in batch mode or semiadaptively. The new algorithm is introduced and analyzed in section 3, after presenting in section 2 a short review of kurtosis minimization-maximization and its relation to neural network type learning rules.

2 ICA by Kurtosis Minimization and Maximization

Most suggested solutions to the ICA problem use the fourth-order cumulant or kurtosis of the signals, defined for a zero-mean random variable v as
$$\operatorname{kurt}(v) = E\{v^4\} - 3\big(E\{v^2\}\big)^2. \tag{2.1}$$
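To make the definition concrete, the following minimal NumPy sketch (a hypothetical helper, not part of the paper) estimates equation 2.1 from samples and illustrates the sign behavior described next:

```python
import numpy as np

def kurt(v):
    """Sample estimate of kurt(v) = E{v^4} - 3 (E{v^2})^2
    for a zero-mean random variable v (equation 2.1)."""
    v = np.asarray(v, dtype=float)
    return np.mean(v**4) - 3.0 * np.mean(v**2) ** 2

rng = np.random.default_rng(0)
print(kurt(rng.standard_normal(100_000)))   # ~ 0  (gaussian)
print(kurt(rng.uniform(-1, 1, 100_000)))    # < 0  (flatter density)
print(kurt(rng.laplace(0, 1, 100_000)))     # > 0  (peaked density)
```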
For a gaussian random variable, kurtosis is zero; for densities peaked at zero, it is positive, and for flatter densities, negative. Note that for two independent random variables v₁ and v₂ and for a scalar α, it holds kurt(v₁ + v₂) = kurt(v₁) + kurt(v₂) and kurt(αv₁) = α⁴ kurt(v₁). Let us search for a linear combination of the sphered observations x_i, say, wᵀx, such that it has maximal or minimal kurtosis. Obviously, this is meaningful only if the norm of w is somehow bounded; let us assume ‖w‖ = 1. Using the orthogonal mixing matrix B, let us define z = Bᵀw. Then also ‖z‖ = 1. Using equation 1.2 and the properties of the kurtosis, we have
$$\operatorname{kurt}(w^T x) = \operatorname{kurt}(w^T B s) = \operatorname{kurt}(z^T s) = \sum_{i=1}^{n} z_i^4 \operatorname{kurt}(s_i). \tag{2.2}$$
¹ It was brought to our attention that a similar algorithm for blind deconvolution was proposed by Shalvi and Weinstein (1993).

Under the constraint ‖w‖ = ‖z‖ = 1, the function in equation 2.2 has a number of local minima and maxima. For simplicity, let us assume for the moment that the mixture contains at least one independent component
whose kurtosis is negative and at least one whose kurtosis is positive. Then, as was shown by Delfosse and Loubaton (1995), the extremal points of equation 2.2 are the canonical base vectors z = ±e_j, that is, vectors whose components are all zero except one component that equals ±1. The corresponding weight vectors are w = Bz = Be_j = b_j (perhaps with a minus sign), i.e., the columns of the orthogonal mixing matrix B. So by minimizing or maximizing the kurtosis in equation 2.2 under the given constraint, the columns of the mixing matrix are obtained as solutions for w, and the linear combination itself will be one of the independent components: wᵀx = (b_i)ᵀx = s_i. Equation 2.2 also shows that gaussian components cannot be estimated in this way, because for them kurt(s_i) is zero.

To minimize or maximize kurt(wᵀx), a neural algorithm based on gradient descent or ascent can be used (Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1996). Then w is interpreted as the weight vector of a neuron with input vector x. The objective function can be simplified because the inputs have been sphered. It holds
$$\operatorname{kurt}(w^T x) = E\{(w^T x)^4\} - 3\big[E\{(w^T x)^2\}\big]^2 = E\{(w^T x)^4\} - 3\|w\|^4. \tag{2.3}$$
Also the constraint ‖w‖ = 1 must be taken into account, for example, by a penalty term (Hyvärinen & Oja, 1996). Then the final objective function is
$$J(w) = E\{(w^T x)^4\} - 3\|w\|^4 + F(\|w\|^2), \tag{2.4}$$
where F is a penalty term due to the constraint. Several forms for the penalty term have been suggested by Hyvärinen and Oja (1996).² In the following, the exact form of F is not important. Denoting by x(t) the sequence of observations, by µ(t) the learning rate sequence, and by f the derivative of F/2, the online learning algorithm then has the form
$$w(t+1) = w(t) \pm \mu(t)\big[x(t)\,(w(t)^T x(t))^3 - 3\|w(t)\|^2 w(t) + f(\|w(t)\|^2)\, w(t)\big]. \tag{2.5}$$
The first two terms in brackets are obtained from the gradient of kurt(wᵀx) when instantaneous values are used instead of the expectation. The third term in brackets is obtained from the gradient of F(‖w‖²); as long as this is a function of ‖w‖² only, its gradient has the form scalar × w. A positive sign before the brackets means finding the local maxima; a negative sign corresponds to local minima. The convergence of this kind of algorithm can be proved using the principles of stochastic approximation.

² Note that in Hyvärinen and Oja (1996), the second term in J was also included in the penalty term.

The advantage of such neural learning
rules is that the inputs x(t) can be used in the algorithm at once, thus enabling fast adaptation in a nonstationary environment. A resulting trade-off is that the convergence is slow and depends on a good choice of the learning rate sequence µ(t). A bad choice of the learning rate can, in practice, destroy convergence. Therefore, some ways to make the learning radically faster and more reliable may be needed. The fixed-point iteration algorithms are such an alternative. The fixed points w of the learning rule (see equation 2.5) are obtained by taking the expectations and equating the change in the weight to 0:
$$E\{x\,(w^T x)^3\} - 3\|w\|^2 w + f(\|w\|^2)\, w = 0. \tag{2.6}$$
The time index t has been dropped. A deterministic iteration could be formed from equation 2.6 in a number of ways, for example, by standard numerical algorithms for solving such equations. A very fast iteration is obtained, as shown in the next section, if we write equation 2.6 in the form
$$w = \text{scalar} \times \big(E\{x\,(w^T x)^3\} - 3\|w\|^2 w\big). \tag{2.7}$$
Actually, because the norm of w is irrelevant, it is the direction of the right-hand side that is important. Therefore the scalar in equation 2.7 is not significant, and its effect can be replaced by explicit normalization, or the projection of w onto the unit sphere.

3 Fixed-Point Algorithm

3.1 The Algorithm.

3.1.1 Estimating One Independent Component. Assume that we have collected a sample of the sphered (or prewhitened) random vector x, which in the case of blind source separation is a collection of linear mixtures of independent source signals according to equation 1.2. Using the derivation of the preceding section, we get the following fixed-point algorithm for ICA:

1. Take a random initial vector w(0) of norm 1. Let k = 1.
2. Let w(k) = E{x (w(k − 1)ᵀx)³} − 3w(k − 1). The expectation can be estimated using a large sample of x vectors (say, 1000 points).
3. Divide w(k) by its norm.
4. If |w(k)ᵀw(k − 1)| is not close enough to 1, let k = k + 1, and go back to step 2. Otherwise, output the vector w(k).

The final vector w(k) given by the algorithm equals one of the columns of the (orthogonal) mixing matrix B. In the case of blind source separation, this means that w(k) separates one of the nongaussian source signals in the sense that w(k)ᵀx(t), t = 1, 2, …, equals one of the source signals.
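The following minimal NumPy sketch implements steps 1 through 4; the variable names, the tolerance `tol`, and the iteration cap are our own choices, not the authors' notation. It assumes the data have already been sphered:

```python
import numpy as np

def fixed_point_one_unit(x, tol=1e-6, max_iter=100, seed=0):
    """One-unit fixed-point iteration for ICA (steps 1-4 above).

    x: sphered observations, shape (n, T).
    Returns w approximating one column of the orthogonal matrix B."""
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)                   # step 1: random w(0)
    w /= np.linalg.norm(w)                       # of norm 1
    for _ in range(max_iter):
        w_old = w
        wx = w @ x                               # projections w^T x
        w = (x * wx**3).mean(axis=1) - 3.0 * w_old   # step 2
        w /= np.linalg.norm(w)                   # step 3
        if abs(w @ w_old) > 1.0 - tol:           # step 4: converged?
            break
    return w
```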
A remarkable property of our algorithm is that a very small number of iterations, usually 5 to 10, seems to be enough to obtain the maximal accuracy allowed by the sample data. This is due to the cubic convergence shown below.

3.1.2 Estimating Several Independent Components. To estimate n independent components, we run this algorithm n times. To ensure that we estimate each time a different independent component, we need only to add a simple orthogonalizing projection inside the loop. Recall that the columns of the mixing matrix B are orthonormal because of the sphering. Thus we can estimate the independent components one by one by projecting the current solution w(k) on the space orthogonal to the columns of the mixing matrix B previously found. Define the matrix B̄ as the matrix whose columns are the previously found columns of B. Then add the projection operation at the beginning of step 3:

Let w(k) = w(k) − B̄B̄ᵀw(k). Divide w(k) by its norm.

Also the initial random vector should be projected this way before starting the iterations. To prevent estimation errors in B̄ from deteriorating the estimate w(k), this projection step can be omitted after the first few iterations. Once the solution w(k) has entered the basin of attraction of one of the fixed points, it will stay there and converge to that fixed point.

In addition to the hierarchical (or sequential) orthogonalization described above, any other method of orthogonalizing the weight vectors could also be used. In some applications, a symmetric orthogonalization might be useful. This means that the fixed-point step is first performed for all the n weight vectors, and then the matrix W(k) = (w₁(k), …, w_n(k)) of the weight vectors is orthogonalized, for example, using the well-known formula,
$$W(k) = W(k)\big(W(k)^T W(k)\big)^{-1/2}, \tag{3.1}$$
where (W(k)ᵀW(k))⁻¹ᐟ² is obtained from the eigenvalue decomposition of W(k)ᵀW(k) = EDEᵀ as (W(k)ᵀW(k))⁻¹ᐟ² = ED⁻¹ᐟ²Eᵀ (both orthogonalizations are sketched in code below). However, the convergence proof below applies only to the hierarchical orthogonalization.

3.1.3 A Semiadaptive Version. A disadvantage of many batch algorithms is that large amounts of data must be stored simultaneously in working memory. Our fixed-point algorithm, however, can be used in a semiadaptive manner so as to avoid this problem. This can be accomplished simply by computing the expectation E{x (w(k − 1)ᵀx)³} by an online algorithm for N consecutive sample points, keeping w(k − 1) fixed, and updating the vector w(k) after the average over all the N sample points has been computed. This semiadaptive version also makes adaptation to nonstationary data possible. Thus, the semiadaptive algorithm combines many of the advantages usually attributed to either online or batch algorithms.
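Here is a minimal NumPy sketch of both orthogonalization options from section 3.1.2; the helper names are our own. The first implements the projection added to step 3, the second implements equation 3.1 via the eigendecomposition just described:

```python
import numpy as np

def deflation_project(w, B_found):
    """Hierarchical orthogonalization: w <- w - B_bar B_bar^T w,
    then renormalize. B_found holds the found columns of B."""
    if B_found.size:
        w = w - B_found @ (B_found.T @ w)
    return w / np.linalg.norm(w)

def symmetric_orthogonalize(W):
    """Symmetric orthogonalization W <- W (W^T W)^{-1/2}, using
    W^T W = E D E^T and (W^T W)^{-1/2} = E D^{-1/2} E^T."""
    d, E = np.linalg.eigh(W.T @ W)
    return W @ (E @ np.diag(d**-0.5) @ E.T)
```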
3.2 Convergence Proof. Now we prove the convergence of our algorithm. To begin, make the change of variables z(k) = Bᵀw(k). Note that the effect of the projection step is to set to zero all components z_i(k) of z(k) such that the ith column of B has already been estimated. Therefore, we can simply consider in the following the algorithm without the projection step, taking into account that the z_i(k) corresponding to the independent components previously estimated are zero. First, using equation 2.2, we get the following form for step 2 of the algorithm:
$$z(k) = E\{s\,(z(k-1)^T s)^3\} - 3z(k-1). \tag{3.2}$$
Expanding the first term, we can calculate explicitly the expectation, and obtain for the ith component of the vector z(k),
$$z_i(k) = E\{s_i^4\}\, z_i(k-1)^3 + 3\sum_{j\neq i} z_j(k-1)^2\, z_i(k-1) - 3z_i(k-1), \tag{3.3}$$
where we have used the fact that by the statistical independence of the s_i, we have E{s_i²s_j²} = 1, and E{s_i³s_j} = E{s_i²s_js_l} = E{s_is_js_ls_m} = 0 for four different indices i, j, l, and m. Using ‖z(k)‖ = ‖w(k)‖ = 1, equation 3.3 simplifies to
$$z_i(k) = \operatorname{kurt}(s_i)\, z_i(k-1)^3, \tag{3.4}$$
where kurt(s_i) = E{s_i⁴} − 3 is the kurtosis of the ith independent component. Note that the subtraction of 3w(k − 1) from the right side cancelled the term due to the cross-variances, enabling direct access to the fourth-order cumulants. Choosing j so that kurt(s_j) ≠ 0 and z_j(k − 1) ≠ 0, we further obtain
$$\frac{|z_i(k)|}{|z_j(k)|} = \frac{|\operatorname{kurt}(s_i)|}{|\operatorname{kurt}(s_j)|} \left(\frac{|z_i(k-1)|}{|z_j(k-1)|}\right)^3. \tag{3.5}$$
Note that the assumption z_j(k − 1) ≠ 0 implies that the jth column of B is not among those already found. Next note that |z_i(k)|/|z_j(k)| is not changed by the normalization step 3. It is therefore possible to solve explicitly for |z_i(k)|/|z_j(k)| from this recursive formula, which yields
$$\frac{|z_i(k)|}{|z_j(k)|} = \frac{\sqrt{|\operatorname{kurt}(s_j)|}}{\sqrt{|\operatorname{kurt}(s_i)|}} \left(\frac{\sqrt{|\operatorname{kurt}(s_i)|}\;|z_i(0)|}{\sqrt{|\operatorname{kurt}(s_j)|}\;|z_j(0)|}\right)^{3^k} \tag{3.6}$$
for all k > 0. For $j = \operatorname{argmax}_p \sqrt{|\operatorname{kurt}(s_p)|}\,|z_p(0)|$, we see that all the other components z_i(k), i ≠ j, quickly become small compared to z_j(k). Taking the normalization ‖z(k)‖ = ‖w(k)‖ = 1 into account, this means that z_j(k) → 1
and z_i(k) → 0 for all i ≠ j. This implies that w(k) = Bz(k) converges to the column b_j of the mixing matrix B, for which the kurtosis of the corresponding independent component s_j is not zero and which has not yet been found. This proves the convergence of our algorithm.

4 Simulation Results

The fixed-point algorithm was applied to blind separation of four source signals from four observed mixtures. Two of the source signals had a uniform distribution, and the other two were obtained as cubes of gaussian variables. Thus, the source signals included both subgaussian and supergaussian signals. Using different random mixing matrices and initial values w(0), five iterations were usually sufficient for estimation of one column of the orthogonal mixing matrix to an accuracy of four decimal places.

Next, the convergence speed of our algorithm was compared with the speed of the corresponding neural stochastic gradient algorithm. In the neural algorithm, an empirically optimized learning rate sequence was used. The computational overhead of the fixed-point algorithm was optimized by initially using a small sample size (200 points) in step 2 of the algorithm, and increasing it at every iteration for greater accuracy. The number of floating-point operations was calculated for both methods. The fixed-point algorithm needed only 10 percent of the floating-point operations required by the neural algorithm. This result was achieved with an empirically optimized learning rate. If we had been obliged to choose the learning rate without preliminary testing, the speed-up factor would have been of the order of 100. In fact, the neural stochastic gradient algorithm might not have converged at all. These simulation results confirm the theoretical implications of very fast convergence.

5 Discussion

We introduced a batch version of a neural learning algorithm for ICA. This algorithm, which is based on the fixed-point method, has several advantages as compared to other suggested ICA methods.

1. Equation 3.6 shows that the convergence of our algorithm is cubic. This means very fast convergence and is rather unique among the ICA algorithms. It is also in contrast to other similar fixed-point algorithms, like the power method, which often have only linear convergence. In fact, our algorithm can be considered a higher-order generalization of the power method for tensors.

2. Contrary to gradient-based algorithms, there are no learning rates or other adjustable parameters in the algorithm, which makes it easy to use and more reliable.
3. The algorithm, in its hierarchical version, finds the independent components one at a time, instead of working in parallel like most of the suggested ICA algorithms that solve for the entire mixing matrix. This makes it possible to estimate only certain desired independent components, provided we have sufficient prior information about the weight matrices corresponding to those components. For example, if the initial weight vector of the algorithm is the first principal component, the algorithm probably finds the most important independent component, that is, the one for which the norm of the corresponding column in the original mixing matrix A is the largest.

4. Components of both negative kurtosis (i.e., subgaussian components) and positive kurtosis (i.e., supergaussian components) can be found by starting the algorithm from different initial points and possibly removing the effect of the already-found independent components by a projection onto an orthogonal subspace. If just one of the independent components is gaussian (or otherwise has zero kurtosis), it can be estimated as the residual that is left over after extracting all other independent components. Recall that more than one gaussian independent component cannot be estimated in the ICA model.

5. Although the algorithm was motivated as a short-cut method to make neural learning for kurtosis minimization and maximization faster, its convergence was proved independently of the neural algorithm and the well-known results on the connection between ICA and kurtosis. Indeed, our proof is based on principles different from those used so far in ICA and thus opens new lines for research.

Recent developments of the theory presented in this article can be found in Hyvärinen (1997a,b).

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. In Advances in Neural Information Processing 8 (Proc. NIPS'95). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A., & Sejnowski, T. (1996). Learning higher-order structure of a natural sound. Network, 7, 261–266.
Bell, A., & Sejnowski, T. (in press). The "independent components" of natural scenes are edge filters. Vision Research.
Cardoso, J.-F. (1990). Eigen-structure of the fourth-order cumulant tensor with application to the blind source separation problem. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (pp. 2655–2658). Albuquerque, NM.
Cardoso, J.-F. (1992). Iterative techniques for blind source separation using only fourth-order cumulants. In Proc. EUSIPCO (pp. 739–742). Brussels.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Hurri, J., Hyvärinen, A., Karhunen, J., & Oja, E. (1996). Image feature extraction using independent component analysis. In Proc. NORSIG'96. Espoo, Finland.
Hyvärinen, A. (1997a). A family of fixed-point algorithms for independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '97) (pp. 3917–3920). Munich, Germany.
Hyvärinen, A. (1997b). Independent component analysis by minimization of mutual information (Technical Report). Helsinki: Helsinki University of Technology, Laboratory of Computer and Information Science.
Hyvärinen, A., & Oja, E. (1996). A neuron that learns to separate one independent component from linear mixtures. In Proc. IEEE Int. Conf. on Neural Networks (pp. 62–67). Washington, D.C.
Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Moreau, E., & Macchi, O. (1993). New self-adaptive algorithms for source separation based on contrast functions. In Proc. IEEE Signal Processing Workshop on Higher Order Statistics (pp. 215–219). Lake Tahoe, NV.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami, Y. Attikiouzel, R. Marks, D. Fogel, & T. Fukuda (Eds.), Computational intelligence—a dynamic system perspective (pp. 83–97). New York: IEEE Press.
Shalvi, O., & Weinstein, E. (1993). Super-exponential methods for blind deconvolution. IEEE Trans. on Information Theory, 39(2), 504–519.
Received April 8, 1996, accepted December 17, 1996.
Communicated by Garrison Cottrell
Dimension Reduction by Local Principal Component Analysis

Nandakishore Kambhatla
Todd K. Leen
Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, Portland, Oregon 97291-1000, U.S.A.
Reducing or eliminating statistical redundancy between the components of high-dimensional vector data enables a lower-dimensional representation without significant loss of information. Recognizing the limitations of principal component analysis (PCA), researchers in the statistics and neural network communities have developed nonlinear extensions of PCA. This article develops a local linear approach to dimension reduction that provides accurate representations and is fast to compute. We exercise the algorithms on speech and image data, and compare performance with PCA and with neural network implementations of nonlinear PCA. We find that both nonlinear techniques can provide more accurate representations than PCA and show that the local linear techniques outperform neural network implementations.

1 Introduction

The objective of dimension reduction algorithms is to obtain a parsimonious description of multivariate data. The goal is to obtain a compact, accurate representation of the data that reduces or eliminates statistically redundant components. Dimension reduction is central to an array of data processing goals. Input selection for classification and regression problems is a task-specific form of dimension reduction. Visualization of high-dimensional data requires mapping to a lower dimension—usually three or fewer. Transform coding (Gersho & Gray, 1992) typically involves dimension reduction. The initial high-dimensional signal (e.g., image blocks) is first transformed in order to reduce statistical dependence, and hence redundancy, between the components. The transformed components are then scalar quantized. Dimension reduction can be imposed explicitly, by eliminating a subset of the transformed components. Alternatively, the allocation of quantization bits among the transformed components (e.g., in increasing measure according to their variance) can result in eliminating the low-variance components by assigning them zero bits. Recently several authors have used neural network implementations of dimension reduction to signal equipment failures by novelty, or outlier,
detection (Petsche et al., 1996; Japkowicz, Myers, & Gluck, 1995). In these schemes, high-dimensional sensor signals are projected onto a subspace that best describes the signals obtained during normal operation of the monitored system. New signals are categorized as normal or abnormal according to the distance between the signal and its projection.¹

The classic technique for linear dimension reduction is principal component analysis (PCA). In PCA, one performs an orthogonal transformation to the basis of correlation eigenvectors and projects onto the subspace spanned by those eigenvectors corresponding to the largest eigenvalues. This transformation decorrelates the signal components, and the projection along the high-variance directions maximizes variance and minimizes the average squared residual between the original signal and its dimension-reduced approximation. A neural network implementation of one-dimensional PCA implemented by Hebb learning was introduced by Oja (1982) and expanded to hierarchical, multidimensional PCA by Sanger (1989), Kung and Diemantaras (1990), and Rubner and Tavan (1989). A fully parallel (nonhierarchical) design that extracts orthogonal vectors spanning an m-dimensional PCA subspace was given by Oja (1989). Concurrently, Baldi and Hornik (1989) showed that the error surface for linear, three-layer autoassociators with hidden layers of width m has global minima corresponding to input weights that span the m-dimensional PCA subspace.

Despite its widespread use, the PCA transformation is crippled by its reliance on second-order statistics. Though uncorrelated, the principal components can be highly statistically dependent. When this is the case, PCA fails to find the most compact description of the data. Geometrically, PCA models the data as a hyperplane embedded in the ambient space. If the data components have nonlinear dependencies, PCA will require a larger-dimensional representation than would be found by a nonlinear technique. This simple realization has prompted the development of nonlinear alternatives to PCA.

Hastie (1984) and Hastie and Stuetzle (1989) introduce their principal curves as a nonlinear alternative to one-dimensional PCA. Their parameterized curves f(λ): R → Rⁿ are constructed to satisfy a self-consistency requirement. Each data point x is projected to the closest point on the curve, λ_f(x) = argmin_µ ‖x − f(µ)‖, and the expectation of all data points that project to the same parameter value λ is required to be on the curve. Thus f(Λ) = E_x[x | λ_f(x) = Λ]. This mathematical statement reflects the desire that the principal curves pass through the middle of the data.
¹ These schemes also use a sigmoidal contraction map following the projection so that new signals that are close to the subspace, yet far from the training data used to construct the subspace, can be properly tagged as outliers.
Hastie and Stuetzle (1989) prove that the curve f(λ) is a principal curve iff it is a critical point (with respect to variations in f(λ)) of the mean squared distance between the data and their projection onto the curve. They also show that if the principal curves are lines, they correspond to PCA. Finally, they generalize their definitions from principal curves to principal surfaces.

Neural network approximators for principal surfaces are realized by five-layer autoassociative networks. Independent of Hastie and Stuetzle's work, several researchers (Kramer, 1991; Oja, 1991; DeMers & Cottrell, 1993; Usui, Nakauchi, & Nakano, 1991) have suggested such networks for nonlinear dimension reduction. These networks have linear first (input) and fifth (output) layers, and sigmoidal nonlinearities on the second- and fourth-layer nodes. The input and output layers have width n. The third layer, which carries the dimension-reduced representation, has width m < n. We will refer to this layer as the representation layer. Researchers have used both linear and sigmoidal response functions for the representation layer. Here we consider only linear response in the representation layer. The networks are trained to minimize the mean squared distance between the input and output and, because of the middle-layer bottleneck, build a dimension-reduced representation of the data. In view of Hastie and Stuetzle's critical point theorem, and the mean square error (MSE) training criteria for five-layer nets, these networks can be viewed as approximators of principal surfaces.²

Recently Hecht-Nielsen (1995) extended the application of five-layer autoassociators from dimension reduction to encoding. The third layer of his replicator networks has a staircase nonlinearity. In the limit of infinitely steep steps, one obtains a quantization of the middle-layer activity, and hence a discrete encoding of the input signal.³

In this article, we propose a locally linear approach to nonlinear dimension reduction that is much faster to train than five-layer autoassociators and, in our experience, provides superior solutions. Like five-layer autoassociators, the algorithm attempts to minimize the MSE between the original
² There is, as several researchers have pointed out, a fundamental difference between the representation constructed by Hastie's principal surfaces and the representation constructed by five-layer autoassociators. Specifically, autoassociators provide a continuous parameterization of the embedded manifold, whereas the principal surfaces algorithm does not constrain the parameterization to be continuous.
³ Perfectly sharp staircase hidden-layer activations are, of course, not trainable by gradient methods, and the plateaus of a rounded staircase will diminish the gradient signal available for training. However, with parameterized hidden unit activations h(x; a) with h(x; 0) = x and h(x; 1) a sigmoid staircase, one can envision starting training with linear activations and gradually shifting toward a sharp (but finite-sloped) staircase, thereby obtaining an approximation to Hecht-Nielsen's replicator networks. The changing activation function will induce both smooth changes and bifurcations in the set of cost function minima. Practical and theoretical issues in such homotopy methods are discussed in Yang and Yu (1993) and Coetzee and Stonick (1995).
data and its reconstruction from a low-dimensional representation—what we refer to as the reconstruction error. However, while five-layer networks attempt to find a global, smooth manifold that lies close to the data, our algorithm finds a set of hyperplanes that lie close to the data. In a loose sense, these hyperplanes locally approximate the manifold found by five-layer networks. Our algorithm first partitions the data space into disjoint regions by vector quantization (VQ) and then performs local PCA about each cluster center. For brevity, we refer to the hybrid algorithms as VQPCA. We introduce a novel VQ distortion function that optimizes the clustering for the reconstruction error. After training, dimension reduction proceeds by first assigning a datum to a cluster and then projecting onto the m-dimensional principal subspace belonging to that cluster. The encoding thus consists of the index of the cluster and the local, dimension-reduced coordinates within the cluster. The resulting algorithm directly minimizes (up to local optima) the reconstruction error. In this sense, it is optimal for dimension reduction. The computation of the partition is a generalized Lloyd algorithm (Gray, 1984) for a vector quantizer that uses the reconstruction error as the VQ distortion function. The Lloyd algorithm iteratively computes the partition based on two criteria that ensure minimum average distortion: (1) VQ centers lie at the generalized centroids of the quantizer regions and (2) the VQ region boundaries are surfaces that are equidistant (in terms of the distortion function) from adjacent VQ centers. The application of these criteria is considered in detail in section 3.2. The primary point is that constructing the partition according to these criteria provides a (local) minimum for the reconstruction error. Training the local linear algorithm is far faster than training five-layer autoassociators. The clustering operation is the computational bottleneck, though it can be accelerated using the usual tree-structured or multistage VQ. When encoding new data, the clustering operation is again the main consumer of computation, and the local linear algorithm is somewhat slower for encoding than five-layer autoassociators. However, decoding is much faster than for the autoassociator and is comparable to PCA. Local PCA has been previously used for exploratory data analysis (Fukunaga & Olsen, 1971) and to identify the intrinsic dimension of chaotic signals (Broomhead, 1991; Hediger, Passamante, & Farrell, 1990). Bregler and Omohundro (1995) use local PCA to find image features and interpolate image sequences. Hinton, Revow, and Dayan (1995) use an algorithm based on local PCA for handwritten character recognition. Independent of our work, Dony and Haykin (1995) explored a local linear approach to transform coding, though the distortion function used in their clustering is not optimized for the projection, as it is here. Finally, local linear methods for regression have been explored quite thoroughly. See, for example, the LOESS algorithm in Cleveland and Devlin (1988).
In the remainder of the article, we review the structure and function of autoassociators, introduce the local linear algorithm, and present experimental results applying the algorithms to speech and image data.

2 Dimension Reduction and Autoassociators

As considered here, dimension reduction algorithms consist of a pair of maps g: Rⁿ → Rᵐ with m < n and f: Rᵐ → Rⁿ. The function g(x) maps the original n-dimensional vector x to the dimension-reduced vector y ∈ Rᵐ. The function f(y) maps back to the high-dimensional space. The two maps, g and f, correspond to the encoding and decoding operation, respectively. If these maps are smooth on open sets, then they are diffeomorphic to a canonical projection and immersion, respectively. In general, the projection loses some of the information in the original representation, and f(g(x)) ≠ x. The quality of the algorithm, and adequacy of the chosen target dimension m, are measured by the algorithm's ability to reconstruct the original data. A convenient measure, and the one we employ here, is the average squared residual, or reconstruction error:
$$\mathcal{E} = E_x\big[\,\|x - f(g(x))\|^2\,\big].$$
The variance in the original vectors provides a useful normalization scale for the squared residuals, so our experimental results are quoted as normalized reconstruction error:
$$\mathcal{E}_{\mathrm{norm}} = \frac{E_x\big[\,\|x - f(g(x))\|^2\,\big]}{E_x\big[\,\|x - E_x x\|^2\,\big]}. \tag{2.1}$$
This may be regarded as the noise-to-signal ratio for the dimension reduction, with the signal strength defined as its variance. Neural network implementations of these maps are provided by autoassociators, layered, feedforward networks with equal input and output dimension n. During training, the output targets are set to the inputs; thus, autoassociators are sometimes called self-supervised networks. When used to perform dimension reduction, the networks have a hidden layer of width m < n. This hidden, or representation, layer is where the low-dimensional representation is extracted. In terms of the maps defined above, processing from the input to the representation layer corresponds to the projection g, and processing from the representation to the output layer corresponds to the immersion f.

2.1 Three-Layer Autoassociators. Three-layer autoassociators with n input and output units and m < n hidden units, trained to perform the identity transformation over the input data, were used for dimension reduction by several researchers (Cottrell, Munro, & Zipser, 1987; Cottrell &
Metcalfe, 1991; Golomb, Lawrence, & Sejnowski, 1991). In these networks, the first layer of weights performs the projection g and the second the immersion f . The output f (g(x)) is trained to match the input x in the mean square sense. Since the transformation from the hidden to the output layer is linear, the network outputs lie in an m-dimensional linear subspace of Rn . The hidden unit activations define coordinates on this hyperplane. If the hidden unit activation functions are linear, it is obvious that the best one can do to minimize the reconstruction error is to have the hyperplane correspond to the m-dimensional principal subspace. Even if the hidden-layer activation functions are nonlinear, the output is still an embedded hyperplane. The nonlinearities introduce nonlinear scaling of the coordinate intervals on the hyperplane. Again, the solution that minimizes the reconstruction error is the m-dimensional principal subspace. These observations were formalized in two early articles. Baldi and Hornik (1989) show that optimally trained three-layer linear autoassociators perform an orthogonal projection of the data onto the m-dimensional principal subspace. Bourlard and Kamp (1988) show that adding nonlinearities in the hidden layer cannot reduce the reconstruction error. The optimal solution remains the PCA projection.
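For reference, the PCA solution that these networks converge to can be computed directly. The following minimal NumPy sketch (our own helper, assuming one datum per row) projects onto the m-dimensional principal subspace and reports the normalized reconstruction error of equation 2.1:

```python
import numpy as np

def pca_normalized_error(X, m):
    """Encode/decode via the m leading principal directions and
    return E_norm from equation 2.1. X has shape (N, n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = evecs[:, ::-1][:, :m]       # leading m eigenvectors
    Z = Xc @ V                      # projection g(x)
    Xhat = mu + Z @ V.T             # reconstruction f(g(x))
    return np.sum((X - Xhat)**2) / np.sum(Xc**2)
```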
2.2 Five-Layer Autoassociators. To overcome the PCA limitation of three-layer autoassociators and provide the capability for genuine nonlinear dimension reduction, several authors have proposed five-layer autoassociators (e.g., Oja, 1991; Usui et al., 1991; Kramer, 1991; DeMers & Cottrell, 1993; Kambhatla & Leen, 1994; Hecht-Nielsen, 1995). These networks have sigmoidal second and fourth layers and linear first and fifth layers. The third (representation) layer may have linear or sigmoidal response. Here we use linear response functions. The first two layers of weights carry out a nonlinear projection g: Rn → Rm , and the last two layers of weights carry out a nonlinear immersion f : Rm → Rn (see Figure 1). By the universal approximation theorem for single hidden-layer sigmoidal nets (Funahashi, 1989; Hornik, Stinchcombe, & White, 1989), any continuous composition of immersion and projection (on compact domain) can be approximated arbitrarily closely by the structure. The activities of the nodes in the third, or representation layer form global curvilinear coordinates on a submanifold of the input space (see Figure 1b). We thus refer to five-layer autoassociative networks as a global, nonlinear dimension reduction technique. Several authors report successful implementation of nonlinear PCA using these networks for image (DeMers & Cottrell, 1993; Hecht-Nielsen, 1995; Kambhatla & Leen, 1994) and speech dimension reduction, for characterizing chemical dynamics (Kramer, 1991), and for obtaining concise representations of color (Usui et al., 1991).
Figure 1: (a) Five-layer feedforward autoassociative network with inputs x ∈ Rⁿ and representation layer of dimension m. Outputs x′ ∈ Rⁿ are trained to match the inputs, that is, to minimize E[‖x − x′‖²]. (b) Global curvilinear coordinates built by a five-layer network for data distributed on the surface of a hemisphere. When the activations of the representation layer are swept, the outputs trace out the curvilinear coordinates shown by the solid lines.
3 Local Linear Transforms

Although five-layer autoassociators are convenient and elegant approximators for principal surfaces, they suffer from practical drawbacks. Networks with multiple sigmoidal hidden layers tend to have poorly conditioned Hessians (see Rognvaldsson, 1994, for a nice exposition) and are therefore difficult to train. In our experience, the variance of the solution with respect to weight initialization can be quite large (see section 4), indicating that the networks are prone to trapping in poor local optima. We propose an alternative that does not suffer from these problems.

Our proposal is to construct local models, each pertaining to a different disjoint region of the data space. Within each region, the model complexity is limited; we construct linear models by PCA. If the local regions are small enough, the data manifold will not curve much over the extent of the region, and the linear model will be a good fit (low bias). Schematically, the training algorithm is as follows:

1. Partition the input space Rⁿ into Q disjoint regions {R⁽¹⁾, …, R⁽Q⁾}.

2. Compute the local covariance matrices
$$\Sigma^{(i)} = E\big[\,(x - Ex)(x - Ex)^T \mid x \in R^{(i)}\,\big]; \qquad i = 1, \ldots, Q,$$
and their eigenvectors $e_j^{(i)}$, j = 1, …, n. Relabel the eigenvectors so
that the corresponding eigenvalues are in descending order $\lambda_1^{(i)} > \lambda_2^{(i)} > \cdots > \lambda_n^{(i)}$.

3. Choose a target dimension m and retain the leading m eigendirections for the encoding.

The partition is formed by VQ on the training data. The distortion measure for the VQ strongly affects the partition, and therefore the reconstruction error for the algorithm. We discuss two alternative distortion measures. The local PCA defines local coordinate patches for the data, with the orientation of the local patches determined by the PCA within each region R_i. Figure 2 shows a set of local two-dimensional coordinate frames induced on the hemisphere data from figure 1, using the standard Euclidean distance as the VQ distortion measure.

Figure 2: Coordinates built by local PCA for data distributed on the surface of a hemisphere. The solid lines are the two principal eigendirections for the data in each region. One of the regions is shown shaded for emphasis.

3.1 Euclidean Partition. The simplest way to construct the partition is to build a VQ based on Euclidean distance. This can be accomplished by either an online competitive learning algorithm or by the generalized Lloyd algorithm (Gersho & Gray, 1992), which is the batch counterpart of competitive learning. In either case, the trained quantizer consists of a set of Q reference vectors r⁽ⁱ⁾, i = 1, …, Q, and corresponding regions R⁽ⁱ⁾. The placement of the reference vectors and the definition of the regions satisfy Lloyd's optimality conditions:

1. Each region R⁽ⁱ⁾ corresponds to all x ∈ Rⁿ that lie closer to r⁽ⁱ⁾ than to any other reference vector. Mathematically, $R^{(i)} = \{x \mid d_E(x, r^{(i)}) <$
$d_E(x, r^{(j)}),\ \forall j \neq i\}$, where $d_E(a, b)$ is the Euclidean distance between a and b. Thus, a given x is assigned to its nearest neighbor r.

2. Each reference vector r⁽ⁱ⁾ is placed at the centroid of the corresponding region R⁽ⁱ⁾. For Euclidean distance, the centroid is the mean $r^{(i)} = E[x \mid x \in R^{(i)}]$.

For Euclidean distance, the regions are connected, convex sets called Voronoi cells. As described in the introduction to section 3, one next computes the covariance matrices for the data in each region R⁽ⁱ⁾ and performs a local PCA projection. The m-dimensional encoding of the original vector x is thus given in two parts: the index of the Voronoi region that the vector lies in and the local coordinates of the point with respect to the centroid, in the basis of the m leading eigenvectors of the corresponding covariance.⁴ For example, if x ∈ R⁽ⁱ⁾, the local coordinates are
$$z = \big(e_1^{(i)} \cdot (x - r^{(i)}), \ldots, e_m^{(i)} \cdot (x - r^{(i)})\big). \tag{3.1}$$
The decoded vector is given by
$$\hat{x} = r^{(i)} + \sum_{j=1}^{m} z_j e_j^{(i)}. \tag{3.2}$$
The mean squared reconstruction error incurred is
$$\mathcal{E}_{\mathrm{recon}} = E\big[\|x - \hat{x}\|^2\big]. \tag{3.3}$$
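A minimal NumPy sketch of this encode/decode step, with our own helper names; `eigvecs[i]` is assumed to hold the eigenvectors of the ith local covariance as columns, sorted by descending eigenvalue:

```python
import numpy as np

def vqpca_encode(x, refs, eigvecs, m):
    """Euclidean-partition encoding: (region index, local coordinates),
    following equation 3.1. refs has shape (Q, n)."""
    i = int(np.argmin(np.sum((refs - x) ** 2, axis=1)))  # nearest r^(i)
    z = eigvecs[i][:, :m].T @ (x - refs[i])              # equation 3.1
    return i, z

def vqpca_decode(i, z, refs, eigvecs):
    """Reconstruction x_hat = r^(i) + sum_j z_j e_j^(i), equation 3.2."""
    m = len(z)
    return refs[i] + eigvecs[i][:, :m] @ z
```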
3.2 Projection Partition. The algorithm described above is not optimal because the partition is constructed independent of the projection that follows. To understand the proper distortion function from which to construct the partition, consider the reconstruction error for a vector x that lies in R⁽ⁱ⁾,
$$d(x, r^{(i)}) \equiv \Big\|\,x - r^{(i)} - \sum_{j=1}^{m} z_j e_j^{(i)}\,\Big\|^2 = (x - r^{(i)})^T P^{(i)T} P^{(i)} (x - r^{(i)}) \equiv (x - r^{(i)})^T \Pi^{(i)} (x - r^{(i)}), \tag{3.4}$$
where P⁽ⁱ⁾ is the (n − m) × n matrix whose rows are the trailing eigenvectors of the covariance matrix Σ⁽ⁱ⁾. The matrix Π⁽ⁱ⁾ is the projection orthogonal to the local m-dimensional PCA subspace.

⁴ The number of bits required to specify the region is small (between four and seven bits in all the experiments presented here) with respect to the number of bits used to express the double-precision coordinates within each region. In this respect, the specification of the region is nearly free.
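A minimal sketch of this distortion function of equation 3.4, under the same conventions as the encoder above (eigenvector columns sorted by descending eigenvalue, so the trailing n − m directions are columns m onward):

```python
import numpy as np

def reconstruction_distance(x, r, eigvecs_i, m):
    """Equation 3.4: squared distance from x to the local
    m-dimensional principal subspace at reference vector r."""
    P = eigvecs_i[:, m:].T        # (n - m) x n, rows = trailing e_j
    d = P @ (x - r)
    return float(d @ d)
```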
Figure 3: Assignment of the data point x to one of two regions based on (left) Euclidean distance and (right) the reconstruction distance. The reference vectors r⁽ⁱ⁾ and leading eigenvector $e_1^{(i)}$ are shown for each of two regions (i = 1, 2). See text for explanation.
The reconstruction distance d(x, r⁽ⁱ⁾) is the squared projection of the difference vector x − r⁽ⁱ⁾ on the trailing eigenvectors of the covariance matrix for region R⁽ⁱ⁾.⁵ Equivalently, it is the squared Euclidean distance to the linear manifold that is defined by the local m-dimensional PCA in the ith local region. Clustering with respect to the reconstruction distance directly minimizes the expected reconstruction error E_recon.

Figure 3 illustrates the difference between Euclidean distance and the reconstruction distance, with the latter intended for a one-dimensional local PCA. Suppose we want to determine to which of two regions the data point x belongs. For Euclidean clustering, the distance between the point x and the two centroids r⁽¹⁾ and r⁽²⁾ is compared, and the data point is assigned to the cluster whose centroid is closest—in this case, region 1. For clustering by the reconstruction distance, the distance from the point to the two one-dimensional subspaces (corresponding to the principal subspace for the two regions) is compared, and the data point is assigned to the region whose principal subspace is closest—in this case, region 2. Data points that lie on the intersection of hyperplanes are assigned to the region with lower index. Thus the membership in regions defined by the reconstruction distance can be different from that defined by Euclidean distance. This is because the reconstruction distance does not count the distance along the leading eigendirections. Neglecting the distance along the leading eigenvectors is exactly what is required, since we retain all the information in the leading directions during the PCA projection. Notice too that, unlike the Euclidean Voronoi regions, the regions arising from the reconstruction distance may not be connected sets.

⁵ Note that when the target dimension m equals 0, the representation is reduced to the reference vector r⁽ⁱ⁾ with no local coordinates. The distortion measure then reduces to the Euclidean distance.

Since the reconstruction distance (see equation 3.4) depends on the eigenvectors of Σ⁽ⁱ⁾, an online algorithm for clustering would be prohibitively
Since the reconstruction distance (see equation 3.4) depends on the eigenvectors of Σ^(i), an online algorithm for clustering would be prohibitively expensive. Instead, we use the generalized Lloyd algorithm to compute the partition iteratively. The algorithm is:

1. Initialize the r^(i) to randomly chosen inputs from the training data set. Initialize the Σ^(i) to the identity matrix.

2. Partition. Partition the training data into Q regions R^(1), . . . , R^(Q), where

   R^(i) = { x | d(x, r^(i)) ≤ d(x, r^(j)) for all j ≠ i },   (3.5)
with d(x, r^(i)) the reconstruction distance defined in equation 3.4.

3. Generalized centroid. According to the Lloyd algorithm, the reference vectors r^(i) are to be placed at the generalized centroid of the region R^(i). The generalized centroid is defined by

   r^(i) = argmin_r (1/N_i) Σ_{x∈R^(i)} (x − r)^T Π^(i) (x − r),   (3.6)

where N_i is the number of data points in R^(i). Expanding the projection operator Π^(i) in terms of the eigenvectors e_j^(i), j = m + 1, . . . , n, and setting to zero the derivative of the argument of the right-hand side of equation 3.6 with respect to r, one finds a set of equations for the generalized centroid⁶ (Kambhatla, 1995),

   Π^(i) r = Π^(i) x̄,   (3.7)

where x̄ is the mean of the data in R^(i). Thus any vector r whose projection along the trailing eigenvectors equals the projection of x̄ along the trailing eigenvectors is a generalized centroid of R^(i). For convenience, we take r = x̄. Next compute the covariance matrices

   Σ^(i) = (1/N_i) Σ_{x∈R^(i)} (x − r^(i))(x − r^(i))^T

and their eigenvectors e_j^(i).

4. Iterate steps 2 and 3 until the fractional change in the average reconstruction error is below some specified threshold.
6 In deriving the centroid equations, care must be exercised to take into account the dependence of e_j^(i) (and hence Π^(i)) on r^(i).
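A compact NumPy sketch of this loop follows; the function name, convergence test, and handling of empty regions are our own illustrative choices, not specified in the text.

```python
import numpy as np

def vqpca_fit(X, Q, m, tol=1e-4, seed=0):
    """Generalized Lloyd algorithm with the reconstruction distance (eqs. 3.4-3.7)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape
    r = X[rng.choice(T, size=Q, replace=False)].copy()  # step 1: random inputs
    Sigma = np.tile(np.eye(n), (Q, 1, 1))               # step 1: identity covariances
    prev_err = np.inf
    while True:
        # Trailing n - m eigenvectors of each covariance (eigh sorts ascending).
        P = [np.linalg.eigh(S)[1][:, : n - m].T for S in Sigma]
        # Step 2: reconstruction distance of every point to every region (eq. 3.5).
        d = np.stack([np.sum(((X - r[i]) @ P[i].T) ** 2, axis=1) for i in range(Q)])
        labels = d.argmin(axis=0)                       # ties go to the lower index
        err = d[labels, np.arange(T)].mean()
        # Step 4: stop when the fractional error change drops below tol.
        if np.isfinite(prev_err) and (prev_err - err) <= tol * prev_err:
            break
        prev_err = err
        # Step 3: generalized centroid (region mean, eq. 3.7) and new covariances.
        for i in range(Q):
            pts = X[labels == i]
            if len(pts) == 0:
                continue                                # leave an empty region as is
            r[i] = pts.mean(axis=0)
            Sigma[i] = np.cov(pts.T, bias=True)
    return r, Sigma, labels
```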
Following training, vectors are encoded and decoded as follows. To encode a vector x, find the reference vector r^(i) that minimizes the reconstruction distance d(x, r^(i)), and project x − r^(i) onto the leading m eigenvectors of the corresponding covariance matrix Σ^(i) to obtain the local principal components

   z = ( e_1^(i) · (x − r^(i)), . . . , e_m^(i) · (x − r^(i)) ).   (3.8)

The encoding of x consists of the index i and the m local principal components z. The decoding, or reconstruction, of the vector x is

   x̂ = r^(i) + Σ_{j=1}^{m} z_j e_j^(i).   (3.9)
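As an illustration, encoding and decoding per equations 3.8 and 3.9 might look as follows in NumPy (a sketch; in practice one would cache the eigenvectors rather than recompute them on every call).

```python
import numpy as np

def vqpca_encode(x, r, Sigma, m):
    """Encode x as (region index i, m local principal components z); eq. 3.8."""
    n = x.shape[0]
    best = (None, np.inf, None)
    for i in range(len(r)):
        E = np.linalg.eigh(Sigma[i])[1]                  # eigenvectors, ascending
        d = np.sum((E[:, : n - m].T @ (x - r[i])) ** 2)  # reconstruction distance
        if d < best[1]:
            best = (i, d, E)
    i, _, E = best
    leading = E[:, :-(m + 1):-1]                 # leading m eigenvectors, descending
    z = leading.T @ (x - r[i])
    return i, z

def vqpca_decode(i, z, r, Sigma):
    """Reconstruct x_hat = r^(i) + sum_j z_j e_j^(i); eq. 3.9."""
    m = z.shape[0]
    E = np.linalg.eigh(Sigma[i])[1]
    leading = E[:, :-(m + 1):-1]
    return r[i] + leading @ z
```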
The clustering in the algorithm directly minimizes the expected reconstruction distance since it is a generalized Lloyd algorithm with the reconstruction distance as the distortion measure. Training in batch mode avoids recomputing the eigenvectors after each input vector is presented.

3.3 Accelerated Clustering. Vector quantization partitions data by calculating the distance between each data point x and all of the reference vectors r^(i). The search and storage requirements for computing the partition can be streamlined by constraining the structure of the VQ. These constraints can compromise performance relative to what could be achieved with a standard, or unconstrained, VQ with the same number of regions. However, the constrained architectures allow a given hardware configuration to support a quantizer with many more regions than practicable for the unconstrained architecture, and they can thus improve speed and accuracy relative to what one can achieve in practice using an unconstrained quantizer. The two most common structures are the tree-search VQ and the multistage VQ (Gersho & Gray, 1992).

Tree-search VQ was designed to alleviate the search bottleneck for encoding. At the root of the tree, a partition into b_0 regions is constructed. At the next level of the tree, each of these regions is further partitioned into b_1 regions (often b_0 = b_1 = · · ·), and so forth. After k levels, each with branching ratio b, there are b^k regions in the partition. Encoding new data requires at most kb distortion calculations, whereas the unconstrained quantizer with the same number of regions requires up to b^k distortion calculations. Thus the search complexity grows only logarithmically with the number of regions for the tree-search VQ, whereas the search complexity grows linearly with the number of regions for the unconstrained quantizer. However, the number of reference vectors in the tree is b(b^k − 1)/(b − 1), whereas the unconstrained quantizer requires only b^k reference vectors. Thus the tree typically requires more storage.
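The counts above are easy to check numerically; the following sketch (ours, not from the paper) reproduces them for the tree used in the speech experiments below.

```python
def tree_vs_flat(b, k):
    """Search and storage counts for a depth-k tree with branching ratio b."""
    regions = b ** k                              # regions in the final partition
    tree_search = k * b                           # at most kb distortion calcs
    flat_search = regions                         # up to b^k distortion calcs
    tree_storage = b * (b ** k - 1) // (b - 1)    # reference vectors in the tree
    flat_storage = regions
    return tree_search, flat_search, tree_storage, flat_storage

# b = 15, k = 2 (the VQPCA-E-T (15 x 15) tree of Table 1): search drops from
# 225 to 30 distortion calculations; storage rises from 225 to 240 vectors.
print(tree_vs_flat(15, 2))  # (30, 225, 240, 225)
```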
Multistage VQ is a form of product coding and as such is economical in both search and storage requirements. At the first stage, a partition into b_0 regions is constructed. Then all of the data are encoded using this partition, and the residuals ε = x − r^(i) from all b_0 regions are pooled. At the next stage, these residuals are quantized by a b_1-region quantizer, and so forth. The final k-stage quantizer (again assuming b regions at each level) has b^k effective regions. Encoding new data requires at most kb distortion calculations. Although the search complexity is the same as for the tree quantizer, the storage requirements are more favorable than for either the tree or the unconstrained quantizer. There are a total of b^k regions generated by the multistage architecture, requiring only kb reference vectors. The unconstrained quantizer would require b^k reference vectors to generate the same number of regions, and the tree requires more.

The drawback is that the shapes of the regions in a multistage quantizer are severely restricted. By sketching regions for a two-stage quantizer with two or three regions per stage, the reader can easily show that the final regions consist of two or three shapes that are copied and translated to tile the data space. In contrast, an unconstrained quantizer will construct as many different shapes as required to encode the data with minimal distortion.

4 Experimental Results

In this section we present the results of experiments comparing global PCA, five-layer autoassociators, and several variants of VQPCA applied to dimension reduction of speech and image data. We compare the algorithms' training time and the distortion in the reconstructed signal. The distortion measure we use is the reconstruction error normalized by the data variance,
   E_norm ≡ E[ ‖x − x̂‖² ] / E[ ‖x − E[x]‖² ],   (4.1)
where the expectations are with respect to the empirical data distribution.

We trained the autoassociators using three optimization techniques: conjugate gradient descent (CGD), stochastic gradient descent (SGD), and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Press, Flannery, Teukolsky, & Vetterling, 1987). The gradient descent code has a momentum term and learning rate annealing as in Darken and Moody (1991). We give results for VQPCA using both Euclidean and reconstruction distance, with unconstrained, multistage, and tree-search VQ. For the Euclidean distortion measure with an unconstrained quantizer, we used online competitive learning with the annealed learning rate schedule in Darken and Moody (1991).
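Equation 4.1 amounts to a two-line computation over the empirical data; a NumPy sketch (names ours):

```python
import numpy as np

def normalized_error(X, X_hat):
    """E_norm of equation 4.1, with expectations taken over the rows of X."""
    num = np.mean(np.sum((X - X_hat) ** 2, axis=1))           # E[||x - x_hat||^2]
    den = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))  # E[||x - E[x]||^2]
    return num / den
```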
We computed the global PCA for the speech data by diagonalizing the covariance matrices using Householder reduction followed by QL.⁷ The image training set has fewer data points than the input dimension. Thus the covariance matrix is singular, and we cannot use Householder reduction. Instead we computed the PCA using singular-value decomposition applied to the data matrix.

All of the architecture selection was carried out by monitoring performance on a validation (or holdout) set. To limit the space of architectures, the autoassociators have an equal number of nodes in the second and fourth (sigmoidal) layers. These were varied from 5 to 50 in increments of 5. The networks were regularized by early stopping on a validation data set. For VQPCA, we varied the number of local regions for the unconstrained VQ from 5 to 50 in increments of 5. The multistage VQ examples used two levels, with the number of regions at each level varied from 5 to 50 in increments of 5. The branching factor of the tree-structured VQ was varied from 2 to 9. For all experiments and all architectures, we report only the results on those architectures that obtained the lowest reconstruction error on the validation data set. In this sense, we report the best results we obtained.

4.1 Dimension Reduction of Speech. We drew examples of the 12 monophthongal vowels extracted from continuous speech in the TIMIT database (Fisher & Doddington, 1986). Each input vector consists of the lowest 32 discrete Fourier transform coefficients (spanning the frequency range 0-4 kHz), time-averaged over the central third of the vowel segment. The training set contains 1200 vectors, the validation set 408 vectors, and the test set 408 vectors. The test set utterances are from speakers not represented in the training or validation sets. Motivated by the desire to capture formant structure in the vowel encodings, we reduced the data from 32 to 2 dimensions. The results of our experiments are shown in Table 1.

Table 1 shows the mean and 2-σ error bars (computed over four random initializations) of the test set reconstruction error, and the Sparc 2 training times (in seconds). The numbers in parentheses are the values of architectural parameters. For example, Autoassoc. (35) is a five-layer autoassociator with 35 nodes in each of the second and fourth layers, and VQPCA-E-MS (40 × 40) is a Euclidean-distortion, multistage VQPCA with 40 cells in each of two stages.

The VQPCA encodings have about half the reconstruction error of the global PCA or the five-layer networks. The latter failed to obtain significantly better results than the global PCA. The autoassociators show high variance in the reconstruction errors with respect to different random weight initializations. Several nets had higher error than PCA, indicating trapping
7 The Householder algorithm reduces the covariance matrix to tridiagonal form, which is then diagonalized by the QL procedure (Press et al., 1987).
Table 1: Speech Dimension Reduction.

Algorithm               Enorm           Training Time (seconds)
PCA                     0.443           11
Autoassoc.-CGD (35)     0.496 ± .103    7784 ± 7442
Autoassoc.-BFGS (20)    0.439 ± .059    3284 ± 1206
Autoassoc.-SGD (35)     0.440 ± .016    35,502 ± 182
VQPCA-Eucl (50)         0.272 ± .010    1915 ± 780
VQPCA-E-MS (40 × 40)    0.244 ± .008    144 ± 41
VQPCA-E-T (15 × 15)     0.259 ± .002    195 ± 10
VQPCA-Recon (45)        0.230 ± .004    864 ± 102
VQPCA-R-MS (45 × 45)    0.208 ± .005    924 ± 50
VQPCA-R-T (9 × 9)       0.242 ± .005    484 ± 128
in particularly poor local optima. As expected, stochastic gradient descent showed less variance due to initialization. In contrast, the local linear algorithms are relatively insensitive to initialization. Clustering with reconstruction distance produced somewhat lower error than clustering with Euclidean distance. For both distortion measures, the multistage architecture produced the lowest error.

The five-layer autoassociators are very slow to train. The Euclidean multistage and tree-structured local linear algorithms trained more than an order of magnitude faster than the autoassociators. For the unconstrained, Euclidean-distortion VQPCA, the partition was determined by online competitive learning and could probably be speeded up a bit by using a batch algorithm.

In addition to training time, practicality depends on the time required to encode and decode new data. Table 2 shows the number of floating-point operations required to encode and decode the entire database for the different algorithms. The VQPCA algorithms, particularly those using the reconstruction distance, require many more floating-point operations to encode the data than the autoassociators. However, the decoding is much faster than that for autoassociators and is comparable to PCA.⁸ The results indicate that VQPCA may not be suitable for real-time applications like videoconferencing, where very fast encoding is desired. However, when only the decoding speed is of concern (e.g., for data retrieval), VQPCA algorithms are a good choice because of their accuracy and fast decoding.

8 However, parallel implementation of the autoassociators could outperform the VQPCA decode in terms of clock time.
Table 2: Encoding and Decoding Times for the Speech Data.

Algorithm               Encode Time (FLOPs)    Decode Time (FLOPs)
PCA                     158                    128
Autoassoc.-CGD (35)     2380                   2380
Autoassoc.-BFGS (20)    1360                   1360
Autoassoc.-SGD (35)     2380                   2380
VQPCA-Eucl (50)         4957                   128
VQPCA-E-MS (40 × 40)    7836                   192
VQPCA-E-T (15 × 15)     3036                   128
VQPCA-Recon (45)        87,939                 128
VQPCA-R-MS (45 × 45)    96,578                 192
VQPCA-R-T (9 × 9)       35,320                 128
In order to test the variability of these results across different training and test sets, we reshuffled and repartitioned the data into new training, validation, and test sets of the same size as those above. The new data sets gave results very close to those reported here (Kambhatla, 1995).

4.2 Dimension Reduction of Images. Our image database consists of 160 images of the faces of 20 people. Each image is a 64 × 64, 8-bit/pixel grayscale image. The database was originally generated by Cottrell and Metcalfe and used in their study of identity, gender, and emotion recognition (Cottrell & Metcalfe, 1991). We adopted the database, and the preparation used here, in order to compare our dimension reduction algorithms with the nonlinear autoassociators used by DeMers and Cottrell (1993). As in DeMers and Cottrell (1993), each image is used complete, as a 4096-dimensional vector, and is preprocessed by extracting the leading 50 principal components computed from the ensemble of 160 images. Thus the base dimension is 50. As in DeMers and Cottrell (1993), we examined reduction to five dimensions.

We divided the data into a training set containing 120 images, a validation set of 20 images, and a test set of 20 images. We used PCA, five-layer autoassociators, and VQPCA for reduction to five dimensions. Due to memory constraints, the autoassociators were limited to 5 to 40 nodes in each of the second and fourth layers. The autoassociators and the VQPCA were trained with four different random initializations of the free parameters.

The experimental results are shown in Table 3. The five-layer networks attain about 30 percent lower error than global PCA. VQPCA with either Euclidean or reconstruction distance distortion measures attains about 40 percent lower error than the best autoassociator.
Table 3: Image Dimension Reduction.

Algorithm               Enorm           Training Time (seconds)
PCA                     0.463           5
Autoassoc.-CGD (35)     0.441 ± .090    698 ± 533
Autoassoc.-BFGS (35)    0.377 ± .127    18,905 ± 15,081
Autoassoc.-SGD (25)     0.327 ± .027    4171 ± 41
VQPCA-Eucl (20)         0.179 ± .048    202 ± 57
VQPCA-E-MS (5 × 5)      0.307 ± .031    14 ± 2
VQPCA-E-Tree (4 × 4)    0.211 ± .064    31 ± 9
VQPCA-Recon (20)        0.173 ± .050    62 ± 5
VQPCA-R-MS (20 × 20)    0.240 ± .042    78 ± 32
VQPCA-R-Tree (5 × 5)    0.218 ± .029    79 ± 15
Table 4: Encoding and Decoding Times (FLOPs) for the Image Data.

Algorithm               Encode Time (FLOPs)    Decode Time (FLOPs)
PCA                     545                    500
Autoassoc.-CGD (35)     3850                   3850
Autoassoc.-BFGS (35)    3850                   3850
Autoassoc.-SGD (25)     2750                   2750
VQPCA-Eucl (20)         3544                   500
VQPCA-E-MS (5 × 5)      2043                   600
VQPCA-E-T (4 × 4)       1743                   500
VQPCA-Recon (20)        91,494                 500
VQPCA-R-MS (20 × 20)    97,493                 600
VQPCA-R-T (5 × 5)       46,093                 500
There is little distinction between the Euclidean and reconstruction distance clustering for these data. The VQPCA trains significantly faster than the autoassociators. Although the conjugate gradient algorithm is relatively quick, it generates encodings inferior to those obtained with the stochastic gradient descent and BFGS simulators.

Table 4 shows the encode and decode times for the different algorithms. We again note that VQPCA algorithms using reconstruction distance clustering require many more floating-point operations (FLOPs) to encode an input vector than does the Euclidean distance algorithm or the five-layer networks.
Table 5: Image Dimension Reduction: Training on All Available Data.

Algorithm               Enorm     Training Time (seconds)
PCA                     0.405     7
Autoassoc.-SGD (30)     0.103     25,306
Autoassoc.-SGD (40)     0.073     31,980
VQPCA-Eucl (25)         0.026     251
VQPCA-Recon (30)        0.022     116
However, as before, the decode times are much less for VQPCA. As before, shuffling and repartitioning the data into training, validation, and test data sets and repeating the experiments returned results very close to those given here.

Finally, in order to compare directly with DeMers and Cottrell's (1993) results, we also conducted experiments training with all the data (no separation into validation and test sets). This is essentially a model-fitting problem, with no influence from statistical sampling. We show results only for the autoassociators trained with SGD, since these returned lower error than the conjugate gradient simulators, and the memory requirements for BFGS were prohibitive. We report the results from those architectures that provided the lowest reconstruction error on the training data. The results are shown in Table 5.

Both nonlinear techniques produce encodings with lower error than PCA, indicating significant nonlinear structure in the data. For the same data and using a five-layer autoassociator with 30 nodes in each of the second and fourth layers, DeMers and Cottrell (1993) obtain a reconstruction error E_norm = 0.1317.⁹ This is comparable to our results. We note that the VQPCA algorithms train two orders of magnitude faster than the networks while obtaining encodings with about one-third the reconstruction error.

It is useful to examine the images obtained from the encodings for the various algorithms. Figure 4 shows two sample images from the data set along with their reconstructions from five-dimensional encodings. The algorithms correspond to those reported in Table 5. These two images were selected because their reconstruction error closely matched the average. The left-most column shows the images as reconstructed from the 50 principal components. The second column shows the reconstruction from 5 principal components.

9 DeMers and Cottrell report half the MSE per output node, E = (1/2)(1/50) MSE = 0.001. This corresponds to E_norm = 0.1317.
Figure 4: Two representative images. Left to right: original image reconstructed from 50 principal components, followed by reconstructions from 5-D encodings by PCA, Autoassoc-SGD (40), VQPCA-Eucl (25), and VQPCA-Recon (30). The normalized reconstruction errors and training times for the whole data set (all the images) are given in Table 5.
The third column is the reconstruction from the five-layer autoassociator, and the last two columns are the reconstructions from the Euclidean and reconstruction distance VQPCA. The five-dimensional PCA has grossly reduced resolution and gray-scale distortion (e.g., the hair in the top image). All of the nonlinear algorithms produce superior results, as indicated by the reconstruction error. The lower image shows a subtle difference between the autoassociator and the two VQPCA reconstructions; the posture of the mouth is correctly recovered in the latter.

5 Discussion

We have applied locally linear models to the task of dimension reduction, finding superior results to both PCA and the global nonlinear model built by five-layer autoassociators. The local linear models train significantly faster than autoassociators, with their training time dominated by the partitioning step. Once the model is trained, encoding new data requires computing which region of the partition the new data fall in, and thus VQPCA requires more computation for encoding than does the autoassociator. However, decoding data with VQPCA is faster than decoding with the autoassociator and comparable to decoding with PCA. With these considerations, VQPCA is perhaps not optimal for real-time encoding, but its accuracy and computational speed for decoding make it superior for applications like image retrieval from databases.
In order to optimize the partition with respect to the accuracy of the projected data, we introduced a new distortion function for the VQ, the reconstruction distance. Clustering with the reconstruction distance does indeed provide lower reconstruction error, though the difference is data dependent.

We cannot offer a definitive reason for the superiority of the VQPCA representations over five-layer autoassociators. In particular, we are uncertain how much of the failure of autoassociators is due to optimization problems and how much to representation capability. We have observed that changes in initialization result in large variability in the reconstruction error of solutions arrived at by autoassociators. We also see strong dependence on the optimization technique. This lends support to the notion that some of the failure is a training problem. It may be that the space of architectures we have explored does have better solutions but that the available optimizers fail to find them.

The optimization problem is a very real block to the effective use of five-layer autoassociators. Although the architecture is an elegant construction for nonlinear dimension reduction, realizing consistently good maps with standard optimization techniques has proved elusive on the examples we considered here.

Optimization may not be the only issue. Autoassociators constructed from neurons with smooth activation functions are constrained to generate smooth projections and immersions. The VQPCA algorithms have no inherent smoothness constraint. This will be an advantage when the data are not well described by a manifold.¹⁰ Moreover, Malthouse (1996) gives a simple example of data for which continuous projections are necessarily suboptimal.

In fact, if the data are not well described by a manifold, it may be advantageous to choose the representation dimension locally, allowing a different dimension for each region of the partition. The target dimension could be chosen using a holdout data set, or more economically by a cost function that includes a penalty for model complexity. Minimum description length (MDL) criteria have been used for PCA (Hediger et al., 1990; Wax & Kailath, 1985) and presumably could be applied to VQPCA. We have explored no such methods for estimating the appropriate target dimension. In contrast, the autoassociator algorithm given by DeMers and Cottrell (1993) includes a pruning strategy to progressively reduce the dimensionality of the representation layer, under the constraint that the reconstruction error not grow above a desired threshold.

Two applications deserve attention. The first is transform coding, for which the algorithms discussed here are naturally suited. In transform coding, one transforms the data, such as image blocks, to a lower-redundancy
10 A visualization of just such a case is given in Kambhatla (1995).
representation and then scalar quantizes the new representation. This produces a product code for the data. Standard approaches include preprocessing by PCA or discrete cosine transform, followed by scalar quantization (Wallace, 1991). As discussed in the introduction, the nonlinear transforms considered here provide more accurate representations than PCA and should provide for better transform coding. This work suggests a full implementation of transform coding, with comparisons between PCA, autoassociators, and VQPCA in terms of rate-distortion curves.

Transform coding using VQPCA with the reconstruction distance clustering requires additional algorithm development. The reconstruction distance distortion function depends explicitly on the target dimension, while the latter depends on the allocation of transform bits between the new coordinates. Consequently a proper transform coding scheme needs to couple the bit allocation to the clustering, an enhancement that we are developing.

The second potential application is in novelty detection. Recently several authors have used three-layer autoassociators to build models of normal equipment function (Petsche et al., 1996; Japkowicz et al., 1995). Equipment faults are then signaled by the failure of the model to reconstruct the new signal accurately. The nonlinear models provided by VQPCA should provide more accurate models of the normal data and hence improve the sensitivity and specificity for fault detection.

Acknowledgments

This work was supported in part by grants from the Air Force Office of Scientific Research (F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2). We thank Gary Cottrell and David DeMers for supplying image data and the Center for Spoken Language Understanding at the Oregon Graduate Institute of Science and Technology for speech data. We are grateful for the reviewer's careful reading and helpful comments.

References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53-58.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cyb., 59, 291-294.

Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using manifold learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems 7. Cambridge, MA: MIT Press.

Broomhead, D. S. (1991, July). Signal processing for nonlinear systems. In S. Haykin (Ed.), Adaptive Signal Processing, SPIE Proceedings (pp. 228-243). Bellingham, WA: SPIE.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. J. Amer. Stat. Assoc., 83, 596-610.

Coetzee, F. M., & Stonick, V. L. (1995). Topology and geometry of single hidden layer network, least squares weight solutions. Neural Computation, 7, 672-705.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, emotion, and gender recognition using holons. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564-571). San Mateo, CA: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1987). Learning internal representations from gray-scale images: An example of extensional programming. In Proceedings of the Ninth Annual Cognitive Science Society Conference (pp. 461-473). Seattle, WA.

Darken, C., & Moody, J. (1991). Note on learning rate schedules for stochastic optimization. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3. San Mateo, CA: Morgan Kaufmann.

DeMers, D., & Cottrell, G. (1993). Non-linear dimensionality reduction. In C. Giles, S. Hanson, & J. Cowan (Eds.), Advances in neural information processing systems 5. San Mateo, CA: Morgan Kaufmann.

Dony, R. D., & Haykin, S. (1995). Optimally adaptive transform coding. IEEE Transactions on Image Processing (pp. 1358-1370).

Fisher, W. M., & Doddington, G. R. (1986). The DARPA speech recognition research database: Specification and status. In Proceedings of the DARPA Speech Recognition Workshop (pp. 93-99). Palo Alto, CA.

Fukunaga, K., & Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, C-20, 176-183.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.

Gersho, A., & Gray, R. M. (1992). Vector quantization and signal compression. Boston: Kluwer.

Golomb, B. A., Lawrence, D. T., & Sejnowski, T. J. (1991). Sexnet: A neural network identifies sex from human faces. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 572-577). San Mateo, CA: Morgan Kaufmann.

Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, pp. 4-29.

Hastie, T. (1984). Principal curves and surfaces. Unpublished dissertation, Stanford University.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502-516.

Hecht-Nielsen, R. (1995). Replicator neural networks for universal optimal source coding. Science, 269, 1860-1863.

Hediger, T., Passamante, A., & Farrell, M. E. (1990). Characterizing attractors using local intrinsic dimensions calculated by singular-value decomposition and information-theoretic criteria. Physical Review, A41, 5325-5332.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems 7. Cambridge, MA: MIT Press.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-368.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of IJCAI.

Kambhatla, N. (1995). Local models and gaussian mixture models for statistical data processing. Unpublished doctoral dissertation, Oregon Graduate Institute.

Kambhatla, N., & Leen, T. K. (1994). Fast non-linear dimension reduction. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6. San Mateo, CA: Morgan Kaufmann.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243.

Kung, S. Y., & Diamantaras, K. I. (1990). A neural network learning algorithm for adaptive principal component extraction (APEX). In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 861-864).

Malthouse, E. C. (1996). Some theoretical results on non-linear principal components analysis (Unpublished research report). Evanston, IL: Kellogg School of Management, Northwestern University.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15, 267-273.

Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61-68.

Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural networks. In Artificial neural networks (pp. 737-745). Amsterdam: Elsevier Science Publishers.

Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. (1996). A neural network autoassociator for induction motor failure prediction. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press.

Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1987). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.

Rognvaldsson, T. (1994). On Langevin updating in multilayer perceptrons. Neural Computation, 6, 916-926.

Rubner, J., & Tavan, P. (1989). A self-organizing network for principal component analysis. Europhysics Lett., 20, 693-698.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1. San Mateo, CA: Morgan Kaufmann.

Usui, S., Nakauchi, S., & Nakano, M. (1991). Internal color representation acquired by a five-layer neural network. In O. Simula, T. Kohonen, K. Makisara, & J. Kangas (Eds.), Artificial neural networks. Amsterdam: Elsevier Science Publishers, North-Holland.

Wallace, G. K. (1991). The JPEG still picture compression standard. Communications of the ACM, 34, 31-44.

Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-33(2), 387-392.
Yang, L., & Yu, W. (1993). Backpropagation with homotopy. Neural Computation, 5, 363–366.
Received September 6, 1996; accepted February 28, 1997.
Communicated by Robert Jacobs
A Constructive, Incremental-Learning Network for Mixture Modeling and Classification

James R. Williamson
Center for Adaptive Systems and Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215, U.S.A.
Gaussian ARTMAP (GAM) is a supervised-learning adaptive resonance theory (ART) network that uses gaussian-defined receptive fields. Like other ART networks, GAM incrementally learns and constructs a representation of sufficient complexity to solve a problem it is trained on. GAM's representation is a gaussian mixture model of the input space, with learned mappings from the mixture components to output classes. We show a close relationship between GAM and the well-known expectation-maximization (EM) approach to mixture modeling. GAM outperforms an EM classification algorithm on three classification benchmarks, thereby demonstrating the advantage of the ART match criterion for regulating learning and the ARTMAP match tracking operation for incorporating environmental feedback in supervised learning situations.

1 Introduction

Adaptive resonance theory (ART) networks construct stable recognition categories for unsupervised clustering using fast, incremental learning (Carpenter & Grossberg, 1987). The size of clusters coded by ART categories is determined by a global match criterion. ART networks have been extended into supervised-learning ARTMAP networks, which use predictive feedback to regulate the ART clustering mechanism in order to learn multidimensional input-output mappings (Carpenter, Grossberg, & Reynolds, 1991). If an ARTMAP network's prediction is incorrect, then a match tracking signal from the network's output layer raises the match criterion and thus alters clustering in the ART module. In this way, ARTMAP networks realize perhaps the minimal possible extension to ART networks to enable supervised learning while preserving the ART design constraint of fast, incremental learning using only local update rules. In contrast, many online supervised-learning networks, such as multilayer perceptrons and adaptive radial basis function networks, are less local in nature because their gradient-descent learning algorithms require that error signals computed at each of the parameters in the output layer be fed back to each of the parameters in the hidden layer (Rumelhart, Hinton, & Williams, 1986; Poggio & Girosi, 1989).

Neural Computation 9, 1517-1543 (1997)
© 1997 Massachusetts Institute of Technology
A new ARTMAP network called gaussian ARTMAP (GAM), which uses internal recognition categories that have gaussian-defined receptive fields, has recently been introduced and applied to several classification problems (Williamson, 1996a, 1996b; Grossberg & Williamson, 1997). GAM's recognition categories learn a gaussian mixture model of the input space as well as mappings to the output classes. When GAM makes an incorrect prediction, its match tracking operation is triggered. The network's vigilance level is raised by adjusting the match criterion, which restricts activation to only those categories that have a sufficiently good match to the input. Match tracking continues until a correct prediction is made, after which the network learns. Thus, match tracking dynamically regulates learning based on predictive feedback. In addition, if no committed categories satisfy the match criterion, a new, uncommitted category is chosen. By this process GAM incrementally constructs a representation of sufficient complexity to solve a classification problem.

GAM is closely related to the expectation-maximization (EM) approach to mixture modeling (Dempster, Laird, & Rubin, 1977). We show that the EM algorithm for unsupervised density estimation using (separable) gaussian mixtures is essentially the same as the GAM equations for modeling the density of its input space, except that GAM is set apart by three features that are standard for ART networks:

1. GAM uses incremental learning, in which the parameters are updated after each input sample, whereas EM uses batch learning, which requires the entire data set. Incremental variants of the EM algorithm will also be discussed.

2. GAM restricts learning of the current data sample to the subset of categories that satisfy its match criterion, whereas EM allows all mixture components to be affected by all data samples.

3. GAM is a constructive network that chooses new, uncommitted categories during training when no committed categories satisfy its match criterion, whereas EM uses a constant, preset number of components.

A straightforward extension of the unsupervised EM mixture modeling algorithm to supervised classification problems involves modeling the class label as a multinomial variable. In this way, the mixture components represent the I → O mapping from a real-valued input space to a discrete-valued output space by modeling joint gaussian and multinomial densities in the I/O space (Ghahramani & Jordan, 1994). We show a close relationship between this EM classification algorithm and GAM. However, GAM is set apart by match tracking, which causes GAM to "pay attention" to its training errors and devote more resources to troublesome regions of its I/O space. GAM thereby learns a more effective representation of the I → O mapping than EM, as demonstrated by GAM's superior performance to EM on three classification benchmarks.
2 Gaussian ARTMAP

2.1 Category Match and Activation. GAM consists of an input layer, F1, and an internal category layer, F2, which receives input from F1 via adaptive weights. Activations at F1 and F2 are denoted, respectively, by x = (x_1, . . . , x_M) and y = (y_1, . . . , y_N), where M is the dimensionality of the input space and N is the current number of committed F2 category nodes. Each F2 category, j, models a local density of the input space with a separable gaussian receptive field and maps to an output class prediction. The category's receptive field is defined with a separable gaussian distribution parameterized by two M-dimensional vectors: its mean, μ_j, and standard deviation, σ_j. A scalar, n_j, also represents the amount of training data for which the node has received credit.

Category j is activated only if its match, G_j, satisfies the match criterion, which is determined by a vigilance parameter, ρ. Match is a measure, obtained from the category's unit-height gaussian distribution, of how close an input is to the category's mean, relative to its standard deviation:

   G_j = exp( −(1/2) Σ_{i=1}^{M} ((x_i − μ_ji)/σ_ji)² ).   (2.1)

The match criterion is a threshold: the category is activated only if G_j > ρ; otherwise, the category is reset. If the match criterion is satisfied, the category's net input signal, g_j, is determined by modulating its match value by n_j, which is proportional to the category's a priori probability, and by (Π_{i=1}^{M} σ_ji)^{−1}, which normalizes its gaussian distribution:

   g_j = ( n_j / Π_{i=1}^{M} σ_ji ) G_j  if G_j > ρ;   g_j = 0 otherwise.   (2.2)

The category's activation, y_j, represents its conditional probability for being the "source" of the input vector: P(j | x). This is obtained by normalizing the category's input strength:

   y_j = g_j / Σ_{l=1}^{N} g_l.   (2.3)
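A NumPy sketch of equations 2.1 through 2.3, vectorized over the N committed categories (the function and array names are ours, not from the paper):

```python
import numpy as np

def gam_activation(x, mu, sigma, n, rho):
    """Match G_j (eq. 2.1), net input g_j (eq. 2.2), activation y_j (eq. 2.3).

    mu, sigma : (N, M) category means and standard deviations;
    n : (N,) credit counts; rho : current vigilance.
    """
    G = np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2, axis=1))    # eq. 2.1
    g = np.where(G > rho, (n / np.prod(sigma, axis=1)) * G, 0.0)  # eq. 2.2
    total = g.sum()
    y = g / total if total > 0 else g   # eq. 2.3; all-zero if every category reset
    return G, g, y
```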
As originally proposed, GAM used a choice activation rule at F2: y_j = 1 if g_j > g_l for all l ≠ j; y_j = 0 otherwise (Williamson, 1996a). In this version, only a single, chosen category learned on each trial. Here, we describe a distributed-learning version of GAM, which uses the distributed F2 activation equation 2.3. Distributed GAM was introduced in Williamson (1996b), where it was shown to obtain a more efficient representation than GAM with choice learning. Distributed GAM has also been applied as part of an image classification system, where it outperformed existing state-of-the-art
image classification systems that use rule-based, multilayer perceptron, and k-nearest neighbor classifiers (Grossberg & Williamson, 1997).

2.2 Prediction and Match Tracking. Equations 2.1 through 2.3 describe activation of category nodes in an unsupervised-learning gaussian ART module. The following equations describe GAM's supervised-learning mechanism, which incorporates feedback from class predictions made by the F2 category nodes and thus turns gaussian ART into gaussian ARTMAP.

When a category, j, is first chosen, it learns a permanent mapping to the output class, k, associated with the current training sample. All categories that map to the same class prediction belong to the same ensemble: j ∈ E(k). Each time an input is presented, the categories in each ensemble sum their activations to generate a net probability estimate, z_k, of the class prediction k that they share:

   z_k = Σ_{j∈E(k)} y_j.   (2.4)
The system prediction, K, is obtained from the maximum probability estimate,

   K = arg max_k (z_k),   (2.5)
which also determines the chosen ensemble. On real-world problems, the probability estimate z_K has been found to predict accurately the probability that prediction K is correct (Grossberg & Williamson, 1997).

Note that category j's initial activation, y_j, represents P(j | x). Once the class prediction K is chosen, we obtain the category's "chosen-ensemble" activation, y*_j, which represents P(j | x, K):

   y*_j = y_j / Σ_{l∈E(K)} y_l  if j ∈ E(K);   y*_j = 0 otherwise.   (2.6)
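Continuing the sketch above, the prediction and chosen-ensemble activations of equations 2.4 through 2.6 might be computed as follows; the integer `ensemble` array encoding j ∈ E(k) is our own representation.

```python
import numpy as np

def gam_predict(y, ensemble, n_classes):
    """Class estimates z_k (eq. 2.4), prediction K (eq. 2.5), y*_j (eq. 2.6).

    ensemble : (N,) int array, ensemble[j] = class that category j maps to.
    """
    z = np.bincount(ensemble, weights=y, minlength=n_classes)  # eq. 2.4
    K = int(np.argmax(z))                                      # eq. 2.5
    y_star = np.where(ensemble == K, y, 0.0)                   # eq. 2.6
    if y_star.sum() > 0:
        y_star = y_star / y_star.sum()
    return z, K, y_star
```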
If K is the correct prediction, then the network resonates and learns the current input. If K is incorrect, then match tracking is invoked. As originally conceived, match tracking involves raising ρ continuously, causing categories j, such that G_j ≤ ρ, to be reset until the correct prediction is finally selected (Carpenter, Grossberg, & Rosen, 1991). Because GAM uses a distributed representation at F2, each z_k may be determined by multiple categories, according to equation 2.6. Therefore, it is difficult to determine numerically how much ρ needs to be raised in order to select a different prediction. It is inefficient (on a conventional computer) to determine the exact amount to raise ρ by repeatedly resetting the active category with the lowest match value G_j, each time reevaluating equations 2.3, 2.4, and 2.5, until a new prediction is finally selected.
Instead, a one-shot match tracking algorithm has been developed for GAM and used successfully on several classification problems (Williamson, 1996b; Grossberg & Williamson, 1997). This algorithm involves raising ρ to the average match value of the chosen ensemble:

   ρ = Σ_{j∈E(K)} y*_j exp( −(1/2) Σ_{i=1}^{M} ((x_i − μ_ji)/σ_ji)² ).   (2.7)
In addition, all categories in the chosen ensemble are reset: g_j = 0 for all j ∈ E(K). Equations 2.2 through 2.5 are then reevaluated. Based on the remaining nonreset categories, a new prediction K in equation 2.5, and its corresponding ensemble, is chosen. This automatic search cycle continues until the correct prediction is made or until all committed categories are reset, G_j ≤ ρ for all j ∈ {1, . . . , N}, and an uncommitted category is chosen. Match tracking ensures that the correct prediction comes from an ensemble with a better match to the training sample than all reset ensembles. Upon presentation of the next training sample, ρ is reassigned its baseline value: ρ = ρ̄.

2.3 Learning. The F2 parameters μ_j and σ_j are updated to represent the sample statistics of the input using local learning rules that are related to the instar, or gated steepest descent, learning rule (Grossberg, 1976). Instar learning is an associative rule in which the postsynaptic activity y*_j modulates the rate at which the weight w_ji tracks the presynaptic signal f(x_i),

   ε (d/dt) w_ji = y*_j [ f(x_i) − w_ji ].   (2.8)

The discrete-time version of equation 2.8 is

   w_ji := (1 − ε^{−1} y*_j) w_ji + ε^{−1} y*_j f(x_i).   (2.9)

GAM's learning equations are obtained by modifying equation 2.9. The rate constant ε is replaced by n_j, which is incremented to represent the cumulative chosen-ensemble activation of node j and thus the amount of training data the node has been assigned credit for:

   n_j := n_j + y*_j.   (2.10)

Modulation of learning by n_j causes the inputs to be weighted equally over time, so that their sample statistics are learned. The presynaptic term f(x_i) is set to x_i and x_i², respectively, for learning the first and second moments of the input. The standard deviation is then derived from these statistics:

   μ_ji := (1 − y*_j n_j^{−1}) μ_ji + y*_j n_j^{−1} x_i,   (2.11)

   ν_ji := (1 − y*_j n_j^{−1}) ν_ji + y*_j n_j^{−1} x_i²,   (2.12)

   σ_ji = √( ν_ji − μ_ji² ).   (2.13)

In Williamson (1996a, 1996b) and Grossberg and Williamson (1997), σ_ji, rather than ν_ji, is incrementally updated via

   σ_ji := √( (1 − y_j n_j^{−1}) σ_ji² + y_j n_j^{−1} (x_i − μ_ji)² ).   (2.14)
Unlike equations 2.12 and 2.13, equation 2.14 biases the estimate of σ_ji because the incremental updates are based on current estimates of μ_ji, which vary over time. This bias appears to be insignificant, however, as our simulations have not revealed a significant advantage for either method. Equations 2.12 and 2.13 are used here solely because they describe a simpler learning rule.

GAM is initialized with N = 0. When an uncommitted category is chosen, N is incremented, and the new category, indexed by N, is initialized with y*_N = 1 and n_N = 0, and with a permanent mapping to the correct output class. Learning then proceeds via equations 2.10 through 2.13, with one modification: a constant, γ², is added to ν_Ni in equation 2.12, which yields σ_Ni = γ in equation 2.13. Initializing categories with this nonzero standard deviation is necessary to make equations 2.1 and 2.2 well defined. Varying γ has a marked effect on learning: as γ is raised, learning becomes slower, but fewer categories are created. Generally, γ is much larger than the final standard deviation that a category converges to. Intuitively, a large γ represents a low level of certainty for, and commitment to, the location in the input space coded by a new category. As γ is raised, the network settles into its input space representation in a slower and more graceful way. Note that best results have generally been obtained by preprocessing the set of input vectors to have the same standard deviation in each dimension, so that γ has the same meaning in all the dimensions.
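A sketch of the incremental updates in equations 2.10 through 2.13; the division guard is ours, and the γ² term for a newly committed category is noted in a comment.

```python
import numpy as np

def gam_learn(x, y_star, mu, sigma, nu, n):
    """Credit-weighted running statistics; eqs. 2.10-2.13, applied in place."""
    n += y_star                                          # eq. 2.10
    rate = np.divide(y_star, n, out=np.zeros_like(n), where=n > 0)[:, None]
    mu += rate * (x - mu)                                # eq. 2.11
    nu += rate * (x ** 2 - nu)                           # eq. 2.12
    # For a newly committed category, gamma**2 is first added to its row of nu,
    # so that eq. 2.13 yields sigma = gamma there.
    sigma[:] = np.sqrt(np.maximum(nu - mu ** 2, 0.0))    # eq. 2.13
    return mu, sigma, nu, n
```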
3 Expectation-Maximization

Now we show the relationship between GAM and the EM approach to mixture modeling. EM is a general iterative optimization technique for obtaining maximum likelihood estimates of observed data that are in some way incomplete (Dempster et al., 1977). Each iteration of EM consists of an expectation step (E-step) followed by a maximization step (M-step). We start with an "incomplete-data" likelihood function of the model given the data and then posit a "complete-data" likelihood function, which is much easier to maximize but depends on unknown, missing data. The E-step finds the expectation of the complete-data likelihood function, yielding a deterministic function. The M-step then updates the system parameters to maximize this function. Dempster et al. (1977) proved that each iteration of EM yields an increase in the incomplete-data likelihood until a local maximum is reached.

3.1 Gaussian Mixture Modeling. First, let us consider density estimation of the input space using (separable) gaussian mixtures. We model the training set, X = {x_t}_{t=1}^{T}, as comprising independent, identically distributed samples generated from a mixture density, which is parameterized by Θ = {α_j, θ_j}_{j=1}^{N}. The incomplete-data density of X given Θ is

   P(X | Θ) = Π_{t=1}^{T} P(x_t | Θ) = Π_{t=1}^{T} Σ_{j=1}^{N} α_j P(x_t | θ_j),   (3.1)

where θ_j parameterizes the distribution of the jth component, and α_j represents its a priori probability, or mixture proportion: α_j ≥ 0 and Σ_{j=1}^{N} α_j = 1. The incomplete-data log likelihood of Θ given X is

   l(Θ | X) = Σ_{t=1}^{T} log Σ_{j=1}^{N} α_j P(x_t | θ_j),   (3.2)
which is difficult to maximize because it includes the log of a sum. Intuitively, equation 3.2 contains a credit assignment problem, because it is not clear which component generated each data sample. To get around this problem, we introduce "missing data" in the form of a set of indicator variables, Z = {z_t}_{t=1}^{T}, such that z_tj = 1 if component j generated sample t and z_tj = 0 otherwise. Now, using the complete data, {X, Z}, we can explicitly assign credit and thus decouple the overall maximization problem into a set of simple maximizations by defining a complete-data density function,

   P(X, Z | Θ) = Π_{t=1}^{T} Π_{j=1}^{N} [ α_j P(x_t | θ_j) ]^{z_tj},   (3.3)

from which we obtain a complete-data log likelihood,

   l_c(Θ | X, Z) = Σ_{t=1}^{T} Σ_{j=1}^{N} z_tj log[ α_j P(x_t | θ_j) ],   (3.4)

which does not include a log of a sum. However, note that l_c(Θ | X, Z) is a random variable because the missing variables Z are unknown. Therefore, the EM algorithm finds the expected value of l_c(Θ | X, Z) in the E-step:

   Q(Θ | Θ^{(p)}) = E[ l_c(Θ | X, Z) | X, Θ^{(p)} ],   (3.5)
where Θ^{(p)} is the set of parameters at the pth iteration. The E-step yields a deterministic function Q(Θ | Θ^{(p)}), and the M-step then maximizes this function with respect to Θ to obtain new parameters, Θ^{(p+1)}:

   Θ^{(p+1)} = arg max_Θ Q(Θ | Θ^{(p)}).   (3.6)

Dempster et al. (1977) proved that each iteration of EM yields an increase in the incomplete-data likelihood until a local maximum is reached:

   l(Θ^{(p+1)} | X) ≥ l(Θ^{(p)} | X).   (3.7)

The E-step in equation 3.5 simplifies to computing (for all t, j) y_tj^{(p)} = E[z_tj | x_t, Θ^{(p)}], the probability that component j generated sample t. For comparison with GAM, we define the density P(x_t | θ_j) as a separable gaussian distribution, yielding

   y_tj^{(p)} = [ α_j^{(p)} (Π_{i=1}^{M} σ_ji^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_ji^{(p)})/σ_ji^{(p)})² ) ] / [ Σ_{l=1}^{N} α_l^{(p)} (Π_{i=1}^{M} σ_li^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_li^{(p)})/σ_li^{(p)})² ) ].   (3.8)
For distributions in the exponential family, the M-step simply updates the model parameters based on their reestimated sufficient statistics, which are computed in a batch procedure that weights each sample by its probability, y_tj^{(p)}:

   α_j^{(p+1)} = (1/T) Σ_{t=1}^{T} y_tj^{(p)},   (3.9)

   μ_ji^{(p+1)} = Σ_{t=1}^{T} y_tj^{(p)} x_ti / Σ_{t=1}^{T} y_tj^{(p)},   (3.10)

   σ_ji^{(p+1)} = √( Σ_{t=1}^{T} y_tj^{(p)} x_ti² / Σ_{t=1}^{T} y_tj^{(p)} − (μ_ji^{(p+1)})² ).   (3.11)
Note that y_tj^{(p)} in equation 3.8 is equivalent to GAM's category activation term y_j in equation 2.3, provided that vigilance is zero (ρ = 0). Also, the EM parameter reestimation equations 3.9 through 3.11 are essentially the same as GAM's learning equations 2.10 through 2.13, except that EM uses batch learning with a constant number of components, while GAM uses incremental learning, updating the parameters after each input sample, and recruiting new categories as needed.
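For comparison, one batch EM iteration (equations 3.8 through 3.11) in NumPy; the log-domain E-step is a standard numerical-stability choice of ours, not part of the text.

```python
import numpy as np

def em_step(X, alpha, mu, sigma):
    """One EM iteration for a separable gaussian mixture; eqs. 3.8-3.11."""
    T, M = X.shape
    # E-step (eq. 3.8): responsibilities y[t, j], computed in the log domain.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_g = (np.log(alpha) - np.log(sigma).sum(axis=1))[None, :] \
            - 0.5 * np.sum(diff ** 2, axis=2)
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    y = g / g.sum(axis=1, keepdims=True)
    # M-step (eqs. 3.9-3.11): reestimate from responsibility-weighted statistics.
    Nj = y.sum(axis=0)
    alpha = Nj / T
    mu = (y.T @ X) / Nj[:, None]
    second = (y.T @ X ** 2) / Nj[:, None]
    sigma = np.sqrt(np.maximum(second - mu ** 2, 1e-12))
    return alpha, mu, sigma, y
```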
3.2 Extension to Classification. The EM mixture modeling algorithm is extended to classification problems by modeling the class label as a multinomial variable (Ghahramani & Jordan, 1994). Therefore, each mixture component consists of a gaussian distribution for the "input" features and a multinomial distribution for the "output" class labels. Thus, the classification problem is cast as a density estimation problem, in which the mixture components represent the joint density of the input-output mapping. Specifically, the joint probability that the tth sample has input features x_t and output class k(t) is denoted by

   P(x_t, K = k(t) | Θ) = Σ_{j=1}^{N} α_j P(x_t, K = k(t) | θ_j, λ_j) = Σ_{j=1}^{N} λ_{jk(t)} α_j P(x_t | θ_j),   (3.12)
where the multinomial distribution is parameterized by λ_jk = P(K = k | j; θ_j) and Σ_k λ_jk = 1. This classification algorithm is trained the same way as the gaussian mixture algorithm, except that equation 3.8 becomes

   y_tj^{(p)} = [ λ_{jk(t)}^{(p)} α_j^{(p)} (Π_{i=1}^{M} σ_ji^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_ji^{(p)})/σ_ji^{(p)})² ) ] / [ Σ_{l=1}^{N} λ_{lk(t)}^{(p)} α_l^{(p)} (Π_{i=1}^{M} σ_li^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_li^{(p)})/σ_li^{(p)})² ) ],   (3.13)
and the multinomial parameters are updated via

   λ_jk^{(p+1)} = Σ_{t=1}^{T} y_tj^{(p)} δ[k − k(t)] / Σ_{t=1}^{T} y_tj^{(p)},   (3.14)
where δ[ω] = 1 if ω = 0 and δ[ω] = 0 if ω ≠ 0. During testing, the class label is missing and its expected value is "filled in" to determine the system prediction:

   K = arg max_k Σ_{j=1}^{N} y_tj λ_jk,   (3.15)
where y_tj is computed via equation 3.8. The parameter λ_jk plays an analogous role to GAM's membership function, j ∈ E(k). GAM's equation 2.5 performs the same computation as equation 3.15, provided that λ_jk = 1 if j ∈ E(k) and λ_jk = 0 otherwise.
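A sketch of the supervised variant, equations 3.13 through 3.15; the one-hot encoding and the reuse of the log-domain trick from `em_step` are our own choices.

```python
import numpy as np

def em_classify_step(X, labels, alpha, mu, sigma, lam):
    """Supervised E-step (eq. 3.13) and multinomial update (eq. 3.14).

    labels : (T,) integer class k(t) of each sample; lam : (N, K) multinomials.
    """
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_g = (np.log(alpha) - np.log(sigma).sum(axis=1))[None, :] \
            - 0.5 * np.sum(diff ** 2, axis=2)
    log_g += np.log(lam[:, labels].T + 1e-300)   # weight component j by lambda_{j,k(t)}
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    y = g / g.sum(axis=1, keepdims=True)         # eq. 3.13
    onehot = np.eye(lam.shape[1])[labels]        # delta[k - k(t)] as a (T, K) array
    lam = (y.T @ onehot) / y.sum(axis=0)[:, None]  # eq. 3.14
    return y, lam

def em_predict(y_unsup, lam):
    """Eq. 3.15: K = argmax_k sum_j y_tj lambda_jk, with y_tj from eq. 3.8."""
    return np.argmax(y_unsup @ lam, axis=1)
```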
Note that if each EM component is initialized so that λ_jk = 1 for some k, then λ_jk will never change according to update equation 3.14. Therefore, with this restriction, along with the restriction that vigilance is always zero (ρ ≡ 0), EM becomes a batch-learning version of GAM. Thus, GAM and EM use similar learning equations and obtain a similar final representation: a gaussian mixture model, with mappings from the mixture components to class labels.

However, the learning dynamics of the two algorithms are quite different due to GAM's match tracking operation. The EM algorithm is a variable metric gradient ascent algorithm, in which each step in parameter space is related to the gradient of the log likelihood of the mixture model (Xu & Jordan, 1996). With each step, the likelihood of the I/O density estimate increases until a local maximum is reached. In other words, the parameterization at each step is represented by a point in the parameter space, which has a constant dimensionality. The system is initialized at some point in this parameter space, and the point moves with each training epoch, based on the gradient of the log likelihood, until a local maximum of the likelihood is reached.

GAM's parameters are updated using an incremental approximation to the batch-learning EM algorithm. However, the most important respect in which GAM's learning procedure differs from that of EM is that the former uses predictive feedback via the match tracking process. When errors occur in the I → O mapping, match tracking reduces, by varying amounts, the number of categories that learn and thus restricts the movement of GAM's parameterization to a parameter subspace. Match tracking also causes uncommitted categories to be chosen, which expands the dimensionality of the parameter space. Newly committed categories have small a priori probabilities and large standard deviations, and thus a weak but ubiquitous influence on the gradient.

3.3 Incremental Variants of EM. One of the practical advantages of GAM over the standard EM algorithm described in sections 3.1 and 3.2 is that the former learns incrementally whereas the latter learns using a batch method. However, incremental variants of EM have also been developed. Most notably, Neal and Hinton (1993) showed that EM can incrementally update the model parameters if a separate set of sufficient statistics is stored for each input sample. That is, a separate set of the statistics computed in equations 3.9 through 3.11 and 3.14, corresponding to each input sample, can be saved. In this way, the E-step and M-step can be computed following the presentation of each sample, with the maximization affecting only the statistics associated with the current sample. Because this incremental EM recomputes expectations and maximizations following each input sample, it incorporates new information immediately into its parameter updates and thereby converges more quickly than the standard EM batch algorithm.

Incremental EM illustrates the statistics that need to be maintained in order to ensure monotonic convergence of the model likelihood in an online
setting. However, the need to store separate statistics for each input sample makes incremental EM extremely nonlocal and, moreover, quite impractical for use on large data sets. On the other hand, there also exist incremental approximations of EM that use local learning but do not guarantee monotonic convergence of the model likelihood. For example, Hinton and Nowlan (1990) used an incremental equation for updating variance estimates that is identical to equation 2.14, except that a constant learning rate coefficient was used rather than the decaying term, n_j^{−1}.

GAM differs from standard EM due to both GAM's match tracking procedure and its incremental approximation to EM's learning equations. To make comparisons of GAM and EM on real-world classification problems more informative, the effects of each of these differences should be isolated. Therefore, in the following section GAM is compared to an incremental EM approximation as well as to the standard EM algorithm. Furthermore, in order to isolate the role played by match tracking, we use GAM's learning equations as the incremental EM approximation.

4 Simulations: Comparisons of GAM and EM

4.1 Methodology. All three classification tasks are evaluated using the same procedure. The data sets are normalized to have unit variance in each dimension. EM is tested with several different N. For each setting of N, EM is trained five times with different initializations, and the five test results are averaged. EM's performance often peaks quickly and then declines due to overfitting, particularly when N is large. Therefore, EM's performance is plotted following two training epochs, when its best performance is generally obtained (for large N), and also following equilibration.

GAM uses ρ̄ ≈ 0 (precisely, ρ̄ = 10^{−7M}) for all simulations, and γ is varied. For each setting of γ, GAM is trained five times with different random orderings of the data, and the data order is also scrambled between each training epoch. GAM is trained for 100 epochs, and the test results are plotted after each epoch. As the results below illustrate, GAM often begins with relatively poor performance for the first few training epochs (particularly when γ is large, because it biases the initial category standard deviations), after which performance improves. Performance sometimes peaks and then declines due to overfitting. To convey a full picture of GAM's behavior, GAM's average performance is plotted for each of its 100 training epochs and for each setting of γ.

Several initialization procedures for EM were evaluated and found to produce widely different results. The most successful of these is reported here. As it happens, this procedure initializes EM components in essentially the same way that GAM categories are initialized, except that the former are all initialized prior to training. Specifically, each mixture component is assigned to one of N randomly selected samples, denoted by {x_t, k(t)}_{t=1}^{N},
It is fortuitous that EM’s best initialization procedure corresponds so closely to that of GAM. This makes the comparison between the two algorithms more revealing because it isolates the role played by match tracking. Apart from their batch versus incremental-learning distinction, this EM algorithm operates identically to a GAM network that has a constant ρ ≡ 0. This is because EM equation 3.13, in which “feedback” from the class label directly “activates” the mixture components, is functionally equivalent to the GAM process of resetting all ensemble categories that make a wrong prediction and finally basing learning on the chosen-ensemble activations in equation 2.6 that correspond to the correct prediction. Thus, GAM is set apart only by match tracking, which raises ρ according to equation 2.7 when an incorrect prediction is made.

The contribution of match tracking can be further isolated by removing the batch versus incremental-learning distinction. This is done by using a static-GAM (S-GAM) network, which is identical to GAM except that it has a fixed vigilance (ρ ≡ ρ̄), which prevents S-GAM from committing new categories during training because the baseline vigilance, ρ̄ = 10^{-7M}, is too small to reset any of the committed categories. Therefore, S-GAM needs to be initialized with a set of N categories prior to training. S-GAM is initialized the same way as EM (described above), with each of the N categories assigned to one of N randomly selected training samples and with an ensemble that maps to each of the output classes. If S-GAM makes an incorrect prediction during training, the chosen ensemble is reset, but vigilance is not raised. Because vigilance is never raised, match tracking has no effect on learning other than to ensure that the correct prediction is made before learning occurs. Therefore, S-GAM’s procedure is functionally identical to the EM procedure of directly activating categories based on both the input and the supervised class label.

By including comparisons to S-GAM, therefore, the effects of match tracking alone (GAM versus S-GAM), the effects of incremental learning alone (S-GAM versus EM), and the effects of match tracking and incremental learning together (GAM versus EM) are revealed.
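The functional relationships just described can be summarized in a short control-flow sketch of one training presentation. Equations 2.6 and 2.7 appear earlier in the paper and are not reproduced here, so every helper on the hypothetical net object (match, choose_ensemble, predict, learn, commit_new_category) is a placeholder for the corresponding GAM operation; only the reset-and-raise loop that distinguishes GAM from S-GAM is the point.

    def present(net, x, label, rho_bar, match_tracking=True, eps=1e-9):
        """One training presentation.  GAM raises vigilance after a wrong
        prediction (match tracking); S-GAM keeps rho fixed at rho_bar and
        only resets the offending ensemble.  All net.* methods are
        hypothetical placeholders for the operations defined earlier."""
        rho = rho_bar
        excluded = set()                       # categories reset on this trial
        while True:
            active = [j for j in net.categories
                      if j not in excluded and net.match(j, x) >= rho]
            if not active:                     # nothing satisfies the criterion:
                return net.commit_new_category(x, label)
            ensemble = net.choose_ensemble(active, x)
            if net.predict(ensemble) == label:
                return net.learn(ensemble, x)  # only surviving categories learn
            excluded.update(ensemble)          # wrong prediction: reset ensemble
            if match_tracking:                 # GAM only: raise the criterion
                rho = max(net.match(j, x) for j in ensemble) + eps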
Table 1: Letter Image Classification.

    Algorithm           Error Rate (%)
    k-NN                 4.4
    HAC (a)             19.2
    HAC (b)             18.4
    HAC (c)             17.3
    GAM-CL (γ = 2)       6.0
    GAM-CL (γ = 4)       6.3
    FAM (α = 1.0)        8.1
    FAM (α = 0.1)       13.5

Source: Results are adapted from Frey and Slate (1991) and Williamson (1996a). Notes: k-NN = nearest-neighbor classifier (k = 1). HAC = Holland-style adaptive classifier. Results of three HAC variations are shown here (see Frey & Slate, 1991). GAM-CL = Gaussian ARTMAP with choice learning. FAM = Fuzzy ARTMAP.

4.2 Letter Image Recognition. EM, GAM, and S-GAM are first evaluated on a letter image recognition task developed in Frey and Slate (1991). The data set, which is archived in the UCI machine learning repository (King, 1992), consists of 16-dimensional vectors derived from machine-generated images of alphabetical characters (A to Z). The classification problem is to predict the correct letter from the 16 features. Classification difficulty stems from the fact that the characters are generated from 20 different fonts, are randomly warped, and only simple features such as the
total number of “on” pixels and the size and position of a box around the “on” pixels are used. The data set consists of 20,000 samples, the first 16,000 of which are used for training and the last 4000 for testing. For comparison, the results of several other classifiers are shown in Table 1 (see Frey & Slate, 1991; Williamson, 1996a).

Figure 1 shows the classification results of EM and GAM on the letter image recognition problem. EM’s average error rate is plotted as a function of N (N = 100, 200, . . . , 1000). For each N, the error rate is shown after two training epochs (solid line) and after EM has equilibrated (dashed line). Note that with N < 600, EM performs better following equilibration, but with N > 600, EM performs better following two epochs, and then performance declines, presumably due to overfitting. EM’s best performance (7.4 ± 0.2 percent error) is obtained with N = 1000 following two training epochs.

GAM’s average error rate is plotted for γ = 1, 2, 4. Each point along one of GAM’s error curves corresponds to a different training epoch, with the first epoch plotted at the left-most point of a curve. As training proceeds, the number of categories increases, and the error rate decreases. After a certain point, the error rate may increase again due to overfitting. As γ is raised, the error curves shift to the left. The initial performance becomes progressively worse, and it takes longer for GAM’s performance to peak. However, fewer categories are created, and there is less degradation due to
Figure 1: The average error rates of EM and GAM on the letter image classification problem plotted as a function of the number of categories, N. EM’s error rates are shown after two training epochs and after equilibration. EM is trained with N = 100, 200, . . . , 1000. GAM is trained with γ = 1, 2, 4. GAM’s error rates are plotted after each training epoch. Each of GAM’s error curves corresponds to a different value of γ, with the left-most point on a curve corresponding to the first epoch. From left to right along a curve, each successive point corresponds to a successive training epoch.
overfitting. GAM’s best performance (5.4 ± 0.3 percent error) is obtained with γ = 1 following six training epochs. For all settings of γ, GAM achieves lower error rates than EM for all N. As γ is raised, GAM requires more training epochs to surpass EM. However, EM does achieve reasonably low error rates with N much smaller than the number of categories created by GAM for any value of γ. Figure 1 also suggests a general pattern
Table 2: The Trade-Offs Entailed by the Choice of γ for GAM for Three Classification Problems.

    γ     Number of    Error       Number of     Storage
          Epochs       Rate (%)    Categories    Rate* (%)

    a. Letter image classification
    1       1           7.7        812.8         10.5
    1       6           5.4        941.2         12.1
    2      38           6.0        807.2         10.4
    4     100           6.6        712.4          9.2
    8     100           8.8        593.8          7.7

    b. Satellite image classification
    1       1          12.4        125.0          5.7
    1     100          10.0        255.2         11.7
    2     100          11.2        212.0          9.7
    4     100          12.2        149.2          6.8
    8     100          13.9         93.4          4.3

    c. Spoken vowel classification
    1       1          49.7         72.4         28.8
    2       5          48.7         62.2         24.7
    4       6          44.0         53.8         21.4
    8      19          43.9         46.6         18.5

Notes: The lowest error rates (averaged over five runs) obtained for each of four settings of γ (γ = 1, 2, 4, 8) are shown. The error rate obtained with γ = 1 after only one training epoch is also shown to illustrate GAM’s fast-learning capability.
*The amount of storage used by GAM divided by the amount used by the training set.
in GAM’s performance: in a plot of error rate as a function of number of categories, there exists (roughly) a U-shaped envelope. For different values of γ, GAM approaches and then follows a different portion of that envelope. As the remaining simulations illustrate, however, the relationship between γ and the placement of this envelope varies between different tasks.

To illustrate further the trade-offs entailed by the choice of γ, Table 2a lists the lowest error rate obtained on this problem for different values of γ, along with the number of training epochs used to reach that point and the number of categories created. Finally, the storage rate, which is the amount of storage used by GAM relative to that used by the training set (and hence, by a nearest-neighbor classifier), is listed. For an M-dimensional input, each GAM category stores 2M + 1 values, so the storage rate is calculated as N(2M + 1)/(TM), where T is the number of training samples.
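In code, the storage rate is a one-liner; the example values below are taken from Table 2a and reproduce its entry.

    def storage_rate(N, M, T):
        """GAM storage relative to the training set: N categories of
        2M + 1 values each, versus T training samples of M values each."""
        return N * (2 * M + 1) / (T * M)

    # Letter task, gamma = 1 at its best epoch (Table 2a):
    # storage_rate(941.2, 16, 16000) ~= 0.121, i.e., 12.1 percent.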
Figure 2: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
In Figure 2, the error rates are plotted as a function of the number of training epochs. GAM’s results are shown for its best parameter setting (γ = 1), and the results of EM and S-GAM are shown for similar parameters (γ = 1, N = 1000). GAM quickly achieves its best performance at six epochs, after which performance slowly degrades due to overfitting. EM quickly achieves its best performance at two epochs, after which performance quickly degrades and then stabilizes. S-GAM’s performance improves more slowly than EM’s, but S-GAM eventually obtains lower error rates than EM. For other values of N, the relative performance of EM and S-GAM is similar to that shown in Figure 2. Therefore, on this problem S-GAM’s incremental approximation to EM’s batch learning seems to confer a small advantage. Much of this advantage is probably due to the fact that S-GAM’s standard deviations decrease less than EM’s, due to the lingering effect of γ. Additional simulations have shown that EM suffers less from
overfitting if the shrinkage of its standard deviations is attenuated during learning, although this variation does not appear to improve its best results.

Figures 1 and 2 show that match tracking causes GAM to construct an appropriate number of categories to support the I → O mapping and to learn the mapping more accurately than EM or S-GAM. However, these results do not indicate why this is the case. There are two possible explanations for why match tracking gives GAM an advantage: (1) match tracking causes GAM to obtain a better estimate of the I/O density (i.e., to obtain a mixture model with a higher likelihood), or (2) match tracking biases GAM’s density estimate so as to reduce its predictive error in the I → O mapping. The results shown in Figures 3 and 4 indicate that the latter explanation is correct.

Figure 3 shows the error rate over 25 training epochs for a single run of GAM, on which GAM created 985 categories: the error rate on the training set (top) and the test set (bottom). Figure 3 also shows the corresponding error rates for EM and S-GAM initialized with the same number (N = 985) of categories as were created by GAM. On both the training set and the test set, GAM obtains much lower error rates than EM and S-GAM. Next, Figure 4 shows the log likelihoods of the mixture models formed by EM, GAM, and S-GAM. EM obtains a much higher log likelihood than GAM and S-GAM on both the training set and the test set, whereas the log likelihoods obtained by GAM and S-GAM are similar. Therefore, GAM outperforms EM and S-GAM because match tracking biases its density estimate such that the predictive error in its I → O mapping is reduced, despite the fact that GAM obtains an estimate of the I/O density with a lower likelihood than that of EM.

4.3 Landsat Satellite Image Segmentation. EM, GAM, and S-GAM are evaluated on a real-world task: segmentation of a Landsat satellite image (Feng, Sutherland, King, Muggleton, & Henery, 1993). The data set, which is archived in the UCI machine learning repository (King, 1992), consists of multispectral values within nonoverlapping 3 × 3 pixel neighborhoods in an image obtained from a Landsat satellite multispectral scanner. At each pixel are four values, corresponding to four spectral bands. Two of these are in the visible region (corresponding approximately to the green and red regions of the visible spectrum), and two are in the (near) infrared. The input space is thus 36-dimensional (9 pixels and 4 values per pixel). The spatial resolution of a pixel is about 80 m × 80 m. The center pixel of each neighborhood is associated with one of six vegetation classes: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil. The data set is partitioned into 4435 samples for training and 2000 samples for testing. For comparison, the results of several other classifiers are shown in Table 3 (see Feng et al., 1993; Asfour, Carpenter, & Grossberg, 1995).

Figure 5 shows the classification results of EM and GAM on the satellite image segmentation problem. EM’s error rate is plotted for N = 25, 50, . . . ,
Figure 3: Error rates of EM, GAM, and S-GAM on the training set (top) and the test set (bottom) plotted as a function of the number of training epochs.
Figure 4: Log likelihoods of EM, GAM, and S-GAM on the training set (top) and test set (bottom) plotted as a function of the number of training epochs.
Table 3: Satellite Image Classification.

    Algorithm    Error Rate (%)
    k-NN         10.6
    FAM          11.0
    RBF          12.1
    Alloc80      13.2
    INDCART      13.7
    CART         13.8
    MLP          13.9
    NewID        15.0
    C4.5         15.1
    CN2          15.2
    Quadra       15.3
    SMART        15.9
    LogReg       16.9
    Discrim      17.1
    CASTLE       19.4

Source: Results adapted from Feng et al. (1993) and Asfour et al. (1995). Note: The k-NN result reported here, which we obtained, is different from the k-NN result reported in Feng et al. (1993).
275, following two epochs and following equilibration. For N < 100, EM performs better after equilibration, whereas for N > 100, EM performs better after two epochs. EM’s best results (10.6 ± 0.4 percent error) are obtained with N = 275. GAM’s error rate is plotted for γ = 1, 2, 4. Once again, GAM achieves the lowest overall error rate (10.0 ± 0.4 percent error), although on this problem GAM outperforms EM only with γ = 1. It is difficult to distinguish GAM’s error curves because they overlap each other so much. Unlike the letter image recognition results, none of GAM’s error curves turns up due to overtraining. Therefore, all the curves appear to be on the left side of our hypothetical U-shaped envelope. Table 2b reports the lowest error rate for different values of γ, along with the relevant statistics: number of training epochs, number of categories, and storage rate.

Figure 6 plots the error rates as a function of the number of training epochs for GAM (γ = 1), EM (γ = 1, N = 250), and S-GAM (γ = 1, N = 250). GAM’s performance generally improves throughout, whereas EM’s performance again peaks at two epochs and then quickly degrades and stabilizes. On this problem EM outperforms S-GAM.
Figure 5: Average error rates of EM and GAM plotted as a function of the number of categories.
4.4 Speaker-Independent Vowel Recognition. Finally, EM, GAM, and S-GAM are evaluated on another real-world task: speaker-independent vowel recognition (Deterding, 1989). The data set is archived in the CMU connectionist benchmark collection (Fahlman, 1993). The data were collected by Deterding (1989), who recorded examples of the 11 steady-state vowels of English spoken by 15 speakers. A word containing each vowel was spoken once by each of the 15 speakers (7 females and 8 males). The speech signals were low-pass filtered at 4.7 kHz and then digitized to 12 bits at a 10-kHz sampling rate. Twelfth-order linear predictive analysis was carried out on six 512-sample Hamming-windowed segments from the steady part of the vowel. The reflection coefficients were used to calculate 10 log area parameters, giving a 10-dimensional input space. Each speaker thus
Figure 6: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
yielded six samples of speech from the 11 vowels, resulting in 990 samples from the 15 speakers. The data are partitioned into 528 samples for training, from four male and four female speakers, and 462 samples for testing, from the remaining four male and three female speakers. For comparison, the results of several other classifiers are shown in Table 4 (see Robinson, 1989; Fritzke, 1994; Williamson, 1996a).

Figure 7 shows the classification results of EM and GAM on the vowel recognition problem. EM’s error rate is plotted for N = 15, 20, . . . , 70, following two epochs and following equilibration. For all N, EM obtains better results after two epochs than after equilibration. EM’s best results (45.4 ± 1.1 percent error) are obtained with N = 40. GAM’s error rate is plotted for γ = 2, 4, 8. By a small margin, GAM achieves the lowest overall
Table 4: Spoken Vowel Classification.

    Algorithm          Error Rate (%)
    k-NN               43.7
    MLP                49.4
    MKM                57.4
    RBF                52.4
    GNN                46.5
    SNN                45.2
    3-D GCS            36.8
    5-D GCS            33.5
    GAM-CL (γ = 2)     43.3
    GAM-CL (γ = 4)     41.8
    FAM (α = 1.0)      48.9
    FAM (α = 0.1)      50.4

Source: Results adapted from Robinson (1989), Fritzke (1994), and Williamson (1996a). Notes: Gradient descent networks (using 88 internal nodes) are: MLP (multilayer perceptron), MKM (modified Kanerva model), RBF (radial basis function), GNN (gaussian node network), and SNN (square node network). Constructive networks are: GCS (growing cell structures), GAM-CL, and FAM.
error rate (43.9 ± 1.4 percent error), but it outperforms EM only with γ = 4 and γ = 8. The data set is quite small, and hence both algorithms have a strong tendency to overfit the data. Table 2c reports the lowest error rate for different values of γ, along with the relevant statistics: the number of training epochs, number of categories, and storage rate.

Figure 8 plots the error rates as a function of the number of training epochs for GAM (γ = 3), EM (γ = 3, N = 40), and S-GAM (γ = 3, N = 40). GAM’s performance peaks at 19 epochs and then slowly degrades. EM’s performance peaks at 2 epochs and then degrades quickly and severely before stabilizing. S-GAM’s error rate, which stabilizes near 50 percent, never reaches EM’s best performance.

5 Conclusions

GAM learns mappings from a real-valued space of input features to a discrete-valued space of output classes by learning a gaussian mixture model of the input space as well as connections from the mixture components to the output classes. The mixture components correspond to nodes in GAM’s internal category layer. GAM is a simple neural architecture that employs constructive, incremental, local learning rules.
Figure 7: Average error rates of EM and GAM plotted as a function of the number of categories.
These learning rules allow GAM to create a representation of appropriate size as it is trained online. We have shown a close relationship between GAM and an EM algorithm that estimates the joint I/O density by optimizing a gaussian-multinomial mixture model. The major difference between GAM and EM is GAM’s match tracking procedure, which raises a match criterion following incorrect predictions and prevents nodes from learning if they do not satisfy the raised match criterion. Match tracking biases GAM’s estimate of the joint I/O density such that GAM’s predictive error in the I → O map is reduced. With this biased density estimate, GAM outperforms EM on classification benchmarks
Figure 8: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
despite the fact that GAM uses a suboptimal incremental approximation to EM’s batch learning rules and learns mixture models that have lower likelihoods than those optimized by EM.

Acknowledgments

This work was supported in part by the Office of Naval Research (ONR N00014-95-1-0409).

References

Asfour, Y., Carpenter, G. A., & Grossberg, S. (1995). Landsat satellite image segmentation using the fuzzy ARTMAP neural network (Tech. Rep. No. CAS/CNS
TR-95-004). Boston: Boston University.

Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.

Carpenter, G. A., Grossberg, S., & Reynolds, J. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.

Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759–771.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Deterding, D. H. (1989). Speaker normalisation for automatic speech recognition. Unpublished doctoral dissertation, University of Cambridge.

Fahlman, S. E. (1993). CMU benchmark collection for neural net learning algorithms [Machine-readable data repository]. Pittsburgh: Carnegie Mellon University, School of Computer Science.

Feng, C., Sutherland, A., King, S., Muggleton, S., & Henery, R. (1993). Symbolic classifiers: Conditions to have good accuracy performance. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics (pp. 371–380). Waterloo, Ontario: University of Waterloo.

Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161–182.

Fritzke, B. (1994). Growing cell structures—A self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1441–1460.

Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems, 6. San Mateo, CA: Morgan Kaufmann.

Grossberg, S. (1976). Adaptive pattern classification and universal recoding. I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.

Grossberg, S., & Williamson, J. (1997). A self-organizing neural system for learning to recognize textured scenes (Tech. Rep. No. CAS/CNS TR-97-001). Boston, MA: Boston University.

Hinton, G. E., & Nowlan, S. J. (1990). The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Computation, 2, 355–362.

King, R. (1992). Statlog databases. UCI Repository of machine learning databases [Machine-readable repository at ics.uci.edu:/pub/machine-learning-databases].

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript.

Poggio, T., & Girosi, F. (1989). A theory of networks for approximation and learning (A.I. Memo No. 1140). Cambridge, MA: MIT.

Robinson, A. J. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Cambridge University.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing (pp. 318–362). Cambridge, MA: MIT Press.

Williamson, J. R. (1996a). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9, 881–897.

Williamson, J. R. (1996b). Neural networks for image processing, classification, and understanding. Unpublished doctoral dissertation, Boston University.

Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.
Received July 9, 1996; accepted January 29, 1997.
Communicated by Shimon Ullman
Shape Quantization and Recognition with Randomized Trees

Yali Amit
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.
Donald Geman
Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA 01003, U.S.A.
We explore a new approach to shape recognition based on a virtually infinite family of binary features (queries) of the image data, designed to accommodate prior information about shape invariance and regularity. Each query corresponds to a spatial arrangement of several local topographic codes (or tags), which are in themselves too primitive and common to be informative about shape. All the discriminating power derives from relative angles and distances among the tags. The important attributes of the queries are a natural partial ordering corresponding to increasing structure and complexity; semi-invariance, meaning that most shapes of a given class will answer the same way to two queries that are successive in the ordering; and stability, since the queries are not based on distinguished points and substructures. No classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed. Due to the number and nature of the queries, standard decision tree construction based on a fixed-length feature vector is not feasible. Instead we entertain only a small random sample of queries at each node, constrain their complexity to increase with tree depth, and grow multiple trees. The terminal nodes are labeled by estimates of the corresponding posterior distribution over shape classes. An image is classified by sending it down every tree and aggregating the resulting distributions. The method is applied to classifying handwritten digits and synthetic linear and nonlinear deformations of three hundred LATEX symbols. State-of-the-art error rates are achieved on the National Institute of Standards and Technology database of digits. The principal goal of the experiments on LATEX symbols is to analyze invariance, generalization error, and related issues, and a comparison with artificial neural network methods is presented in this context.

Neural Computation 9, 1545–1588 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: LATEX symbols.
1 Introduction

We explore a new approach to shape recognition based on the joint induction of shape features and tree classifiers. The data are binary images of two-dimensional shapes of varying sizes. The number of shape classes may reach into the hundreds (see Figure 1), and there may be considerable within-class variation, as with handwritten digits. The fundamental problem is how to design a practical classification algorithm that incorporates the prior knowledge that the shape classes remain invariant under certain transformations. The proposed framework is analyzed within the context of invariance, generalization error, and other methods based on inductive learning, principally artificial neural networks (ANN).

Classification is based on a large, in fact virtually infinite, family of binary
features of the image data that are constructed from local topographic codes (“tags”). A large sample of small subimages of fixed size is recursively partitioned based on individual pixel values. The tags are simply labels for the cells of each successive partition, and each pixel in the image is assigned all the labels of the subimage centered there. As a result, the tags do not involve detecting distinguished points along curves, special topological structures, or any other complex attributes whose very definition can be problematic due to locally ambiguous data. In fact, the tags are too primitive and numerous to classify the shapes.

Although the mere existence of a tag conveys very little information, one can begin discriminating among shape classes by investigating just a few spatial relationships among the tags, for example, asking whether there is a tag of one type “north” of a tag of another type. Relationships are specified by coarse constraints on the angles of the vectors connecting pairs of tags and on the relative distances among triples of tags. No absolute location or scale constraints are involved. An image may contain one or more instances of an arrangement, with significant variations in location, distances, angles, and so forth. There is one binary feature (“query”) for each such spatial arrangement; the response is positive if a collection of tags consistent with the associated constraints is present anywhere in the image. Hence a query involves an extensive disjunction (ORing) operation.

Two images that answer the same to every query must have very similar shapes. In fact, it is reasonable to assume that the shape class is determined by the full feature set; that is, the theoretical Bayes error rate is zero. But no classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers (Breiman, Friedman, Olshen, & Stone, 1984; Casey & Nagy, 1984; Quinlan, 1986) at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed.

There is a natural partial ordering on the queries that results from regarding each tag arrangement as a labeled graph, with vertex labels corresponding to the tag types and edge labels to angle and distance constraints (see Figures 6 and 7). In this way, the features are ordered according to increasing structure and complexity. A related attribute is semi-invariance, which means that a large fraction of those images of a given class that answer the same way to a given query will also answer the same way to any query immediately succeeding it in the ordering. This leads to nearly invariant classification with respect to many of the transformations that preserve shape, such as scaling, translation, skew, and small, nonlinear deformations of the type shown in Figure 2.

Due to the partial ordering, tree construction with an infinite-dimensional feature set is computationally efficient. During training, multiple trees (Breiman, 1994; Dietterich & Bakiri, 1995; Shlien, 1990) are grown, and a
Figure 2: (Top) Perturbed LATEX symbols. (Bottom) Training data for one symbol.
form of randomization is used to reduce the statistical dependence from tree to tree; weak dependence is verified experimentally. Simple queries are used at the top of the trees, and the complexity of the queries increases with tree depth. In this way semi-invariance is exploited, and the space of shapes is systematically explored by calculating only a tiny fraction of the answers.

Each tree is regarded as a random variable on image space whose values are the terminal nodes. In order to recognize shapes, each terminal node of each tree is labeled by an estimate of the conditional distribution over the shape classes given that an image reaches that terminal node. The estimates are simply relative frequencies based on training data and require no optimization. A new data point is classified by dropping it down each of the trees, averaging over the resulting terminal distributions, and taking the mode of this aggregate distribution. Due to averaging and weak dependence, considerable errors in these estimates can be tolerated. Moreover, since tree growing (i.e., question selection) and parameter estimation can be separated, the estimates can be refined indefinitely without reconstructing
the trees, simply by updating a counter in each tree for each new data point. The separation between tree making and parameter estimation, and the possibility of using different training samples for each phase, opens the way to selecting the queries based on either unlabeled samples (i.e., unsupervised learning) or samples from only some of the shape classes. Both of these perform surprisingly well compared with ordinary supervised learning.

Our recognition strategy differs from those based on true invariants (algebraic, differential, etc.) or structural features (holes, endings, etc.). These methods certainly introduce prior knowledge about shape and structure, and we share that emphasis. However, invariant features usually require image normalization or boundary extraction, or both, and are generally sensitive to shape distortion and image degradation. Similarly, structural features can be difficult to express as well-defined functions of the image (as opposed to model) data. In contrast, our queries are stable and primitive, precisely because they are not truly invariant and are not based on distinguished points or substructures.

A popular approach to multiclass learning problems in pattern recognition is based on ANNs, such as feedforward, multilayer perceptrons (Dietterich & Bakiri, 1995; Fukushima & Miyake, 1982; Knerr, Personnaz, & Dreyfus, 1992; Martin & Pitman, 1991). For example, the best rates on handwritten digits are reported in LeCun et al. (1990). Classification trees and neural networks certainly have aspects in common; for example, both rely on training data, are fast online, and require little storage (see Brown, Corruble, & Pittard, 1993; Gelfand & Delp, 1991). However, our approach to invariance and generalization is, by comparison, more direct in that certain properties are acquired by hardwiring rather than depending on learning or image normalization. With ANNs, the emphasis is on parallel and local processing and a limited degree of disjunction, in large part due to assumptions regarding the operation of the visual system. However, only a limited degree of invariance can be achieved with such models. In contrast, the features here involve extensive disjunction and more global processing, thus achieving a greater degree of invariance. This comparison is pursued in section 12.

The article is organized as follows. Other approaches to invariant shape recognition are reviewed in section 2; synthesized random deformations of 293 basic LATEX symbols (see Figures 1 and 2) provide a controlled experimental setting for an empirical analysis of invariance in a high-dimensional shape space. The basic building blocks of the algorithm, namely the tags and the tag arrangements, are described in section 3. In section 4 we address the fundamental question of how to exploit the discriminating power of the feature set; we attempt to motivate the use of multiple decision trees in the context of the ideal Bayes classifier and the trade-off between approximation error and estimation error. In section 5 we explain the roles
of the partial ordering and randomization for both supervised and unsupervised tree construction; we also discuss and quantify semi-invariance. Multiple decision trees and the full classification algorithm are presented in section 6, together with an analysis of the dependence on the training set. In section 7 we calculate some rough performance bounds, for both individual and multiple trees. Generalization experiments, where the training and test samples represent different populations, are presented in section 8, and incremental learning is addressed in section 9. Fast indexing, another possible role for shape quantization, is considered in section 10. We then apply the method in section 11 to a real problem—classifying handwritten digits—using the National Institute of Standards and Technology (NIST) database for training and testing, achieving state-of-the-art error rates. In section 12 we develop the comparison with ANNs in terms of invariance, generalization error, and connections to observed functions in the visual system. We conclude in section 13 by assessing extensions to other visual recognition problems.
2 Invariant Recognition

Invariance is perhaps the fundamental issue in shape recognition, at least for isolated shapes. Some basic approaches are reviewed within the following framework. Let X denote a space of digital images, and let C denote a set of shape classes. Let us assume that each image x ∈ X has a true class label Y(x) ∈ C = {1, 2, . . . , K}. Of course, we cannot directly observe Y. In addition, there is a probability distribution P on X. Our goal is to construct a classifier Ŷ: X → C such that P(Ŷ ≠ Y) is small.

In the literature on statistical pattern recognition, it is common to address some variation by preprocessing or normalization. Given x, and before estimating the shape class, one estimates a transformation ψ such that ψ(x) represents a standardized image. Finding ψ involves a sequence of procedures that brings all images to the same size and then corrects for translation, slant, and rotation by one of a variety of methods. There may also be some morphological operations to standardize stroke thickness (Bottou et al., 1994; Hastie, Buja, & Tibshirani, 1995). The resulting image is then classified by one of the standard procedures (discriminant analysis, multilayer neural network, nearest neighbors, etc.), in some cases essentially ignoring the global spatial properties of shape classes. Difficulties in generalization are often encountered because the normalization is not robust or does not accommodate nonlinear deformations. This deficiency can be ameliorated only with very large training sets (see the discussions in Hussain & Kabuka, 1994; Raudys & Jain, 1991; Simard, LeCun, & Denker, 1994; Werbos, 1991, in the context of neural networks). Still, it is clear that robust normalization
methods which reduce variability and yet preserve information can lead to improved performance of any classifier; we shall see an example of this in regard to slant correction for handwritten digits.

Template matching is another approach. One estimates a transformation from x for each of the prototypes in the library. Classification is then based on the collection of estimated transformations. This requires explicit modeling of the prototypes and extensive computation at the estimation stage (usually involving relaxation methods) and appears impractical with large numbers of shape classes.

A third approach, closer in spirit to ours, is to search for invariant functions Φ(x), meaning that P(Φ(x) = φc | Y = c) = 1 for some constants φc, c = 1, . . . , K. The discriminating power of Φ depends on the extent to which the values φc are distinct. Many invariants for planar objects (based on single views) and nonplanar objects (based on multiple views) have been discovered and proposed for recognition (see Reiss, 1993, and the references therein). Some invariants are based on Fourier descriptors and image moments; for example, the magnitude of Zernike moments (Khotanzad & Lu, 1991) is invariant to rotation. Most invariants require computing tangents from estimates of the shape boundaries (Forsyth et al., 1991; Sabourin & Mitiche, 1992). Examples of such invariants include inflexions and discontinuities in curvature. In general, the mathematical level of this work is advanced, borrowing ideas from projective, algebraic, and differential geometry (Mundy & Zisserman, 1992). Other successful treatments of invariance include geometric hashing (Lamdan, Schwartz, & Wolfson, 1988) and nearest-neighbor classifiers based on affine invariant metrics (Simard et al., 1994). Similarly, structural features involving topological shape attributes (such as junctions, endings, and loops) or distinguished boundary points (such as points of high curvature) have some invariance properties, and many authors (e.g., Lee, Srihari, & Gaborski, 1991) report much better results with such features than with standardized raw data.

In our view, true invariant features of the form above might not be sufficiently stable for intensity-based recognition because the data structures are often too crude to analyze with continuum-based methods. In particular, such features are not invariant to nonlinear deformations and depend heavily on preprocessing steps such as normalization and boundary extraction. Unless the data are of very high quality, these steps may result in a lack of robustness to distortions of the shapes, due, for example, to digitization, noise, blur, and other degrading factors (see the discussion in Reiss, 1993). Structural features are difficult to model and to extract from the data in a stable fashion. Indeed, it may be more difficult to recognize a “hole” than to recognize an “8.” (Similar doubts about hand-crafted features and distinguished points are expressed in Jung & Nagy, 1995.) In addition, if one could recognize the components of objects without recognizing the objects themselves, then the choice of classifier would likely be secondary.
Our features are not invariant. However, they are semi-invariant in an appropriate sense and might be regarded as coarse substitutes for some of the true geometric, point-based invariants in the literature already cited. In this sense, we share at least the outlook expressed in recent, model-based work on quasi-invariants (Binford & Levitt, 1993; Burns, Weiss, & Riseman, 1993), where strict invariance is relaxed; however, the functionals we compute are entirely different.

The invariance properties of the queries are related to the partial ordering and the manner in which they are selected during recursive partitioning. Roughly speaking, the complexity of the queries is proportional to the depth in the tree, that is, to the number of questions asked. For elementary queries at the bottom of the ordering, we would expect that for each class c, either P(Q = 1 | Y = c) ≫ 0.5 or P(Q = 0 | Y = c) ≫ 0.5; however, this collection of elementary queries would have low discriminatory power. (These statements will be amplified later on.) Queries higher up in the ordering have much higher discriminatory power and maintain semi-invariance relative to subpopulations determined by the answers to queries preceding them in the ordering. Thus if Q̃ is a query immediately preceding Q in the ordering, then P(Q = 1 | Q̃ = 1, Y = c) ≫ 0.5 or P(Q = 0 | Q̃ = 1, Y = c) ≫ 0.5 for each class c. This will be defined more precisely in section 5 and verified empirically.

Experiments on invariant recognition are scattered throughout the article. Some involve real data: handwritten digits. Most employ synthetic data, in which case the data model involves a prototype x*_c for each shape class c ∈ C (see Figure 1) together with a space Θ of image-to-image transformations. We assume that the class label of the prototype is preserved under all transformations in Θ, namely, c = Y(θ(x*_c)) for all θ ∈ Θ, and that no two distinct prototypes can be transformed to the same image. We use “transformations” in a rather broad sense, referring both to affine maps, which alter the pose of the shapes, and to nonlinear maps, which deform the shapes. (We shall use degradation for noise, blur, etc.) Basically, Θ consists of perturbations of the identity. In particular, we are not considering the entire pose space but rather only perturbations of a reference pose, corresponding to the identity.

The probability measure P on X is derived from a probability measure ν(dθ) on the space of transformations as follows: for any D ⊂ X,

    P(D) = Σ_c P(D | Y = c) π(c) = Σ_c ν{θ : θ(x*_c) ∈ D} π(c),
where π is a prior distribution on C, which we will always take to be uniform. Thus, P is concentrated on the space of images {θ(x*_c)}_{θ,c}. Needless to say, the situation is more complex in many actual visual recognition problems, for example, in unrestricted 3D object recognition under standard projection models. Still, invariance is already challenging in the above context.
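A sketch of drawing one labeled training image from P follows: a class under the uniform prior π, then θ ~ ν. The parameter ranges are those quoted in the following paragraph; the sketch realizes only the linear part of θ (scale, rotation, skew) with scipy, and for the nonlinear part it merely samples coefficients, whose application as a deformation field is omitted. All function names are ours.

    import numpy as np
    from scipy.ndimage import affine_transform

    def sample_theta(rng):
        """theta ~ nu: log scale, rotation angle, and log skew ratio drawn
        uniformly over the ranges given in the text, plus gaussian
        coefficients of a low-frequency trigonometric series for the
        smooth deformation field (its application is omitted here)."""
        return {"log_scale": rng.uniform(-1 / 6, 1 / 6),
                "rot_deg": rng.uniform(-10, 10),
                "log_skew": rng.uniform(-1 / 3, 1 / 3),
                "deform_coeffs": rng.normal(size=(2, 3, 3))}

    def apply_linear(img, theta):
        """Linear part of theta, applied about the image center."""
        s, a = np.exp(theta["log_scale"]), np.deg2rad(theta["rot_deg"])
        r = np.exp(theta["log_skew"] / 2)      # axis ratio = exp(log_skew)
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        A = R @ np.diag([s * r, s / r])
        c = (np.array(img.shape) - 1) / 2.0
        Ainv = np.linalg.inv(A)
        return affine_transform(img.astype(float), Ainv,
                                offset=c - Ainv @ c, order=1)

    def sample_image(prototypes, rng):
        """(x, y) ~ P: class c under the uniform prior, then theta(x*_c)."""
        c = int(rng.integers(len(prototypes)))
        return apply_linear(prototypes[c], sample_theta(rng)), c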
It is important to emphasize that this model is not used explicitly in the classification algorithm. Knowledge of the prototypes is not assumed, nor is θ estimated as in template approaches. The purpose of the model is to generate samples for training and testing. The images in Figure 2 were made by random sampling from a particular distribution ν on a space Θ containing both linear (scale, rotation, skew) and nonlinear transformations. Specifically, the log scale is drawn uniformly between −1/6 and 1/6; the rotation angle is drawn uniformly from ±10 degrees; and the log ratio of the axes in the skew is drawn uniformly from −1/3 to +1/3. The nonlinear part is a smooth, random deformation field constructed by creating independent, random horizontal and vertical displacements, each of which is generated by a random trigonometric series with only low-frequency terms and gaussian coefficients. All images are 32 × 32, but the actual size of the object in the image varies significantly, both from symbol to symbol and within symbol classes, due to random scaling.

3 Shape Queries

We first illustrate a shape query in the context of curves and tangents in an idealized, continuum setting. The example is purely motivational. In practice we are not dealing with one-dimensional curves in the continuum but rather with a finite pixel lattice, strokes of variable width, corrupted data, and so forth. The types of queries we use are described in sections 3.1 and 3.2.

Observe the three versions of the digit “3” in Figure 3 (left); they are obtained by spline interpolation of the center points of the segments shown in Figure 3 (middle) in such a way that the segments represent the direction of the tangent at those points. All three segment arrangements satisfy the geometric relations indicated in Figure 3 (right): there is a vertical tangent northeast of a horizontal tangent, which is south of another horizontal tangent, and so forth. The directional relations between the points are satisfied to within rather coarse tolerances. Not all curves of a “3” contain five points whose tangents satisfy all these relations. Put differently, some “3”s answer “no” to the query, “Is there a vertical tangent northeast of a . . . ?” However, rather substantial transformations of each of the versions below will answer “yes.” Moreover, among “3”s that answer “no,” it is possible to choose a small number of alternative arrangements in such a way that the entire space of “3”s is covered.

3.1 Tags. We employ primitive local features called tags, which provide a coarse description of the local topography of the intensity surface in the neighborhood of a pixel. Instead of trying to manually characterize local configurations of interest—for example, trying to define local operators to identify gradients in the various directions—we adopt an information-theoretic approach and “code” a microworld of subimages by a process very similar to tree-structured vector quantization. In this way we sidestep the
Figure 3: (Left) Three curves corresponding to the digit “3.” (Middle) Three tangent configurations determining these shapes via spline interpolation. (Right) Graphical description of relations between locations of derivatives consistent with all three configurations.
issues of boundary detection and gradients in the discrete world and allow for other forms of local topographies. This approach has been extended to gray-level images in Jedynak and Fleuret (1996).

The basic idea is to reassign symbolic values to each pixel based on examining a few pixels in its immediate vicinity; the symbolic values are the tag types and represent labels for the local topography. The neighborhood we choose is the 4 × 4 subimage containing the pixel at the upper left corner. We cluster the subimages with binary splits corresponding to adaptively choosing the five most informative locations of the sixteen sites of the subimage. Note that the size of the subimages used must depend on the resolution at which the shapes are imaged. The 4 × 4 subimages are appropriate for a certain range of resolutions—roughly 10 × 10 through 70 × 70 in our experience. The size must be adjusted for higher-resolution data, and the ultimate performance of the classifier will suffer if the resolution of the test data is not approximately the same as that of the training data. The best approach would be one that is multiresolution, something we have not done in this article (except for some preliminary experiments in section 11) but which is carried out in Jedynak and Fleuret (1996) in the context of gray-level images and 3D objects.

A large sample of 4 × 4 subimages is randomly extracted from the training data. The corresponding shape classes are irrelevant and are not retained. The reason is that the purpose of the sample is to provide a representative database of microimages and to discover the biases at that scale; the statistics of that world is largely independent of global image attributes, such as symbolic labels. This family of subimages is then recursively partitioned with binary splits. There are 4 × 4 = 16 possible questions: “Is site (i, j) black?” for i, j = 1, 2, 3, 4. The criterion for choosing a question at a node
t is dividing the subimages Ut at the node as equally as possible into two groups. This corresponds to reducing as much as possible the entropy of the empirical distribution on the 2^16 possible binary configurations for the sample Ut. There is a tag type for each node of the resulting tree, except for the root. Thus, if three questions are asked, there are 2 + 4 + 8 = 14 tags, and if five questions are asked, there are 62 tags. Depth 5 tags correspond to a more detailed description of the local topography than depth 3 tags, although eleven of the sixteen pixels still remain unexamined. Observe also that tags corresponding to internal nodes of the tree represent unions of those associated with deeper ones. At each pixel, we assign all the tags encountered by the corresponding 4 × 4 subimage as it proceeds down the tree. Unless otherwise stated, all experiments below use 62 tags.

At the first level, every site splits the population with nearly the same frequencies. However, at the second level, some sites are more informative than others, and by levels 4 and 5, there is usually one site that partitions the remaining subpopulation much better than all others. In this way, the world of microimages is efficiently coded. For efficiency, the population is restricted to subimages containing at least one black and one white site within the center four, which then obviously concentrates the processing in the neighborhood of boundaries. In the gray-level context it is also useful to consider more general tags, allowing, for example, for variations on the concept of local homogeneity.

The first three levels of the tree are shown in Figure 4, together with the most common configuration found at each of the eight level 3 nodes. Notice that the level 1 tag alone (i.e., the first bit in the code) determines the original image, so this “transform” is invertible and redundant. In Figure 5 we show all the two-bit tags and three-bit tags appearing in an image.

3.2 Tag Arrangements. The queries involve geometric arrangements of the tags. A query QA asks whether a specific geometric arrangement A of tags of certain types is present (QA(x) = 1) or is not present (QA(x) = 0) in the image. Figure 6 shows several LATEX symbols that contain a specific geometric arrangement of tags: tag 16 northeast of tag 53, which is northwest of tag 19. Notice that there are no fixed locations in this description, whereas the tags in any specific image do carry locations. “Present in the image” means there is at least one set of tags in x of the prescribed types whose locations satisfy the indicated relationships. In Figure 6, notice, for example, how different instances of the digit “0” still contain the arrangement.

Tag 16 is a depth 4 tag; the corresponding four questions in the subimage are indicated by the following mask:

    n n n 1
    0 n n n
    n 0 0 n
    n n n n
1556
Yali Amit and Donald Geman
Figure 4: First three tag levels with most common configurations.
Figure 5: (Top) All instances of the four two-bit tags. (Bottom) All instances of the eight three-bit tags.
where 0 corresponds to background, 1 to object, and n to “not asked.” These neighborhoods are loosely described by “background to lower left, object to upper right.” Similar interpretations can be made for tags 53 and 19. Restricted to the first ten symbol classes (the ten digits), the conditional distribution P(Y = c|QA = 1) on classes given the existence of this arrangement in the image is given in Table 1. Already this simple query contains significant information about shape.
Figure 6: (Top) Instances of a geometric arrangement in several “0”s. (Bottom) Several instances of the geometric arrangement in one “6.”

Table 1: Conditional Distribution on Digit Classes Given the Arrangement of Figure 6.

    Digit:        0     1      2     3     4     5     6     7    8     9
    Probability:  .13   .003   .03   .08   .04   .07   .23   0    .26   .16
To complete the construction of the feature set, we need to define a set of allowable relationships among image locations. These are binary functions of pairs, triples, and so forth of planar points, which depend on only their relative coordinates. An arrangement A is then a labeled (hyper)graph. Each vertex is labeled with a type of tag, and each edge (or superedge) is labeled with a type of relation. The graph in Figure 6, for example, has only binary relations. In fact, all the experiments on the LATEX symbols are restricted to this setting. The experiments on handwritten digits also use a ternary relationship of the metric type.

There are eight binary relations between any two locations u and v, corresponding to the eight compass headings (north, northeast, east, etc.). For example, u is “north” of v if the angle of the vector u − v is between π/4 and 3π/4. More generally, the two points satisfy relation k (k = 1, . . . , 8) if the angle of the vector u − v is within π/4 of kπ/4. Let A denote the set of all possible arrangements, and let Q = {QA : A ∈ A}, our feature set.

There are many other binary and ternary relations that have discriminating power. For example, there is an entire family of “metric” relationships that are, like the directional relationships above, completely scale and translation invariant. Given points u, v, w, z, one example of a ternary relation is ‖u − v‖ < ‖u − w‖, which inquires whether u is closer to v than to w. With four points we might ask if ‖u − v‖ < ‖w − z‖.
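A sketch of the directional relations and of the disjunctive query evaluation follows. Coordinates are taken here as (x, y) with the vertical axis pointing up, so that k = 2 is “north”; the paper does not state its coordinate convention, and the brute-force search over tag placements is adequate only for small arrangements.

    import numpy as np
    from itertools import product

    def relation(u, v, k):
        """True if the angle of u - v is within pi/4 of k * pi/4 (k = 1..8)."""
        ang = np.arctan2(u[1] - v[1], u[0] - v[0])
        diff = (ang - k * np.pi / 4 + np.pi) % (2 * np.pi) - np.pi
        return abs(diff) < np.pi / 4

    def query(tag_locations, vertices, edges):
        """Q_A(x): 1 if some placement of arrangement A is present.
        vertices: list of tag types; edges: (i, j, k) directional
        constraints; tag_locations[t]: locations of tag type t in the
        image.  The outer loop is the extensive ORing described above."""
        pools = [tag_locations.get(t, []) for t in vertices]
        for placement in product(*pools):
            pts = [np.asarray(p, dtype=float) for p in placement]
            if all(relation(pts[i], pts[j], k) for i, j, k in edges):
                return 1
        return 0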
4 The Posterior Distribution and Tree-Based Approximations

For simplicity, and in order to facilitate comparisons with other methods, we restrict ourselves to queries QA of bounded complexity. For example, consider arrangements A with at most twenty tags and twenty relations; this limit is never exceeded in any of the experiments. Enumerating these arrangements in some fashion, let Q = (Q1, . . . , QM) be the corresponding feature vector assuming values in {0, 1}^M. Each image x then generates a bit string of length M, which contains all the information available for estimating Y(x). Of course, M is enormous. Nonetheless, it is not evident how we might determine a priori which features are informative and thereby reduce M to manageable size.

Evidently these bit strings partition X. Two images that generate the same bit string or “atom” need not be identical. Indeed, due to the invariance properties of the queries, the two corresponding symbols may vary considerably in scale, location, and skew and are not even affine equivalent in general. Nonetheless, two such images will have very similar shapes. As a result, it is reasonable to expect that H(Y|Q) (the conditional entropy of Y given Q) is very small, in which case we can in principle obtain high classification rates using Q. To simplify things further, at least conceptually, we will assume that H(Y|Q) = 0; this is not an unreasonable assumption for large M. An equivalent assumption is that the shape class Y is determined by Q and that the error rate of the Bayes classifier

    Ŷ_B = arg max_c P(Y = c | Q)
is zero. Needless to say, perfect classification cannot actually be realized. Due to the size of M, the full posterior cannot be computed, and the classifier Ŷ_B is only hypothetical.

Suppose we examine some of the features by constructing a single binary tree T based on entropy-driven recursive partitioning and randomization and that T is uniformly of depth D, so that D of the M features are examined for each image x. (The exact procedure is described in the following section; the details are not important for the moment.) Suffice it to say that a feature Q_m is assigned to each interior node of T, and the set of features Q_{π_1}, . . . , Q_{π_D} along each branch from root to leaf is chosen sequentially and based on the current information content given the observed values of the previously chosen features. The classifier based on T is then

    Ŷ_T = arg max_c P(Y = c|T) = arg max_c P(Y = c|Q_{π_1}, . . . , Q_{π_D}).

Since D ≪ M, Ŷ_T is not the Bayes classifier. However, even for values of D on the order of hundreds or thousands, we can expect that P(Y = c|T) ≈ P(Y = c|Q). We shall refer to the difference between these distributions (in some appropriate norm) as the approximation error (AE). This is one of the sources of error in replacing Q by a subset of features. Of course, we cannot actually compute a tree of such depth since at least several hundred features are needed to achieve good classification; we shall return to this point shortly.

Regardless of the depth D, in reality we do not actually know the posterior distribution P(Y = c|T). Rather, it must be estimated from a training set L = {(x_1, Y(x_1)), . . . , (x_m, Y(x_m))}, where x_1, . . . , x_m is a random sample from P. (The training set is also used to estimate the entropy values during recursive partitioning.) Let P̂_L(Y = c|T) denote the estimated distribution, obtained by simply counting the number of training images of each class c that land at each terminal node of T. If L is sufficiently large, then P̂_L(Y = c|T) ≈ P(Y = c|T). We call the difference estimation error (EE), which of course vanishes only as |L| → ∞.

The purpose of multiple trees (see section 6) is to solve the approximation error problem and the estimation error problem at the same time. Even if we could compute and store a very deep tree, there would still be too many probabilities (specifically K · 2^D) to estimate with a practical training set L. Our approach is to build multiple trees T_1, . . . , T_N of modest depth. In this way tree construction is practical, and P̂_L(Y = c|T_n) ≈ P(Y = c|T_n), n = 1, . . . , N. Moreover, the total number of features examined is sufficiently large to control the approximation error. The classifier we propose is

    Ŷ_S = arg max_c (1/N) Σ_{n=1}^{N} P̂_L(Y = c|T_n).
An explanation for this particular way of aggregating the information from multiple trees is provided in section 6.1. In principle, a better way to combine the trees would be to classify based on the mode of P(Y = c|T_1, . . . , T_N).
However, this is impractical for reasonably sized training sets for the same reasons that a single deep tree is impractical (see section 6.4 for some numerical experiments). The trade-off between AE and EE is related to the trade-off between bias and variance, which is discussed in section 6.2, and the relative error rates among all these classifiers are analyzed in more detail in section 6.4 in the context of parameter estimation.

5 Tree-Structured Shape Quantization

Standard decision tree construction (Breiman et al., 1984; Quinlan, 1986) is based on a scalar-valued feature or attribute vector z = (z_1, . . . , z_k), where k is generally about 10–100. Of course, in pattern recognition, the raw data are images, and finding the right attributes is widely regarded as the main issue. Standard splitting rules are based on functions of this vector, usually involving a single component z_j (e.g., applying a threshold) but occasionally involving multivariate functions or "transgenerated features" (Friedman, 1973; Gelfand & Delp, 1991; Guo & Gelfand, 1992; Sethi, 1991). In our case, the queries {Q_A} are the candidates for splitting rules. We now describe the manner in which the queries are used to construct a tree.

5.1 Exploring Shape Space

Since the set of queries Q is indexed by graphs, there is a natural partial ordering under which a graph precedes any of its extensions. The partial ordering corresponds to a hierarchy of structure. Small arrangements with few tags produce coarse splits of shape space. As the arrangements increase in size (say, in the number of tags plus relations), they contain more and more information about the images that contain them. However, fewer and fewer images contain such an instance—that is, P(Q = 1) ≈ 0 for a query Q based on a complex arrangement.

One straightforward way to exploit this hierarchy is to build a decision tree using the collection Q as candidates for splitting rules, with the complexity of the queries increasing with tree depth (distance from the root). In order to begin to make this computationally feasible, we define a minimal extension of an arrangement A to mean the addition of exactly one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to an existing one. By a binary arrangement, we mean one with two tags and one relation; the collection of associated queries is denoted B ⊂ Q.

Now build a tree as follows. At the root, search through B and choose the query Q ∈ B that leads to the greatest reduction in the mean uncertainty about Y given Q. This is the standard criterion for recursive partitioning in machine learning and other fields. Denote the chosen query Q_{A_0}. Those data points for which Q_{A_0} = 0 are in the "no" child node, and we search again through B. Those data points for which Q_{A_0} = 1 are in the "yes" child node and have one or more instances of A_0, the "pending arrangement." Now search among minimal extensions of A_0 and choose the one that leads to the greatest reduction in uncertainty about Y given the existence of A_0.
Figure 7: Examples of node splitting. All six images lie in the same node and have a pending arrangement with three vertices. The “0”s are separated from the “3”s and “5”s by asking for the presence of a new tag, and then the “3”s and “5”s are separated by asking a question about the relative angle between two existing vertices. The particular tags associated with these vertices are not indicated.
The digits in Figure 6 were taken from a depth 2 ("yes"/"yes") node of such a tree. We measure uncertainty by Shannon entropy. The expected uncertainty in Y given a random variable Z is

    H(Y|Z) = − Σ_z P(Z = z) Σ_c P(Y = c|Z = z) log_2 P(Y = c|Z = z).
Define H(Y|Z, B) for an event B ⊂ X in the same way, except that P is replaced by the conditional probability measure P(·|B). Given we are at a node t of depth k > 0 in the tree, let the "history" be B_t = {Q_{A_0} = q_0, . . . , Q_{A_{k−1}} = q_{k−1}}, meaning that Q_{A_1} is the second query chosen given that q_0 ∈ {0, 1} is the answer to the first; Q_{A_2} is the third query chosen given the answers to the first two are q_0 and q_1; and so forth. The pending arrangement, say A_j, is the deepest arrangement along the path from root to t for which q_j = 1, so that q_i = 0, i = j + 1, . . . , k − 1. Then Q_{A_k} minimizes H(Y|Q_A, B_t) among minimal extensions of A_j. An example of node splitting is shown in Figure 7.

Continue in this fashion until a stopping criterion is satisfied, for example, the number of data points at every terminal node falls below a threshold. Each tree may then be regarded as a discrete random variable T on X; each terminal node corresponds to a different value of T.

In practice, we cannot compute these expected entropies; we can only estimate them from a training set L. Then P is replaced by the empirical distribution P̂_L on {x_1, . . . , x_m} in computing the entropy values.

5.2 Randomization

Despite the growth restrictions, the procedure above is still not practical; the number of binary arrangements is very large, and there are too many minimal extensions of more complex arrangements. In addition, if more than one tree is made, even with a fresh sample of data points per tree, there might be very little difference among the trees. The solution is simple: instead of searching among all the admissible queries at each node, we restrict the search to a small random subset.
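The following is a minimal sketch of this randomized, entropy-driven node splitting (our own function names; a query is any callable mapping an image to {0, 1}, and entropies are computed from the empirical distribution of the data at the node, as in the text):

    import math, random
    from collections import Counter

    def cond_entropy(labels, answers):
        # Empirical H(Y | Q): weight the class entropy of each answer
        # group ("no"/"yes") by the fraction of points in that group.
        n = len(labels)
        h = 0.0
        for a in (0, 1):
            group = [y for y, q in zip(labels, answers) if q == a]
            if group:
                freqs = Counter(group).values()
                h_a = -sum(c / len(group) * math.log2(c / len(group))
                           for c in freqs)
                h += (len(group) / n) * h_a
        return h

    def choose_query(admissible, data, labels, subset_size=20):
        # Score only a small random subset of the admissible queries
        # and keep the one minimizing the estimated H(Y | Q, B_t).
        sample = random.sample(admissible, min(subset_size, len(admissible)))
        return min(sample,
                   key=lambda q: cond_entropy(labels, [q(x) for x in data]))

Restricting the search to a random subset is also what makes the trees in a family differ from one another, a point taken up in section 6.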
Figure 8: Samples from data sets. (Top) Spot noise. (Middle) Duplication. (Bottom) Severe perturbations.
5.3 A Structural Description

Notice that only connected arrangements can be selected, meaning every two tags are neighbors (participate in a relation) or are connected by a sequence of neighboring tags. As a result, training is more complex than standard recursive partitioning. At each node, a list must be assigned to each data point consisting of all instances of the pending arrangement, including the coordinates of each participating tag. If a data point passes to the "yes" child, then only those instances that can be incremented are maintained and updated; the rest are deleted. The more data points there are, the more bookkeeping.

A far simpler possibility is sampling exclusively from B, the binary arrangements (i.e., two vertices and one relation), listed in some order. In fact, we can imagine evaluating all the queries in B for each data point. This vector could then be used with a variety of standard classifiers, including decision trees built in the standard fashion. In the latter case, the pending arrangements are unions of binary graphs, each one disconnected from all the others. This approach is much simpler and faster to implement and preserves the semi-invariance. However, the price is dear: losing the common, global characterization of shape in terms of a large, connected graph. Here we are referring to the pending arrangements at the terminal nodes (except at the end of the all-"no" branch); by definition, this graph is found in all the shapes at the node. This is what we mean by a structural description.

The difference between one connected graph and a union of binary graphs can be illustrated as follows. Relative to the entire population X, a random selection in B is quite likely to carry some information about Y, measured, say, by the mutual information I(Y, Q) = H(Y) − H(Y|Q). On the other hand, a random choice among all queries with, say, five tags will most likely have no information because nearly all data points x will answer "no." In other words, it makes sense at least to start with binary arrangements. Assume, however, that we are restricted to a subset {Q_A = 1} ⊂ X determined by an arrangement A of moderate complexity. (In general, the subsets at the nodes are determined by the "no" answers as well as the "yes" answers, but the situation is virtually the same.) On this small subset, a randomly sampled binary arrangement will be less likely to yield a significant drop in uncertainty than a randomly sampled query among minimal extensions of A. These observations have been verified experimentally, and we omit the details.

This distinction becomes more pronounced if the images are noisy (see the top panel of Figure 8) or contain structured backgrounds (see the bottom panel of Figure 11) because there will be many false positives for arrangements with only two tags. However, the chance of finding complex arrangements utilizing noise tags or background tags is much smaller. Put differently, a structural description is more robust than a list of attributes. The situation is the same for more complex shapes; see, for example, the middle panel of Figure 8, where the shapes were created by duplicating each symbol four times with some shifts. Again, a random choice among minimal extensions carries much more information than a random choice in B.

5.4 Semi-Invariance

Another benefit of the structural description is what we refer to as semi-invariance. Given a node t, let B_t be the history and A_j the pending arrangement. For any minimal extension A of A_j, and for any shape class c, we want

    max(P(Q_A = 0|Y = c, B_t), P(Q_A = 1|Y = c, B_t)) ≫ .5.

In other words, most of the images in B_t of the same class should answer the same way to query Q_A. In terms of entropy, semi-invariance is equivalent to relatively small values of H(Q_A|Y = c, B_t) for all c. Averaging over classes, this in turn is equivalent to small values of H(Q_A|Y, B_t) at each node t.

In order to verify this property, we created ten trees of depth 5 using the data set described in section 2 with thirty-two samples per symbol class.
At each nonterminal node t of each tree, the average value of H(Q_A|Y, B_t) was calculated over twenty randomly sampled minimal extensions. Over all nodes, the mean entropy was m = .33; this is the entropy of the distribution (.06, .94). The standard deviation over all nodes and queries was σ = .08. Moreover, there was a clear decrease in average entropy (i.e., an increase in the degree of invariance) as the depth of the node increased.

We also estimated the entropy for more severe deformations. On a more variable data set with approximately double the range of rotations, log scale, and log skew (relative to the values in section 2), and the same nonlinear deformations, the corresponding numbers were m = .38, σ = .09. Finally, for rotations sampled from (−30, 30) degrees, log scale from (−.5, .5), log skew from (−1, 1), and doubling the variance of the random nonlinear deformation (see the bottom panel of Figure 8), the corresponding mean entropy was m = .44 (σ = .11), corresponding to a (.1, .9) split. In other words, on average, 90 percent of the images in the same shape class still answer the same way to a new query.

Notice that the invariance property is independent of the discriminating power of the query, that is, the extent to which the distribution P(Y = c|B_t, Q_A) is more peaked than the distribution P(Y = c|B_t). Due to the symmetry of mutual information,

    H(Y|B_t) − H(Y|Q_A, B_t) = H(Q_A|B_t) − H(Q_A|Y, B_t).

This means that if we seek a question that maximizes the reduction in the conditional entropy of Y and assume the second term on the right is small due to semi-invariance, then we need only find a query that maximizes H(Q_A|B_t). This, however, does not involve the class variable and hence points to the possibility of unsupervised learning, which is discussed in the following section.

5.5 Unsupervised Learning

We outline two ways to construct trees in an unsupervised mode, that is, without using the class labels Y(x_j) of the samples x_j in L. Clearly each query Q_m decreases uncertainty about Q, and hence about Y. Indeed, H(Y|Q_m) ≤ H(Q|Q_m) since we are assuming Y is determined by Q. More generally, if T is a tree based on some of the components of Q and if H(Q|T) ≪ H(Q), then T should contain considerable information about the shape class.

Recall that in the supervised mode, the query Q_m chosen at node t minimizes H(Y|B_t, Q_m) (among a random sample of admissible queries), where B_t is the event in X corresponding to the answers to the previous queries. Notice that typically this is not equivalent to simply maximizing the information content of Q_m because H(Y|B_t, Q_m) = H(Y, Q_m|B_t) − H(Q_m|B_t), and both terms depend on m. However, in light of the discussion in the preceding section about semi-invariance, the first term can be ignored, and we can focus on maximizing the second term.
Another way to motivate this criterion is to replace Y by Q, in which case H(Q|B_t, Q_m) = H(Q, Q_m|B_t) − H(Q_m|B_t) = H(Q|B_t) − H(Q_m|B_t). Since the first term is independent of m, the query of choice will again be the one maximizing H(Q_m|B_t). Recall that the entropy values are estimated from training data and that Q_m is binary. It follows that growing a tree aimed at reducing uncertainty about Q is equivalent to finding at each node that query which best splits the data at the node into two equal parts. This results from the fact that maximizing H(p) = −p log_2(p) − (1 − p) log_2(1 − p) reduces to minimizing |p − .5|.

In this way we generate shape quantiles or clusters ignoring the class labels. Still, the tree variable T is highly correlated with the class variable Y. This would be the case even if the tree were grown from samples representing only some of the shape classes. In other words, these clustering trees produce a generic quantization of shape space. In fact, the same trees can be used to classify new shapes (see section 9).

We have experimented with such trees, using the splitting criterion described above as well as another unsupervised one based on the "question metric,"

    d_Q(x, x′) = (1/M) Σ_{m=1}^{M} δ(Q_m(x) ≠ Q_m(x′)),  x, x′ ∈ X,

where δ(S) = 1 if the statement S is true and δ(S) = 0 otherwise.
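Both unsupervised criteria are straightforward to compute; a minimal sketch (our own function names; code strings are 0/1 sequences):

    def question_metric(qx, qy):
        # d_Q(x, x'): fraction of queries on which the two binary
        # code strings disagree (a normalized Hamming distance).
        return sum(a != b for a, b in zip(qx, qy)) / len(qx)

    def balance_score(answers):
        # |p - .5| for a binary query at a node; minimizing this is
        # equivalent to maximizing the empirical entropy H(Q_m | B_t).
        p = sum(answers) / len(answers)
        return abs(p - 0.5)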
Since Q determines Y, it makes sense to divide the data so that each child is as homogeneous as possible with respect to d_Q; we omit the details. Both clustering methods lead to classification rates that are inferior to those obtained with splits determined by separating classes but still surprisingly high; one such experiment is reported in section 6.1.

6 Multiple Trees

We have seen that small, random subsets of the admissible queries at any node invariably contain at least one query that is informative about the shape class. What happens if many such trees are constructed using the same training set L? Because the family Q of queries is so large and because different queries—tag arrangements—address different aspects of shape, separate trees should provide separate structural descriptions, characterizing the shapes from different "points of view." This is illustrated in Figure 9, where the same image is shown with an instance of the pending graph at the terminal node in five different trees. Hence, aggregating the information provided by a family of trees (see section 6.1) should yield more accurate and more robust classification. This will be demonstrated in experiments throughout the remainder of the article.
Figure 9: Graphs found in an image at terminal nodes of five different trees.
Generating multiple trees by randomization was proposed in Geman, Amit, & Wilder (1996). Previously, other authors had advanced other methods for generating multiple trees. One of the earliest was weighted voting trees (Casey & Jih, 1983); Shlien (1990) uses different splitting criteria; Breiman (1994) uses bootstrap replicates of L; and Dietterich and Bakiri (1995) introduce the novel idea of replacing the multiclass learning problem by a family of two-class problems, dedicating a tree to each of these. Most of these articles deal with fixed-size feature vectors and coordinate-based questions. All authors report gains in accuracy and stability.

6.1 Aggregation

Suppose we are given a family of trees T_1, . . . , T_N. The best classifier based on these is

    Ŷ_A = arg max_c P(Y = c|T_1, . . . , T_N),

but this is not feasible (see section 6.4). Another option would be to regard the trees as high-dimensional inputs to standard classifiers. We tried that with classification trees, linear and nonlinear discriminant analysis, K-means clustering, and nearest neighbors, all without improvement over simple averaging for the amount of training data we used.

By averaging, we mean the following. Let μ_{n,τ}(c) denote the posterior distribution P(Y = c|T_n = τ), n = 1, . . . , N, c = 1, . . . , K, where τ denotes a terminal node. We write μ_{T_n} for the random variable μ_{n,T_n}. These probabilities are the parameters of the system, and the problem of estimating them will be discussed in section 6.4. Define

    μ̄(x) = (1/N) Σ_{n=1}^{N} μ_{T_n}(x),

the arithmetic average of the distributions at the leaves reached by x. The mode of μ̄(x) is the class assigned to the data point x; that is,

    Ŷ_S = arg max_c μ̄(c).
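A minimal sketch of this aggregation (our own names; each tree is assumed to expose a hypothetical leaf_distribution(x) method returning the estimated distribution {class: P̂_L(Y = c|T_n = leaf reached by x)} as a dict):

    def aggregate_classify(trees, x):
        # Drop x down each tree, average the terminal class
        # distributions, and return the mode of the average
        # together with the aggregate distribution mu_bar itself.
        mu_bar = {}
        for t in trees:
            for c, p in t.leaf_distribution(x).items():
                mu_bar[c] = mu_bar.get(c, 0.0) + p / len(trees)
        return max(mu_bar, key=mu_bar.get), mu_bar

The aggregate distribution μ̄ is reused below for fast indexing (section 10) and for rejection (section 11).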
Using a training database of thirty-two samples per symbol from the distribution described in section 2, we grew N = 100 trees of average depth d = 10 and tested the performance on a test set of five samples per symbol. The classification rate was 96 percent. This experiment was repeated several times with very similar results. On the other hand, growing one hundred unsupervised trees of average depth 11 and using the labeled data only to estimate the terminal distributions, we achieved a classification rate of 94.5 percent.

6.2 Dependence on the Training Set

The performance of classifiers constructed from training samples can be adversely affected by overdependence on the particular sample. One way to measure this is to consider the population of all training sets L of a particular size and to compute, for each data point x, the average E_L[e_L(x)], where e_L denotes the error at x for the classifier made with L. (These averages may then be further averaged over X.) The average error decomposes into two terms, one corresponding to bias and the other to variance (Geman, Bienenstock, & Doursat, 1992). Roughly speaking, the bias term captures the systematic errors of the classifier design, and the variance term measures the error component due to random fluctuations from one training set L to another. Generally parsimonious designs (e.g., those based on relatively few unknown parameters) yield low variance but highly biased decision boundaries, whereas complex nonparametric classifiers (e.g., neural networks with many parameters) suffer from high variance, at least without enormous training sets. Good generalization requires striking a balance. (See Geman et al., 1992, for a comprehensive treatment of the bias/variance dilemma; see also the discussions in Breiman, 1994; Kong & Dietterich, 1995; and Raudys & Jain, 1991.)

One simple experiment was carried out to measure the dependence of our classifier Ŷ_S on the training sample; we did not systematically explore the decomposition mentioned above. We made ten sets of twenty trees from ten different training sets, each consisting of thirty-two samples per symbol. The average classification rate was 85.3 percent; the standard deviation was 0.8 percent. Table 2 shows the number of images in the test set correctly labeled by j of the classifiers, j = 0, 1, . . . , 10. For example, we see that 88 percent of the test points are correctly labeled at least six out of ten times. Taking the plurality of the ten classifiers improves the classification rate to 95.5 percent, so there is some pointwise variability among the classifiers. However, the decision boundaries and overall performance are fairly stable with respect to L.

We attribute the relatively small variance component to the aggregation of many weakly dependent trees, which in turn results from randomization. The bias issue is more complex, and we have definitely noticed certain types of structural errors in our experiments with handwritten digits from the NIST database; for example, certain styles of writing are systematically misclassified despite the randomization effects.
Table 2: Number of Points as a Function of the Number of Correct Classifiers.

Number of correct classifiers    0    1    2    3    4    5    6    7     8     9    10
Number of points                 9   11   20   29   58   42   59   88   149   237   763
6.3 Relative Error Rates

Due to estimation error, we favor many trees of modest depth over a few deep ones, even at the expense of theoretically higher error rates where perfect estimation is possible. In this section, we analyze those error rates for some of the alternative classifiers discussed above in the asymptotic case of infinite data, assuming the total number of features examined is held fixed, presumably large enough to guarantee low approximation error. The implications for finite data are outlined in section 6.4.

Instead of making N trees T_1, . . . , T_N of depth D, suppose we made just one tree T* of depth ND; in both cases we are asking ND questions. Of course this is not practical for the values of D and N mentioned above (e.g., D = 10, N = 20), but it is still illuminating to compare the hypothetical performance of the two methods. Suppose further that the criterion for selecting T* is to minimize the error rate over all trees of depth ND:

    T* = arg max_T E[max_c P(Y = c|T)],

where the maximum is over all trees of depth ND. The error rate of the corresponding classifier Ŷ* = arg max_c P(Y = c|T*) is then e(Ŷ*) = 1 − E[max_c P(Y = c|T*)]. Notice that finding T* would require the solution of a global optimization problem that is generally intractable, accounting for the nearly universal adoption of greedy tree-growing algorithms based on entropy reduction, such as the one we are using. Notice also that minimizing the entropy H(Y|T) or the error rate P(Y ≠ Ŷ(T)) amounts to basically the same thing.

Let e(Ŷ_A) and e(Ŷ_S) be the error rates of Ŷ_A and Ŷ_S (defined in section 6.1), respectively. Then it is easy to show that e(Ŷ*) ≤ e(Ŷ_A) ≤ e(Ŷ_S). The first inequality results from the observation that the N trees of depth D could be combined into one tree of depth ND simply by grafting T_2 onto each terminal node of T_1, then grafting T_3 onto each terminal node of the new tree, and so forth. The error rate of the tree so constructed is just e(Ŷ_A). However, the error rate of T* is minimal among all trees of depth ND and hence is no greater than e(Ŷ_A). Since Ŷ_S is a function of T_1, . . . , T_N, the second inequality follows from a standard argument:
    P(Y ≠ Ŷ_S) = E[P(Y ≠ Ŷ_S|T_1, . . . , T_N)]
               ≥ E[P(Y ≠ arg max_c P(Y = c|T_1, . . . , T_N)|T_1, . . . , T_N)]
               = P(Y ≠ Ŷ_A).
6.4 Parameter Estimation

In terms of tree depth, the limiting factor is parameter estimation, not computation or storage. The probabilities P(Y = c|T*), P(Y = c|T_1, . . . , T_N), and P(Y = c|T_n) are unknown and must be estimated from training data. In each of the cases Ŷ* and Ŷ_A, there are K × 2^{ND} parameters to estimate (recall K is the number of shape classes), whereas for Ŷ_S there are K × N × 2^D parameters. Moreover, the number of data points in L available per parameter is |L|/(K · 2^{ND}) in the first two cases and |L|/(K · 2^D) with aggregation.

For example, consider the family of N = 100 trees described in section 6.1, which were used to classify the K = 293 LATEX symbols. Since the average depth is D = 8, there are approximately 100 × 2^8 × 293 ≈ 7.5 × 10^6 parameters, although most of these are nearly zero. Indeed, in all experiments reported below, only the largest five elements of μ_{n,τ} are estimated; the rest are set to zero. It should be emphasized, however, that the parameter estimates can be refined indefinitely using additional samples from X, a form of incremental learning (see section 9).

For Ŷ_A = arg max_c P(Y = c|T_1, . . . , T_N), the estimation problem is overwhelming, at least without assuming conditional independence or some other model for dependence. This was illustrated when we tried to compare the magnitudes of e(Ŷ_A) with e(Ŷ_S) in a simple case. We created N = 4 trees of depth D = 5 to classify just the first K = 10 symbols, which are the ten digits. The trees were constructed using a training set L with 1000 samples per symbol. Using Ŷ_S, the error rate on L was just under 6 percent; on a test set V of 100 samples per symbol, the error rate was 7 percent. Unfortunately, L was not large enough to estimate the full posterior given the four trees. Consequently, we tried using 1000, 2000, 4000, 10,000, and 20,000 samples per symbol for estimation. With two trees, the error rate was consistent from L to V, even with 2000 samples per symbol, and it was slightly lower than e(Ŷ_S). With three trees, there was a significant gap between the (estimated) e(Ŷ_A) on L and V, even with 20,000 samples per symbol; the estimated value of e(Ŷ_A) on V was 6 percent compared with 8 percent for e(Ŷ_S). With four trees and using 20,000 samples per symbol, the estimate of e(Ŷ_A) on V was about 6 percent, and about 1 percent on L. It was only 1 percent better than e(Ŷ_S), which was 7 percent and required only 1000 samples per symbol. We did not go beyond 20,000 samples per symbol. Ultimately Ŷ_A will do better, but the amount of data needed to demonstrate this is prohibitive, even for four trees. Evidently the same problems would be encountered in trying to estimate the error rate for a very deep tree.
7 Performance Bounds

We divide this into two cases: individual trees and multiple trees. Most of the analysis for individual trees concerns a rather ideal case (twenty questions) in which the shape classes are atomic; there is then a natural metric on shape classes, and one can obtain bounds on the expected uncertainty after a given number of queries in terms of this metric and an initial distribution over classes. The key issue for multiple trees is weak dependence, and the analysis there is focused on the dependence structure among the trees.

7.1 Individual Trees: Twenty Questions

Suppose first that each shape class or hypothesis c is atomic, that is, it consists of a single atom of Q (as defined in section 4). In other words, each "hypothesis" c has a unique code word, which we denote by Q(c) = (Q_1(c), . . . , Q_M(c)), so that Q is determined by Y. This setting corresponds exactly to a mathematical version of the twenty questions game. There is also an initial distribution ν(c) = P(Y = c). For each m = 1, . . . , M, the binary sequence (Q_m(1), . . . , Q_m(K)) determines a subset of hypotheses—those that answer yes to query Q_m. Since the code words are distinct, asking enough questions will eventually determine Y. The mathematical problem is to find the ordering of the queries that minimizes the mean number of queries needed to determine Y, or the mean uncertainty about Y after a fixed number of queries.

The best-known example is when there is a query for every subset of {1, . . . , K}, so that M = 2^K. The optimal strategy is given by the Huffman code, in which case the mean number of queries required to determine Y lies in the interval [H(Y), H(Y) + 1) (see Cover & Thomas, 1991).

Suppose π_1, . . . , π_k represent the indices of the first k queries. The mean residual uncertainty about Y after k queries is then

    H(Y|Q_{π_1}, . . . , Q_{π_k}) = H(Y, Q_{π_1}, . . . , Q_{π_k}) − H(Q_{π_1}, . . . , Q_{π_k})
                                 = H(Y) − H(Q_{π_1}, . . . , Q_{π_k})   (since the queries are functions of Y)
                                 = H(Y) − (H(Q_{π_1}) + H(Q_{π_2}|Q_{π_1}) + · · · + H(Q_{π_k}|Q_{π_1}, . . . , Q_{π_{k−1}})).
Consequently, if at each stage there is a query that divides the active hypotheses into two groups such that the mass of the smaller group is at least β (0 < β ≤ .5), then H(Y|Q_{π_1}, . . . , Q_{π_k}) ≤ H(Y) − kH(β), where H(β) denotes the entropy of the distribution (β, 1 − β). The mean decision time is roughly H(Y)/H(β). In all unsupervised trees we produced, we found H(Q_{π_k}|Q_{π_1}, . . . , Q_{π_{k−1}}) to be greater than .99 (corresponding to β ≈ .5) at 95 percent of the nodes.

If assumptions are made about the degree of separation among the code words, one can obtain bounds on mean decision times and the expected uncertainty after a fixed number of queries, in terms of the prior distribution ν. For these types of calculations, it is easier to work with the Hellinger
measure of uncertainty than with Shannon entropy. Given a probability vector p = (p_1, . . . , p_J), define

    G(p) = Σ_{j≠i} √(p_j p_i),

and define G(Y), G(Y|B_t), and G(Y|B_t, Q_m) the same way as with the entropy function H. (G and H have similar properties; for example, G is minimized on a point mass, maximized on the uniform distribution, and it follows from Jensen's inequality that H(p) ≤ log_2[G(p) + 1].) The initial amount of uncertainty is

    G(Y) = Σ_{c≠c′} ν^{1/2}(c) ν^{1/2}(c′).

For any subset {m_1, . . . , m_k} ⊂ {1, . . . , M}, using Bayes rule and the fact that P(Q|Y) is either 0 or 1, we obtain

    G(Y|Q_{m_1}, . . . , Q_{m_k}) = Σ_{c≠c′} Π_{i=1}^{k} δ(Q_{m_i}(c) = Q_{m_i}(c′)) ν^{1/2}(c) ν^{1/2}(c′).

Now suppose we average G(Y|Q_{m_1}, . . . , Q_{m_k}) over all subsets {m_1, . . . , m_k} (allowing repetition). The average is

    M^{−k} Σ_{(m_1,...,m_k)} G(Y|Q_{m_1}, . . . , Q_{m_k})
        = Σ_{c≠c′} M^{−k} Σ_{(m_1,...,m_k)} Π_{i=1}^{k} δ(Q_{m_i}(c) = Q_{m_i}(c′)) ν^{1/2}(c) ν^{1/2}(c′)
        = Σ_{c≠c′} (1 − d_Q(c, c′))^k ν^{1/2}(c) ν^{1/2}(c′).

Consequently, any better-than-average subset of queries satisfies

    G(Y|Q_{m_1}, . . . , Q_{m_k}) ≤ Σ_{c≠c′} (1 − d_Q(c, c′))^k ν^{1/2}(c) ν^{1/2}(c′).

If γ = min_{c,c′} d_Q(c, c′), then the residual uncertainty is at most (1 − γ)^k G(Y). In order to disambiguate K hypotheses under a uniform starting distribution (in which case G(Y) = K − 1), we would need approximately

    k ≈ −log K / log(1 − γ)

queries, or k ≈ (log K)/γ for small γ. (This is clear without the general inequality above, since we eliminate a fraction γ of the remaining hypotheses with each new query.) This value of k is too large to be practical for realistic values of γ (due to storage, etc.) but does express the divide-and-conquer nature of recursive partitioning in the logarithmic dependence on the number of hypotheses.
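As a quick numeric illustration of the last formula (a sketch with hypothetical values; the text does not report a γ for this data):

    import math

    def queries_needed(K, gamma):
        # k ~ -log K / log(1 - gamma): queries needed to drive the
        # initial Hellinger uncertainty G(Y) = K - 1 below 1 when each
        # better-than-average query shrinks it by a factor (1 - gamma).
        return math.log(K) / -math.log(1.0 - gamma)

    print(round(queries_needed(293, 0.1)))  # -> 54 queries for K = 293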
Needless to say, the compound case is the only realistic one, where the number of atoms in a shape class is a measure of its complexity. (For example, we would expect many more atoms per handwritten digit class than per printed font class.) In the compound case, one can obtain results similar to those mentioned above by considering the degree of homogeneity within classes as well as the degree of separation between classes. For example, the index γ must be replaced by one based on both the maximum distance D_max between code words of the same class and the minimum distance D_min between code words from different classes. Again, the bounds obtained call for trees that are too deep actually to be made, and much deeper than those that are empirically demonstrated to obtain good discrimination. We achieve this in practice due to semi-invariance, guaranteeing that D_max is small, and the extraordinary richness of the world of spatial relationships, guaranteeing that D_min is large.

7.2 Multiple Trees: Weak Dependence

From a statistical perspective, randomization leads to weak conditional dependence among the trees. For example, given Y = c, the correlation between two trees T_1 and T_2 is small. In other words, given the class of an image, knowing the leaf of T_1 that is reached would not aid us in predicting the leaf reached in T_2.

In this section, we analyze the dependence structure among the trees and obtain a crude lower bound on the performance of the classifier Ŷ_S for a fixed family of trees T_1, . . . , T_N constructed from a fixed training set L. Thus we are not investigating the asymptotic performance of Ŷ_S as either N → ∞ or |L| → ∞. With infinite training data, a tree could be made arbitrarily deep, leading to arbitrarily high classification rates since nonparametric classifiers are generally strongly consistent.

Let E_c μ̄ = (E_c μ̄(1), . . . , E_c μ̄(K)) denote the mean of μ̄ conditioned on Y = c: E_c μ̄(d) = (1/N) Σ_{n=1}^{N} E(μ_{T_n}(d)|Y = c). We make three assumptions about the mean vector, all of which turn out to be true in practice:

1. arg max_d E_c μ̄(d) = c.
2. E_c μ̄(c) = α_c ≫ 1/K.
3. E_c μ̄(d) ≈ (1 − α_c)/(K − 1) for d ≠ c.

The validity of the first two is clear from Table 3. The last assumption says that the amount of mass in the mean aggregate distribution that is off the true class tends to be uniformly distributed over the other classes.

Let S_K denote the K-dimensional simplex (probability vectors in R^K), and let U_c = {μ : arg max_d μ(d) = c}, an open convex subset of S_K. Define φ_c to be the (Euclidean) distance from E_c μ̄ to ∂U_c, the boundary of U_c. Clearly ‖μ − E_c μ̄‖ < φ_c implies that arg max_d μ(d) = c, where ‖·‖ denotes Euclidean norm. This is used below to bound the misclassification rate. First, however,
Table 3: Estimates of α_c, γ_c, and e_c for Ten Classes.

Class    0      1      2      3      4      5      6      7      8      9
α_c      0.66   0.86   0.80   0.74   0.74   0.64   0.56   0.86   0.49   0.68
γ_c      0.03   0.01   0.01   0.01   0.03   0.02   0.04   0.01   0.02   0.01
e_c      0.14   0.04   0.03   0.04   0.11   0.13   0.32   0.02   0.23   0.05
we need to compute φ_c. Clearly, ∂U_c = ∪_{d≠c} {μ ∈ S_K : μ(c) = μ(d)}. From symmetry arguments, a point in ∂U_c that achieves the minimum distance to E_c μ̄ will lie in each of the sets in the union above. A straightforward computation involving orthogonal projections then yields

    φ_c = (α_c K − 1) / (√2 (K − 1)).

Using Chebyshev's inequality, a crude upper bound on the misclassification rate for class c is obtained as follows:

    P(Ŷ_S ≠ c|Y = c) = P(x : arg max_d μ̄(x, d) ≠ c|Y = c)
                     ≤ P(‖μ̄ − E_c μ̄‖ > φ_c|Y = c)
                     ≤ (1/φ_c^2) E‖μ̄ − E_c μ̄‖^2
                     = (1/(φ_c^2 N^2)) Σ_{d=1}^{K} [ Σ_{n=1}^{N} Var(μ_{T_n}(d)|Y = c) + Σ_{n≠m} Cov(μ_{T_n}(d), μ_{T_m}(d)|Y = c) ].

Let η_c denote the sum of the conditional variances, and let γ_c denote the sum of the conditional covariances, both averaged over the trees:

    η_c = (1/N) Σ_{n=1}^{N} Σ_{d=1}^{K} Var(μ_{T_n}(d)|Y = c),
    γ_c = (1/N^2) Σ_{n≠m} Σ_{d=1}^{K} Cov(μ_{T_n}(d), μ_{T_m}(d)|Y = c).

We see that

    P(Ŷ_S ≠ c|Y = c) ≤ (γ_c + η_c/N)/φ_c^2 = 2(γ_c + η_c/N)(K − 1)^2 / (α_c K − 1)^2.
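The final bound is easy to evaluate; a sketch (using rounded inputs from Table 3; exact agreement with the tabulated e_c would require the unrounded estimates and the η_c/N term):

    def misclassification_bound(alpha_c, gamma_c, K, eta_c=0.0, N=1):
        # 2 * (gamma_c + eta_c / N) * (K - 1)^2 / (alpha_c * K - 1)^2,
        # the Chebyshev bound on P(Y_hat_S != c | Y = c) derived above.
        return 2 * (gamma_c + eta_c / N) * (K - 1) ** 2 / (alpha_c * K - 1) ** 2

    print(misclassification_bound(0.66, 0.03, K=10))  # ~0.155, vs. e_0 = 0.14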
Since η_c/N will be small compared with γ_c, the key parameters are α_c and γ_c. This inequality yields only coarse bounds. However, it is clear that under the assumptions above, high classification rates are feasible as long as γ_c is sufficiently small and α_c is sufficiently large, even if the estimates μ_{T_n} are poor.

Observe that the N trees form a simple random sample from some large population T of trees under a suitable distribution on T. This is due to the randomization aspect of tree construction. (Recall that at each node, the splitting rule is chosen from a small random sample of queries.) Both E_c μ̄ and the sum of variances are sample means of functionals on T. The sum of the covariances has the form of a U-statistic. Since the trees are drawn independently and the range of the corresponding variables is very small (typically less than 1), standard statistical arguments imply that these sample means are close to the corresponding population means for a moderate number N of trees, say, tens or hundreds. In other words, α_c ≈ E_T E_X(μ_T(c)|Y = c) and γ_c ≈ E_{T×T} Σ_{d=1}^{K} Cov_X(μ_{T_1}(d), μ_{T_2}(d)|Y = c). Thus the conditions on α_c and γ_c translate into conditions on the corresponding expectations over T, and the performance variability among the trees can be ignored.

Table 3 shows some estimates of α_c and γ_c and the resulting bound e_c on the misclassification rate P(Ŷ_S ≠ c|Y = c). Ten pairs of random trees were made on ten classes to estimate γ_c and α_c. Again, the bounds are crude; they could be refined by considering higher-order joint moments of the trees.

8 Generalization

For convenience, we will consider two types of generalization, referred to as interpolation and extrapolation. Our use of these terms may not be standard and is decidedly ad hoc. Interpolation is the easier case; both the training and testing samples are randomly drawn from (X, P), and the number of training samples is sufficiently large to cover the space X. Consequently, for most test points, the classifier is being asked to interpolate among nearby training points.

By extrapolation we mean situations in which the training samples do not represent the space from which the test samples are drawn—for example, training on a very small number of samples per symbol (e.g., one); using different perturbation models to generate the training and test sets, perhaps adding more severe scaling or skewing; or degrading the test images with correlated noise or lowering the resolution. Another example of this occurred at the first NIST competition (Wilkinson et al., 1992); the hand-printed digits in the test set were written by a different population from those in the distributed training set. (Not surprisingly, the distinguishing feature of the winning algorithm was the size and diversity of the actual samples used to train the classifier.) One way to characterize such situations is to regard P as a mixture distribution P = Σ_i α_i P_i, where the P_i might correspond to writer populations, perturbation models, or levels of degradation, for instance.
Table 4: Classification Rates for Various Training Sample Sizes Compared with Nearest-Neighbor Methods.

Sample Size    Trees    NN(B)    NN(raw)
1              44%      11%       5%
8              87       57       31
32             96       74       55
In complex visual recognition problems, the number of terms might be very large, but the training samples might be drawn from relatively few of the P_i and hence represent a biased sample from P.

In order to gauge the difficulty of the problem, we shall consider the performance of two other classifiers, based on k-nearest-neighbor classification with k = 5, which was more or less optimal in our setting. (Using nearest neighbors as a benchmark is common; see, for example, Geman et al., 1992; Khotanzad & Lu, 1991.) Let NN(raw) refer to nearest-neighbor classification based on Hamming distance in (binary) image space, that is, between bitmaps. This is clearly the wrong metric, but it helps to calibrate the difficulty of the problem. Of course, this metric is entirely blind to invariance but is not entirely unreasonable when the symbols nearly fill the bounding box and the degree of perturbation is limited. Let NN(B) refer to nearest-neighbor classification based on the binary tag arrangements. Thus, two images x and x′ are compared by evaluating Q(x) and Q(x′) for all Q ∈ B_0 ⊂ B and computing the Hamming distance between the corresponding binary sequences. B_0 was chosen as the subset of binary tag arrangements that split X to within 5 percent of fifty-fifty. There were 1510 such queries out of the 15,376 binary tag arrangements. Due to invariance and other properties, we would expect this metric to work better than Hamming distance in image space, and of course it does (see below).
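A minimal sketch of the NN(B) baseline (our own names; each training item pairs a precomputed binary query string for the arrangements in B_0 with its class label):

    from collections import Counter

    def nn_b_classify(train, x_queries, k=5):
        # k-nearest-neighbor vote under Hamming distance between
        # binary query strings (0/1 tuples), as in NN(B).
        def hamming(a, b):
            return sum(u != v for u, v in zip(a, b))
        nearest = sorted(train, key=lambda item: hamming(item[0], x_queries))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]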
8.1 Interpolation

One hundred (randomized) trees were constructed from a training data set with thirty-two samples for each of the K = 293 symbols. The average classification rate per tree on a test set V consisting of 100 samples per symbol was 27 percent. However, the performance of the classifier Ŷ_S based on 100 trees was 96 percent. This clearly demonstrates the weak dependence among randomized trees (as well as the discriminating power of the queries). With the NN(B) classifier, the classification rate was 74 percent; with NN(raw), the rate was 55 percent (see Table 4). All of these rates are on the test set.

When the only random perturbations are nonlinear (i.e., no scaling, rotation, or skew), there is not much standardization that can be done to the raw image (see Figure 10).
Figure 10: LATEX symbols perturbed with only nonlinear deformations.
With thirty-two samples per symbol, NN(raw) climbs to 76 percent, whereas the trees reach 98.5 percent.

8.2 Extrapolation

We also grew trees using only the original prototypes x*_c, c = 1, . . . , 293, recursively dividing this group until pure leaves were obtained. Of course, the trees are relatively shallow. In this case, only about half the symbols in X could then be recognized (see Table 4).

The 100 trees grown with thirty-two samples per symbol were tested on samples that exhibit a greater level of distortion or variability than described up to this point. The results appear in Table 5. "Upscaling" (resp. "downscaling") refers to uniform sampling between the original scale and twice (resp. half) the original scale, as in the top (resp. middle) panel of Figure 11; "spot noise" refers to adding correlated noise (see the top panel of Figure 8). Clutter (see the bottom panel of Figure 11) refers to the addition of pieces of other symbols in the image. All of these distortions came in addition to the random nonlinear deformations, skew, and rotations. Downscaling creates more confusions due to extreme thinning of the stroke.

Notice that the NN(B) classifier falls apart with spot noise. The reason is the number of false positives: tags due to the noise induce random occurrences of simple arrangements. In contrast, complex arrangements A are far less likely to be found in the image by pure chance; therefore, chance occurrences are weeded out deeper in the tree.

8.3 Note

The purpose of all the experiments in this article is to illustrate various attributes of the recognition strategy. No effort was made to optimize the classification rates. In particular, the same tags and tree-making protocol were used in every experiment.
Table 5: Classification Rates for Various Perturbations.

Type of Perturbation    Trees    NN(B)    NN(raw)
Original                96%      74%      55%
Upscaling               88       57        0
Downscaling             80       52        0
Spot noise              71       28       57
Clutter                 74       27       59
Figure 11: (Top) Upscaling. (Middle) Downscaling. (Bottom) Clutter.
Experiments were repeated several times; the variability was negligible.

One direction that appears promising is explicitly introducing different protocols from tree to tree in order to decrease the dependence. One small experiment was carried out in this direction. All the images were subsampled to half the resolution; for example, 32 × 32 images become 16 × 16. A tag tree was made with 4 × 4 subimages from the subsampled data set, and one hundred trees were grown using the subsampled training set.
The output of these trees was combined with the output of the original trees on the test data. No change in the classification rate was observed for the original test set. For the test set with spot noise, the two sets of trees each had a classification rate of about 72 percent; combined, however, they yielded a rate of 86 percent. Clearly there is significant potential for improvement in this direction.
9 Incremental Learning and Universal Trees

The parameters μ_{n,τ}(c) = P(Y = c|T_n = τ) can be incrementally updated with new training samples. Given a set of trees, the actual counts from the training set (instead of the normalized distributions) are kept in the terminal nodes τ. When a new labeled sample is obtained, it can be dropped down each of the trees and the corresponding counters incremented. There is no need to keep the image itself. This separation between tree construction and parameter estimation is crucial. It provides a mechanism for gradually learning to recognize an increasing number of shapes.

Trees originally constructed with training samples from a small number of classes can eventually be updated to accommodate new classes; the parameters can be reestimated. In addition, as more data points are observed, the estimates of the terminal distributions can be perpetually refined. Finally, the trees can be further deepened as more data become available. Each terminal node is assigned a randomly chosen list of minimal extensions of the pending arrangement. The answers to these queries are then calculated and stored for each new labeled sample that reaches that node; again there is no need to keep the sample itself. When sufficiently many samples are accumulated, the best query on the list is determined by a simple calculation based on the stored information, and the node can then be split.

The adaptivity to additional classes is illustrated in the following experiment. A set of one hundred trees was grown with training samples from 50 classes randomly chosen from the full set of 293 classes. The trees were grown to depth 10 just as before (see section 8). Using the original training set of thirty-two samples per class for all 293 classes, the terminal distributions were estimated and recorded for each tree. The aggregate classification rate on all 293 classes was about 90 percent, as compared with about 96 percent when the full training set is used for both quantization and parameter estimation. Clearly fifty shapes are sufficient to produce a reasonably sharp quantization of the entire shape space.

As for improving the parameter estimates, recall that the one hundred trees grown with the pure symbols reached 44 percent on the test set. The terminal distributions of these trees were then updated using the original training set of thirty-two samples per symbol. The classification rate on the same test set climbed from 44 percent to 90 percent.
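The update itself is trivial; a minimal sketch (our own names; each tree is assumed to expose a hypothetical leaf(x) method returning the terminal node reached by x):

    def update_counts(trees, counts, x, y):
        # Incremental learning: drop the new labeled sample (x, y)
        # down every tree and increment the class counter at the
        # terminal node it reaches.  counts[n] maps (leaf, class)
        # -> count; normalizing the counts at a leaf recovers the
        # terminal distribution mu_{n,tau}(c).
        for n, t in enumerate(trees):
            key = (t.leaf(x), y)
            counts[n][key] = counts[n].get(key, 0) + 1

The image x can then be discarded; only the counters persist.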
10 Fast Indexing

One problem with recognition paradigms such as "hypothesize and test" is determining which particular hypothesis to test. Indexing into the shape library is therefore a central issue, especially with methods based on matching image data to model data and involving large numbers of shape classes. The standard approach in model-based vision is to flag plausible interpretations by searching for key features or discriminating parts in hierarchical representations.

Indexing efficiency seems to be inversely related to stability with respect to image degradation. Deformable templates are highly robust because they provide a global interpretation for many of the image data. However, a good deal of searching may be necessary to find the right template. The method of invariant features lies at the other extreme of this axis: the indexing is one shot, but there is not much tolerance to distortions of the data. We have not attempted to formulate this trade-off in a manner susceptible to experimentation.

We have noticed, however, that multiple trees appear to offer a reliable mechanism for fast indexing, at least within the framework of this article and in terms of narrowing down the number of possible classes. For example, in the original experiment with a 96 percent classification rate, the five highest-ranking classes in the aggregate distribution μ̄ contained the true class in all but four images in a test set of size 1465 (five samples per class). Even with upscaling, for example, the true label was among the top five in 98 percent of the cases. These experiments suggest that very high recognition rates could be obtained with final tests dedicated to ambiguous cases, as determined, for example, by the mode of μ̄.
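Such a candidate list is immediate from the aggregate distribution; a minimal sketch (reusing the μ̄ dictionary returned by the aggregation sketch in section 6.1):

    def index_candidates(mu_bar, top=5):
        # Fast indexing: the `top` highest-ranking classes under the
        # aggregate distribution, to be passed to dedicated final tests.
        return sorted(mu_bar, key=mu_bar.get, reverse=True)[:top]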
11 Handwritten Digit Recognition

The optical character recognition (OCR) problem has many variations, and the literature is immense; one survey is Mori, Suen, and Yamamoto (1992). In the area of handwritten character recognition, perhaps the most difficult problem is the recognition of unconstrained script; zip codes and hand-drawn checks also present a formidable challenge. The problem we consider is off-line recognition of isolated binary digits. Even this special case has attracted enormous attention, including a competition sponsored by the NIST (Wilkinson et al., 1992), and there is still no solution that matches human performance, or even one that is commercially viable except in restricted situations. (For comparisons among methods, see Bottou et al., 1994, and the lucid discussion in Brown et al., 1993.) The best reported rates seem to be those obtained by AT&T Bell Laboratories: up to 99.3 percent by training and testing on composites of the NIST training and test sets (Bottou et al., 1994).

We present a brief summary of the results of applying the tree-based shape quantization method to the NIST database. (For a more detailed description, see Geman et al., 1996.)
Figure 12: Random sample of test images before (top) and after (bottom) preprocessing.
Our experiments were based on portions of the NIST database, which consists of approximately 223,000 binary images of isolated digits written by more than 2000 writers. The images vary widely in dimensions, ranging from about twenty to one hundred rows, and they also vary in stroke thickness and other attributes. We used 100,000 for training and 50,000 for testing. A random sample from the test set is shown in Figure 12.

All results reported in the literature utilize rather sophisticated methods of preprocessing, such as thinning, slant correction, and size normalization. For the sake of comparison, we did several experiments using a crude form of slant correction and scaling, and no thinning. Twenty-five trees were made. We stopped splitting when the number of data points in the second largest class fell below ten. The depth of the terminal nodes (i.e., the number of questions asked per tree) varied widely, the average over trees being 8.8. The average number of terminal nodes was about 600, and the average classification rate (determined by taking the mode of the terminal distribution) was about 91 percent. The best error rate we achieved with a single tree was about 7 percent.

The classifier was tested in two ways. First, we preprocessed (scaled and slant corrected) the test set in the same manner as the training set. The resulting classification rate is 99.2 percent (with no rejection).
Figure 13: Classification rate versus number of trees.
Figure 13 shows how the classification rate grows with the number of trees. Recall from section 6.1 that the estimated class of an image x is the mode of the aggregate distribution μ̄(x). A good measure of the confidence in this estimate is the value of μ̄(x) at the mode; call it M(x). It provides a natural mechanism for rejection, by classifying only those images x for which M(x) > m; no rejection corresponds to m = 0. For example, the classification rate is 99.5 percent with 1 percent rejection and 99.8 percent with 3 percent rejection. Finally, doubling the number of trees makes the classification rates 99.3 percent, 99.6 percent, and 99.8 percent at 0, 1, and 2 percent rejection, respectively.
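The rejection rule is a one-line test on the aggregate distribution; a minimal sketch (again reusing the μ̄ dictionary from section 6.1):

    def classify_with_rejection(mu_bar, m=0.0):
        # Accept the mode of mu_bar only if its mass M(x) exceeds
        # the threshold m; return None (reject) otherwise.
        # m = 0 means no rejection.
        c = max(mu_bar, key=mu_bar.get)
        return c if mu_bar[c] > m else None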
We performed a second experiment in which the test data were not preprocessed in the manner of the training data; in fact, the test images were classified without utilizing the size of the bounding box. This is especially important in the presence of noise and clutter, when it is essentially impossible to determine the size of the bounding box. Instead, each test image was classified with the same set of trees at two resolutions (original and halved) and three (fixed) slants. The highest of the resulting six modes determines the classification. The classification rate was 98.9 percent.

We classify approximately fifteen digits per second on a single-processor SUN Sparcstation 20 (without special efforts to optimize the code); the time is approximately equally divided between transforming to tags and answering questions. Test data can be dropped down the trees in parallel, in which case classification would become approximately twenty-five times faster.

12 Comparison with ANNs

The comparison with ANNs is natural in view of their widespread use in pattern recognition (Werbos, 1991) and several common attributes. In particular, neither approach is model based in the sense of utilizing explicit shape models. In addition, both are computationally very efficient in comparison with model-based methods as well as with memory-based methods (e.g., nearest neighbors). Finally, in both cases performance is diminished by overdedication to the training data (overfitting), and problems result from deficient approximation capacity or parameter estimation, or both. Two key differences are described in the following two subsections.

12.1 The Posterior Distribution and Generalization Error

It is not clear how a feature vector such as Q could be accommodated in the ANN framework. Direct use is not likely to be successful since ANNs based on very high-dimensional input suffer from poor generalization (Baum & Haussler, 1989), and very large training sets are then necessary in order to approximate complex decision boundaries (Raudys & Jain, 1991).

The role of the posterior distribution P(Y = c|Q) is more explicit in our approach, leading to a somewhat different explanation of the sources of error. In our case, the approximation error results from replacing the entire feature set by an adaptive subset (or really many subsets, one per tree). The difference between the full posterior and the tree-based approximations can be thought of as the analog of approximation error in the analysis of the learning capacity of ANNs (see, e.g., Niyogi & Girosi, 1996). In that case one is interested in the set of functions that can be generated by the family of networks of a fixed architecture. Some work has centered on approximation of the particular function Q → P(Y = c|Q). For example, Lee et al. (1991) consider least-squares approximation to the Bayes rule under conditional independence assumptions on the features; the error vanishes as the number of hidden units goes to infinity.

Estimation error in our case results from an inadequate amount of data to estimate the conditional distributions at the leaves of the trees. The analogous situation for ANNs is well known: even for a network of unlimited capacity, there is still the problem of estimating the weights from limited data (Niyogi & Girosi, 1996).
capacity, there is still the problem of estimating the weights from limited data (Niyogi & Girosi, 1996).

Finally, classifiers based on ANNs address classification in a less nonparametric fashion. In contrast to our approach, the classifier has an explicit functional (input-output) representation from the feature vector to $\hat{Y}$. Of course, a great many parameters may need to be estimated in order to achieve low approximation error, at least for very complex classification problems such as shape recognition. This leads to a relatively large variance component in the bias/variance decomposition. In view of these difficulties, the small error rates achieved by ANNs on handwritten digits and other shape recognition problems are noteworthy.

12.2 Invariance and the Visual System. The structure of ANNs for shape classification is partially motivated by observations about processing in the visual cortex. Thus, the emphasis has been on parallel processing and the local integration (“funneling”) of information from one layer to the next. Since this local integration is done in a spatially stationary manner, translation invariance is achieved. The most successful example in the context of handwritten digits is the convolutional neural net LeNet (LeCun et al., 1990). Scale, deformation, and other forms of invariance are achieved to a lesser degree, partially from using gray-level data and soft thresholds, but mainly by utilizing very large training sets and sophisticated image normalization procedures.

The neocognitron (Fukushima & Miyake, 1982; Fukushima & Wake, 1991) is an interesting mechanism to achieve a degree of robustness to shape deformations and scale changes in addition to translation invariance. This is done using hidden layers, which carry out a local ORing operation mimicking the operation of the complex neurons in V1. These layers are referred to as C-type layers in Fukushima and Miyake (1982). A dual-resolution version (Gochin, 1994) is aimed at achieving scale invariance over a larger range. Spatial stationarity of the connection weights, the C-layers, and the use of multiple resolutions are all forms of ORing aimed at achieving invariance. The complex cell in V1 is indeed the most basic evidence of such an operation in the visual system. There is additional evidence for disjunction at a very global scale in the cells of the inferotemporal cortex with very large receptive fields. These respond to certain shapes at all locations in the receptive field, at a variety of scales, but also for a variety of nonlinear deformations such as changes in proportions of certain parts (see Ito, Fujita, Tamura, & Tanaka, 1994; Ito, Tamura, Fujita, & Tanaka, 1995).

It would be extremely inefficient to obtain such invariance through ORing of responses to each of the variations of these shapes. It would be preferable to carry out the global ORing at the level of primitive generic features, which can serve to identify and discriminate between a large class of shapes. We have demonstrated that global features consisting of geometric
arrangements of local responses (tags) can do just that. The detection of any of the tag arrangements described in the preceding sections can be viewed as an extensive and global ORing operation. The ideal arrangement, such as a vertical edge northwest of a horizontal edge, is tested at all locations, all scales, and a large range of angles around the northwest vector. A positive response to any of these produces a positive answer to the associated query. This leads to a property we have called semi-invariance and allows our method to perform well without preprocessing and without very large training sets. Good performance extends to transformations or perturbations not encountered during training (e.g., scale changes, spot noise, and clutter).

The local responses we employ are very similar to ones employed at the first level of the convolutional neural nets, for example, in Fukushima and Wake (1991). However, at the next level, our geometric arrangements are defined explicitly and designed to accommodate a priori assumptions about shape regularity and variability. In contrast, the features in convolutional neural nets are all implicitly derived from local inputs to each layer and from the slack obtained by local ORing, or soft thresholding, carried out layer by layer. It is not clear that this is sufficient to obtain the required level of invariance, nor is it clear how portable the features are from one problem to another.

Evidence for correlated activities of neurons with distant receptive fields is accumulating, not only in V1 but in the lateral geniculate nucleus and in the retina (Neuenschwander & Singer, 1996). Gilbert, Das, Ito, Kapadia, and Westheimer (1996) report increasing evidence for more extensive horizontal connections. Large optical point spreads are observed, where subthreshold neural activation appears in an area covering the size of several receptive fields. When an orientation-sensitive cell fires in response to a stimulus in its receptive field, subthreshold activation is observed in cells with the same orientation preference in a wide area outside the receptive field. It appears therefore that the integration mechanisms in V1 are more global and definitely more complex than previously thought. Perhaps the features described in this article can contribute to bridging the gap between existing computational models and new experimental findings regarding the visual system.

13 Conclusion

We have studied a new approach to shape classification and illustrated its performance in high dimensions, with respect to both the number of shape classes and the degree of variability within classes. The basic premise is that shapes can be distinguished from one another by a sufficiently robust and invariant form of recursive partitioning. This quantization of shape space is based on growing binary classification trees using geometric arrangements among local topographic codes as splitting rules. The arrangements are semi-invariant to linear and nonlinear image transformations. As a result,
the method generalizes well to samples not encountered during training. In addition, due to the separation between quantization and estimation, the framework accommodates unsupervised and incremental learning. The codes are primitive and redundant, and the arrangements involve only simple spatial relationships based on relative angles and distances. It is not necessary to isolate distinguished points along shape boundaries or any other special differential or topological structures, or to perform any form of grouping, matching, or functional optimization. Consequently, a virtually infinite collection of discriminating shape features is generated with elementary computations at the pixel level. Since no classifier based on the full feature set is realizable, and since it is impossible to know a priori which features are informative, we have selected features and built trees at the same time by inductive learning.

Another data-driven nonparametric method for shape classification is based on ANNs, and a comparison was drawn in terms of invariance and generalization error, leading to the conclusion that prior information plays a relatively greater role in our approach.

We have experimented with the NIST database of handwritten digits and with a synthetic database constructed from linear and nonlinear deformations of about three hundred LaTeX symbols. Despite a large degree of within-class variability, the setting is evidently simplified, since the images are binary and the shapes are isolated. Looking ahead, the central question is whether our recognition strategy can be extended to more unconstrained scenarios, involving, for example, multiple or solid objects, general poses, and a variety of image formation models. We are aware that our approach differs from the standard one in computer vision, which emphasizes viewing and object models, 3D matching, and advanced geometry. Nonetheless, we are currently modifying the strategy in this article in order to recognize 3D objects against structured backgrounds (e.g., cluttered desktops) based on gray-level images acquired from ordinary video cameras (see Jedynak & Fleuret, 1996). We are also looking at applications to face detection and recognition.

Acknowledgments

We thank Ken Wilder for many valuable suggestions and for carrying out the experiments on handwritten digit recognition. Yali Amit has been supported in part by the ARO under grant DAAL-03-92-0322. Donald Geman has been supported in part by the NSF under grant DMS-9217655, ONR under contract N00014-91-J-1021, and ARPA under contract MDA972-93-10012.

References

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Comp., 1, 151–160.
Binford, T. O., & Levitt, T. S. (1993). Quasi-invariants: Theory and exploitation. In Proceedings of the Image Understanding Workshop (pp. 819–828). Washington, D.C.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwritten digit recognition. In Proc. 12th Inter. Conf. on Pattern Recognition (Vol. 2, pp. 77–82). Los Alamitos, CA: IEEE Computer Society Press.
Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 451). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Brown, D., Corruble, V., & Pittard, C. L. (1993). A comparison of decision tree classifiers with backpropagation neural networks for multimodal classification problems. Pattern Recognition, 26, 953–961.
Burns, J. B., Weiss, R. S., & Riseman, E. M. (1993). View variation of point set and line segment features. IEEE Trans. PAMI, 15, 51–68.
Casey, R. G., & Jih, C. R. (1983). A processor-based OCR system. IBM Journal of Research and Development, 27, 386–399.
Casey, R. G., & Nagy, G. (1984). Decision tree design using a probabilistic model. IEEE Trans. Information Theory, 30, 93–99.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. J. Artificial Intell. Res., 2, 263–286.
Forsyth, D., Mundy, J. L., Zisserman, A., Coelho, C., Heller, A., & Rothwell, C. (1991). Invariant descriptors for 3-D object recognition and pose. IEEE Trans. PAMI, 13, 971–991.
Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comput., 26, 404–408.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469.
Fukushima, K., & Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Networks, 2, 355–365.
Gelfand, S. B., & Delp, E. J. (1991). On tree structured classifiers. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition (pp. 51–70). Amsterdam: North-Holland.
Geman, D., Amit, Y., & Wilder, K. (1996). Joint induction of shape features and tree classifiers (Tech. Rep.). Amherst: University of Massachusetts.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci., 93, 615–622.
Gochin, P. M. (1994). Properties of simulated neurons from a model of primate inferior temporal cortex. Cerebral Cortex, 5, 532–543.
Guo, H., & Gelfand, S. B. (1992). Classification trees with neural network feature
extraction. IEEE Trans. Neural Networks, 3, 923–933.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23, 73–103.
Hussain, B., & Kabuka, M. R. (1994). A novel feature recognition neural network and its application to character recognition. IEEE Trans. PAMI, 16, 99–106.
Ito, M., Fujita, I., Tamura, H., & Tanaka, K. (1994). Processing of contrast polarity of visual images of inferotemporal cortex of Macaque monkey. Cerebral Cortex, 5, 499–508.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal response in monkey inferotemporal cortex. J. Neurophysiology, 73(1), 218–226.
Jedynak, B., & Fleuret, F. (1996). Reconnaissance d'objets 3D à l'aide d'arbres de classification. In Proc. Image Com 96. Bordeaux, France.
Jung, D.-M., & Nagy, G. (1995). Joint feature and classifier design for OCR. In Proc. 3rd Inter. Conf. Document Analysis and Processing (Vol. 2, pp. 1115–1118). Montreal. Los Alamitos, CA: IEEE Computer Society Press.
Khotanzad, A., & Lu, J.-H. (1991). Shape and texture recognition by a neural network. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Knerr, S., Personnaz, L., & Dreyfus, G. (1992). Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. Neural Networks, 3, 962–968.
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Proc. of the 12th International Conference on Machine Learning (pp. 313–321). San Mateo, CA: Morgan Kaufmann.
Lamdan, Y., Schwartz, J. T., & Wolfson, H. J. (1988). Object recognition by affine invariant matching. In IEEE Int. Conf. Computer Vision and Pattern Rec. (pp. 335–344). Los Alamitos, CA: IEEE Computer Society Press.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Mateo, CA: Morgan Kaufmann.
Lee, D.-S., Srihari, S. N., & Gaborski, R. (1991). Bayesian and neural network pattern recognition: A theoretical connection and empirical results with handwritten characters. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Martin, G. L., & Pitman, J. A. (1991). Recognizing hand-printed letters and digits using backpropagation learning. Neural Computation, 3, 258–267.
Mori, S., Suen, C. Y., & Yamamoto, K. (1992). Historical review of OCR research and development. In Proc. IEEE (Vol. 80, pp. 1029–1057). New York: IEEE.
Mundy, J. L., & Zisserman, A. (1992). Geometric invariance in computer vision. Cambridge, MA: MIT Press.
Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in cat retina and lateral geniculate nucleus. Nature, 379(22), 728–733.
Niyogi, P., & Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comp., 8, 819–842.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Raudys, S., & Jain, A. K. (1991). Small sample size problems in designing artificial neural networks. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Reiss, T. H. (1993). Recognizing planar objects using invariant image features. Lecture Notes in Computer Science No. 676. Berlin: Springer-Verlag.
Sabourin, M., & Mitiche, A. (1992). Optical character recognition by a neural network. Neural Networks, 5, 843–852.
Sethi, I. K. (1991). Decision tree performance enhancement using an artificial neural network implementation. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Shlien, S. (1990). Multiple binary decision tree classifiers. Pattern Recognition, 23, 757–763.
Simard, P. Y., LeCun, Y. L., & Denker, J. S. (1994). Memory-based character recognition using a transformation invariant metric. In Proc. 12th Inter. Conf. on Pattern Recognition (Vol. 2, pp. 262–267). Los Alamitos, CA: IEEE Computer Society Press.
Werbos, P. (1991). Links between artificial neural networks (ANN) and statistical pattern recognition. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Wilkinson, R. A., Geist, J., Janet, S., Grother, P., Gurges, C., Creecy, R., Hammond, B., Hull, J., Larsen, N., Vogl, T., & Wilson, C. (1992). The first census optical character recognition system conference (Tech. Rep. No. NISTIR 4912). Gaithersburg, MD: National Institute of Standards and Technology.
Received April 18, 1996; accepted November 1, 1996.
Communicated by Eric Mjolsness
Airline Crew Scheduling with Potts Neurons

Martin Lagerholm, Carsten Peterson, and Bo Söderberg
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-223 62 Lund, Sweden
A Potts feedback neural network approach for finding good solutions to resource allocation problems with a nonfixed topology is presented. As a target application, the airline crew scheduling problem is chosen. The topological complication is handled by means of a propagator defined in terms of Potts neurons. The approach is tested on artificial random problems tuned to resemble real-world conditions. Very good results are obtained for a variety of problem sizes. The computer time demand for the approach only grows like (number of flights)³. A realistic problem typically is solved within minutes, partly due to a prior reduction of the problem size, based on an analysis of the local arrival and departure structure at the single airports.

1 Introduction

In the past decade, feedback neural networks have emerged as a useful method to obtain good approximate solutions to various resource allocation problems (Hopfield & Tank, 1985; Peterson & Söderberg, 1989; Gislén, Söderberg, & Peterson, 1989, 1992). Most applications have concerned fairly academic problems like the traveling salesman problem (TSP) and various graph partition problems (Hopfield & Tank, 1985; Peterson & Söderberg, 1989). Gislén, Söderberg, and Peterson (1989, 1992) explored high school scheduling. The typical approach proceeds in two steps: (1) map the problem onto a neural network (spin) system with a problem-specific energy function, and (2) minimize the energy by means of deterministic mean field (MF) equations, which allow for a probabilistic interpretation. Two basic mapping variants are common: a hybrid (template) approach (Durbin & Willshaw, 1987) and a “purely” neural one. The template approach is advantageous for low-dimensional geometrical problems like the TSP, whereas for generic resource allocation problems, a purely neural Potts encoding is preferable.

A challenging resource allocation problem is airline crew scheduling, where a given flight schedule is to be covered by a set of crew rotations, each consisting of a connected sequence of flights (legs), starting and ending at a

Neural Computation 9, 1589–1599 (1997)
© 1997 Massachusetts Institute of Technology
given home base (hub). The total crew waiting time is then to be minimized, subject to a number of restrictions on the rotations. This application differs strongly from high school scheduling (Gislén, Söderberg, & Peterson, 1989, 1992) in the existence of nontrivial topological restrictions. A similar structure occurs in multitask telephone routing.

A common approach to this problem is to convert it into a set covering problem, by generating a large pool of legal rotation templates and seeking a subset of the templates that precisely covers the entire flight schedule. Solutions to the set covering problem are then found with linear programming techniques or feedback neural network methods (Ohlsson, Peterson, & Söderberg, 1993). A disadvantage with this method is that the rotation generation, for computational reasons, has to be nonexhaustive for a large problem; thus, only a fraction of the solution space is available.

The approach to the crew scheduling problem developed in this article is quite different and proceeds in two steps. First, the full solution space is narrowed down using a reduction technique that removes a large part of the suboptimal solutions. Then, a mean field annealing approach based on Potts neurons is applied; a novel key ingredient is the use of a propagator formalism for handling topology, leg counting, and so forth. The method, which is explored on random artificial problems resembling real-world situations, performs well with respect to quality, with a computational requirement that grows like $N_f^3$, where $N_f$ is the number of flights.

2 Nature of the Problem

Typically, a real-world flight schedule has a basic period of one week. Given such a schedule in terms of a set of $N_f$ weekly flights, with specified times and airports of departure and arrival, a crew is to be assigned to each flight such that the total crew waiting time is minimized, subject to the constraints:

• Each crew must follow a connected path (a rotation) starting and ending at the hub (see Figure 1).
• The number of flight legs in a rotation must not exceed a given upper bound $L_{max}$.
• The total duration (flight plus waiting time) of a rotation is similarly bounded by $T_{max}$.

Figure 1: Schematic view of the three crew rotations starting and ending in a hub.

These are the crucial and difficult constraints. In a real-world problem, there are often some twenty additional ones, which we neglect for simplicity; they constitute no additional challenge from an algorithmic point of view. Without the above constraints, the problem would reduce to that of minimizing waiting times independently at each airport (the local problems). These can be solved exactly in polynomial time, for example, by pairwise
connecting arrival flights with subsequent departure flights. It is the global structural requirements that make the crew scheduling problem a challenge.

3 Reduction of Problem Size: Airport Fragmentation

Prior to developing our artificial neural network method, we will describe a technique to reduce the size of the problem, based on the local flight structure at each airport. With the waiting time between an arriving flight i and a departing flight j defined as

$$ t^{(w)}_{ij} = \left( t^{(dep)}_j - t^{(arr)}_i \right) \bmod \text{period}, \tag{3.1} $$
the total waiting time for a given problem can change only by an integer times the period. By demanding a minimal waiting time, the local problem at an airport typically can be split up into independent subproblems, where each contains a subset of the arrivals and an equally large subset of the departures. For example, with A/D denoting arrival/departure, the time ordering [AADDAD] can, if minimal waiting time is demanded, be divided into two subproblems: [AADD] and [AD]. Some of these are trivial ([AD]), forcing the crew of an arrival to continue to a particular departure. The minimal total wait time for the local problems is straightforward to compute and will be denoted by $T^{wait}_{min}$.

Similarly, by demanding a solution (assuming one exists) with $T^{wait}_{min}$ also for the constrained global problem, this can be reduced as follows (a minimal code sketch of this reduction appears below):

• Airport fragmentation. Divide each airport into effective airports corresponding to the nontrivial local subproblems.
• Flight clustering. Join every forced sequence of flights into one effective
composite flight, which will thus represent more than one leg and have a formal duration defined as the sum of the durations of its legs and the waiting times between them.

The reduced problem thus obtained differs from the original problem only in an essential reduction of the suboptimal part of the solution space; the part with minimal waiting time is unaffected by the reduction. The resulting information gain, taken as the natural logarithm of the decrease in the number of possible configurations, empirically seems to scale approximately like 1.5 times (number of flights), and ranges from 100 to 2000 for the problems probed.

The reduced problem may in most cases be further separated into a set of independent subproblems, which can be solved one by one. Some of the composite flights will formally arrive at the same effective airport they started from. This does not pose a problem. Indeed, if the airport in question is the hub, such a single flight constitutes a separate (trivial) subproblem, representing an entire forced rotation. Typically, one of the subproblems will be much larger than the rest and will be referred to as the kernel problem; the remaining subproblems will be essentially trivial. In the formalism below, we allow for the possibility that the problem to be solved has been reduced as described above, which means that flights may be composite.

4 Potts Encoding

A naive way to encode the crew scheduling problem would be to introduce Potts spins in analogy with what was done by Gislén, Söderberg, and Peterson (1989, 1992), where each event (lecture) is mapped onto a resource unit (lecture room, time slot). This would require a Potts spin for each flight to handle the mapping onto crews. Since the problem consists of linking together sequences of (composite) flights such that proper rotations are formed, starting and ending at the hub, it appears more natural to choose an encoding where each flight i is mapped, via a Potts spin, onto the flight j to follow it in the rotation:

$$ s_{ij} = \begin{cases} 1 & \text{if flight } i \text{ precedes flight } j \text{ in a rotation} \\ 0 & \text{otherwise} \end{cases} $$

where it is understood that j is restricted to depart from the (effective) airport where i arrives. In order to ensure that proper rotations are formed, each flight has to be mapped onto precisely one other flight. This restriction is inherent in the Potts spin, defined to have precisely one component “on”:

$$ \sum_j s_{ij} = 1. \tag{4.1} $$
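To make the reduction step of section 3 concrete, here is a minimal Python sketch (ours, not from the paper; the name split_airport is hypothetical). It splits an airport's time-ordered arrival/departure string into independent local subproblems by cutting wherever the numbers of arrivals and departures seen so far balance, on the assumption that demanding minimal waiting time forbids crew connections across such balance points:

    def split_airport(events):
        # events: time-ordered string of 'A' (arrival) and 'D' (departure)
        # at one airport within the weekly period.
        blocks, start, balance = [], 0, 0
        for i, e in enumerate(events):
            balance += 1 if e == 'A' else -1
            if balance == 0:  # no crew connection can cross this point
                blocks.append(events[start:i + 1])
                start = i + 1
        return blocks

    # split_airport("AADDAD") -> ["AADD", "AD"]; the trivial "AD" block is a
    # forced connection, which flight clustering joins into a composite flight.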
Figure 2: Expansion of the propagator $P_{ij}$ in terms of $s_{ij}$. A line represents a flight, and (•) a landing and take-off.

In order to start and terminate rotations, two auxiliary flights are intro-
duced at the hub: a dummy arrival a, forced to be mapped onto every proper hub departure, and a dummy departure b, to which every proper hub arrival must be mapped. No other mappings are allowed at the hub. Thus, a acts as a source and b as a sink of rotations; every rotation starts with a and terminates with b. Both have zero duration and leg count, as well as associated wait times.

Global topological properties, leg counts, and durations of rotations cannot be described in a simple way by polynomial functions of the spins. Instead, they are conveniently handled by means of a propagator matrix, P, derived from the Potts spin matrix s (including a, b components) as

$$ P_{ij} = \left[ (1 - s)^{-1} \right]_{ij} = \delta_{ij} + s_{ij} + \sum_{k} s_{ik} s_{kj} + \sum_{kl} s_{ik} s_{kl} s_{lj} + \sum_{klm} s_{ik} s_{kl} s_{lm} s_{mj} + \cdots \tag{4.2} $$
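Numerically, the propagator is just a matrix inverse, and the path-counting interpretation of equation 4.2 is easy to check. A minimal sketch (ours; assumes NumPy) on a toy three-flight chain:

    import numpy as np

    def propagator(s):
        # P = (1 - s)^{-1}; P[i, j] counts connecting paths from flight i to j.
        # The inverse exists (and the series in equation 4.2 converges) only
        # while the mapping s contains no closed loops.
        return np.linalg.inv(np.eye(len(s)) - s)

    s = np.zeros((3, 3))
    s[0, 1] = s[1, 2] = 1.0      # toy chain: flight 0 -> 1 -> 2
    P = propagator(s)
    assert abs(P[0, 2] - 1.0) < 1e-12            # exactly one path from 0 to 2
    assert abs(np.linalg.det(P) - 1.0) < 1e-12   # loop-free rotations: det P = 1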
A pictorial expansion of the propagator is shown in Figure 2. The interpretation is obvious: $P_{ij}$ counts the number of connecting paths from flight i to j. Similarly, an element of the matrix square of P,

$$ \sum_k P_{ik} P_{kj} = \delta_{ij} + 2 s_{ij} + 3 \sum_k s_{ik} s_{kj} + \cdots, \tag{4.3} $$

counts the total number of (composite) legs in the connecting paths, while the number of proper legs is given by

$$ \sum_k P_{ik} L_k P_{kj} = L_i \delta_{ij} + s_{ij} \left( L_i + L_j \right) + \sum_k s_{ik} s_{kj} \left( L_i + L_k + L_j \right) + \cdots, \tag{4.4} $$

where $L_k$ is the intrinsic number of single legs in the composite flight k. Thus, $L_{ij} \equiv \sum_k P_{ik} L_k P_{kj} / P_{ij}$ gives the average leg count of the connecting paths. Similarly, the average duration (flight plus waiting time) of the paths from i to j amounts to

$$ T_{ij} \equiv \frac{\sum_k P_{ik}\, t^{(f)}_k P_{kj} + \sum_{kl} P_{ik}\, t^{(w)}_{kl} s_{kl} P_{lj}}{P_{ij}}, \tag{4.5} $$
where $t^{(f)}_i$ denotes the duration of the composite flight i, including the embedded waiting time. Furthermore, any closed loop (such as obtained if two
flights are mapped onto each other) will make P singular; for a proper set of rotations, det P = 1.

The problem can now be formulated as follows. Minimize $T_{ab}$ subject to the constraints:¹

$$ L_{ib} \le L_{max} \tag{4.6} $$
$$ T_{ib} \le T_{max} \tag{4.7} $$
for every departure i at the hub.

5 Mean Field Approach

We use a mean field (MF) annealing approach in the search for the global minimum. The discrete Potts variables, $s_{ij}$, are replaced by continuous MF Potts neurons, $v_{ij}$. They represent thermal averages $\langle s_{ij} \rangle_T$, with T an artificial temperature to be slowly decreased (annealed), and have an obvious interpretation as probabilities (for flight i to be followed by j). The corresponding probabilistic propagator P will be defined as the matrix inverse of $1 - v$. The neurons are updated by iterating the MF equations

$$ v_{ij} = \frac{\exp(u_{ij}/T)}{\sum_k \exp(u_{ik}/T)} \tag{5.1} $$
for one flight i at a time, by first zeroing the ith row of v,² and then computing the relevant local fields $u_{ij}$ entering equation 5.1 as

$$ u_{ij} = -c_1 t^{(w)}_{ij} - c_2 P_{ji} - c_3 \sum_k v_{kj} - c_4 \Psi\!\left( T^{(ij)}_{rot} - T_{max} \right) - c_5 \Psi\!\left( L^{(ij)}_{rot} - L_{max} \right), \tag{5.2} $$

where j is restricted to be a possible continuation flight to i. In the first term, $t^{(w)}_{ij}$ is the waiting time between flight i and j. The second term suppresses closed loops, and the third term is a soft exclusion term, penalizing solutions where two flights point to the same next flight. In the fourth and fifth terms, $L^{(ij)}_{rot}$ stands for the total leg count and $T^{(ij)}_{rot}$ for the duration of the rotation if i were to be mapped onto j, and they amount to

$$ L^{(ij)}_{rot} = L_{ai} + L_{jb} \tag{5.3} $$
$$ T^{(ij)}_{rot} = T_{ai} + t^{(w)}_{ij} + T_{jb}. \tag{5.4} $$

¹ Minimizing this minimizes the total wait time, since the total flight time is fixed.
² To eliminate self-couplings.
The penalty function Ψ, used to enforce the inequality constraints (Ohlsson et al., 1993; Tank & Hopfield, 1986), is defined by $\Psi(x) = x\,\Theta(x)$, where Θ is the Heaviside step function. Normally, the local fields $u_{ij}$ are derived from a suitable energy function; however, for reasons of simplicity, some of the terms in equation 5.2 are chosen in a more pragmatic way.

After an initial computation of the propagator P from scratch, it is subsequently updated according to the Sherman-Morrison algorithm for incremental matrix inversion (Press, Flannery, Teukolsky, & Vetterling, 1986). An update of the ith row of v, $v_{ij} \to v_{ij} + \delta_j$, generates precisely the following change in the propagator P:

$$ P_{kl} \to P_{kl} + \frac{P_{ki}\, z_l}{1 - z_i}, \tag{5.5} $$
$$ \text{where } z_l = \sum_j \delta_j P_{jl}. \tag{5.6} $$
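A minimal sketch of one serial neuron update (ours, not the authors' code; assumes NumPy, and that the local fields of equation 5.2 have already been assembled in u_i). Replacing row i of v by the softmax of equation 5.1 is a single rank-one change of 1 − v, so the propagator can be patched with equations 5.5 and 5.6 rather than re-inverted:

    import numpy as np

    def mf_row_update(v, P, u_i, i, T):
        # One mean field step for flight i; u_i holds the local fields u_ij
        # of equation 5.2, T is the annealing temperature.
        new = np.exp((u_i - u_i.max()) / T)   # shift by max for numerical stability
        new /= new.sum()                      # softmax of equation 5.1
        delta = new - v[i]                    # net change of row i (zeroing + refill)
        z = delta @ P                         # z_l = sum_j delta_j P_jl (eq. 5.6)
        P += np.outer(P[:, i], z) / (1.0 - z[i])   # eq. 5.5; requires z_i != 1
        v[i] = new

Because only one row of v changes per step, this rank-one patch is exact, not an approximation.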
Inverting the matrix from scratch would take $O(N^3)$ operations, while the (exact) scheme devised above requires only $O(N^2)$ per row. As the temperature goes to zero, a solution crystallizes in a winner-takes-all dynamics: for each flight i, the largest $u_{ij}$ determines the continuation flight j to be chosen.

6 Test Problems

In choosing test problems, our aim has been to maintain a reasonable degree of realism, while avoiding unnecessary complication and at the same time not limiting ourselves to a few real-world problems, where one can always tune parameters and procedures to get good performance. In order to accomplish this, we have analyzed two typical real-world template problems obtained from a major airline: one consisting of long-distance (LD) and the other of short- and medium-distance (SMD) flights. As can be seen in Figure 3, LD flight time distributions are centered around long times, with a small hump for shorter times representing local continuations of long flights. The SMD flight times have a more compact distribution. For each template we have made a distinct problem generator producing random problems resembling the template.

Figure 3: Flight time distributions in minutes for (a) LD and (b) SMD template problems.

A problem with a specified number of airports and flights is generated as follows. First, the distances (flight times) between airports are chosen randomly from a suitable distribution. Then a flight schedule is built up in the form of legal rotations starting and ending at the hub. For every new leg, the waiting time and the next airport are randomly chosen in a way designed to make the resulting problems statistically resemble the respective template problem.

Due to the excessive time consumption of the available exact methods (e.g., branch and bound), the performance of the Potts approach cannot be
tested against these—except for problems that are, in this context, ridiculously small, for which the Potts solution quality matches that of an exact algorithm. For artificial problems of a more realistic size, we circumvent this obstacle. Since problems are generated by producing a legal set of rotations, we add in the generator a final check that the implied solution yields $T^{wait}_{min}$; if not, a new problem is generated. Theoretically, this might introduce a bias in the problem ensemble; empirically, however, no problems have had to be redone. Also, the two real problems turn out to be solvable at $T^{wait}_{min}$.

Each problem is then reduced as described above (using a negligible amount of computer time), and the kernel problem is stored as a list of flights, with all traces of the generating rotations removed.

7 Results

We have tested the performance of the Potts MF approach for both LD and SMD kernel problems of varying sizes. As an annealing schedule for the serial updating of the MF equations (5.1 and 5.5), we have used $T_n = k T_{n-1}$ with k = 0.9. In principle, a proper value for the initial temperature $T_0$ can be estimated from linearizing the dynamics of the MF equations. We have chosen a more pragmatic approach: the initial temperature is assigned a tentative value of 1.0, which is dynamically adjusted based on the measured rate of change of the neurons until a proper $T_0$ is found. The values used for the coefficients $c_i$ in equation 5.2 are chosen in an equally simple and pragmatic way: $c_1 = 1/\text{period}$, $c_2 = c_3 = 1$, while $c_4 = 1/\langle T_{rot} \rangle$ and $c_5 = 1/\langle L_{rot} \rangle$, where $\langle T_{rot} \rangle$ is the average duration (based on $T^{wait}_{min}$) and $\langle L_{rot} \rangle$ the
Table 1: Average Performance of the Potts Algorithm for LD Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩
  75    5        23          8       0.0    0.0 sec
 100    5        50         17       0.0    0.2 sec
 150   10        55         17       0.0    0.3 sec
 200   10        99         29       0.0    1.3 sec
 225   15        84         26       0.0    0.7 sec
 300   15       154         46       0.0    3.4 sec

Notes: The superscript “eff” refers to the kernel problem; subscripts “f” and “a” refer to flight and airport, respectively. The averages are taken over ten different problems for each $N_f$. The performance is measured as the difference between the waiting time in the Potts and the local solutions, divided by the period. The CPU time refers to a DEC Alpha 2000.
average leg count per rotation, both of which can be computed beforehand. It is worth stressing that these parameter settings have been used for the entire range of problem sizes probed.

When evaluating a solution obtained with the Potts approach, a check is done as to whether it is legal (if not, a simple postprocessor restores legality; this is only occasionally needed); then the solution quality is probed by measuring the excess waiting time R,

$$ R = \frac{T^{wait}_{Potts} - T^{wait}_{min}}{\text{period}}, \tag{7.1} $$
which is a nonnegative integer for a legal solution. For a given problem size, as given by the desired number of airports $N_a$ and flights $N_f$, a set of ten distinct problems is generated. Each problem is subsequently reduced, and the Potts algorithm is applied to the resulting kernel problem. The solutions are evaluated, and the average R for the set is computed. The results for a set of problem sizes ranging from $N_f \simeq 75$ to 1000 are shown in Tables 1 and 2; for the two real problems, see Table 3.

The results are quite impressive. The Potts algorithm has solved all problems, and with a very modest CPU time consumption, of which the major part goes into updating the P matrix. The sweep time scales like $(N_f^{eff})^3 \propto N_f^3$, with a small prefactor due to the fast method used (see equations 5.5 and 5.6). This should be multiplied by the number of sweeps needed—empirically between 30 and 40, independent of problem size.³

³ The minor apparent deviation from the expected scaling in Tables 1, 2, and 3 is due to an anomalous scaling of the Digital DXML library routines employed; the number of elementary operations does scale like $N_f^3$.
Table 2: Average Performance of the Potts Algorithm for SMD Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩
  600   40      280          64      0.0     19 sec
  675   45      327          72      0.0     35 sec
  700   35      370          83      0.0     56 sec
  750   50      414          87      0.0     90 sec
  800   40      441          91      0.0    164 sec
  900   45      535         101      0.0    390 sec
 1000   50      614         109      0.0    656 sec

Note: The averages are taken over ten different problems for each $N_f$. Same notation as in Table 1.
Table 3: Average Performance of the Potts Algorithm for Ten Runs on the Two Real Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩   Type
 189   15        71          24      0.0    0.6 sec      LD
 948   64       383          98      0.0    184 sec      SMD

Note: Same notation as in Table 1.
8 Summary

We have developed a mean field Potts approach for solving resource allocation problems with a nontrivial topology. The method is applied to airline crew scheduling problems resembling real-world situations. A novel key feature is the handling of global entities, sensitive to the dynamically changing “fuzzy” topology, by means of a propagator formalism. Another important ingredient is the problem size reduction achieved by airport fragmentation and flight clustering, narrowing down the solution space by removing much of the suboptimal part.

High-quality solutions are consistently found throughout a range of problem sizes without having to fine-tune the parameters, with a time consumption scaling as the cube of the problem size. The basic approach should be easy to adapt to other applications (e.g., communication routing).
References

Durbin, R., & Willshaw, D. (1987). An analog approach to the traveling salesman problem using an elastic net method. Nature, 326, 689.
Gislén, L., Söderberg, B., & Peterson, C. (1989). Teachers and classes with neural networks. International Journal of Neural Systems, 1, 167.
Gislén, L., Söderberg, B., & Peterson, C. (1992). Complex scheduling with Potts neural networks. Neural Computation, 4, 805.
Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141.
Ohlsson, M., Peterson, C., & Söderberg, B. (1993). Neural networks for optimization problems with inequality constraints—the knapsack problem. Neural Computation, 5, 331.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Tank, D. W., & Hopfield, J. J. (1986). Simple neural optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, CAS-33, 533.
Received May 20, 1996; accepted November 27, 1996.
Communicated by John Platt
Online Learning in Radial Basis Function Networks

Jason A. S. Freeman
Centre for Cognitive Science, University of Edinburgh, Edinburgh EH8 9LW, U.K.

David Saad
Department of Computer Science and Applied Mathematics, University of Aston, Birmingham B4 7ET, U.K.
An analytic investigation of the average case learning and generalization properties of radial basis function (RBF) networks is presented, utilizing online gradient descent as the learning rule. The analytic method employed allows both the calculation of generalization error and the examination of the internal dynamics of the network. The generalization error and internal dynamics are then used to examine the role of the learning rate and the specialization of the hidden units, which gives insight into decreasing the time required for training. The realizable and some overrealizable cases are studied in detail: the phase of learning in which the hidden units are unspecialized (symmetric phase) and the phase in which asymptotic convergence occurs are analyzed, and their typical properties found. Finally, simulations are performed that strongly confirm the analytic results.

1 Introduction

Several tools facilitate the analytic investigation of learning and generalization in supervised neural networks, such as the statistical physics methods (see Watkin, Rau, & Biehl, 1993, for a review), the Bayesian framework (MacKay, 1992), and the “probably approximately correct” (PAC) method (Haussler, 1994). These tools have principally been applied to simple networks, such as linear and boolean perceptrons, and various simplifications of the committee machine (see, for instance, Schwarze, 1993, and references therein). It has proved very difficult to obtain general results for the commonly used multilayer networks, such as the sigmoid multilayer perceptron (MLP) and the radial basis function (RBF) network.

Another approach, based on studying the dynamics of online gradient descent training scenarios, has been used by several authors (Heskes & Kappen, 1991; Leen & Orr, 1994; Amari, 1993) to examine the evolution of system parameters, primarily in the asymptotic regime. A similar approach, based on examining the dynamics of overlaps between characteristic sys-

Neural Computation 9, 1601–1622 (1997)
© 1997 Massachusetts Institute of Technology
tem vectors in online training scenarios, has been suggested recently (Saad & Solla, 1995a, 1995b) for investigating the learning dynamics in the soft committee machine (SCM) (Biehl & Schwarze, 1995). This approach provides a complete description of the learning process, formulated in terms of the overlaps between vectors in the system, and it can be easily extended to include general two-layer networks (Riegler & Biehl, 1995).

For RBFs, some analytic studies focus primarily on generalization error. In Freeman and Saad (1995a, 1995b), average case analyses are performed employing a Bayesian framework to study RBFs under a stochastic training paradigm. In Niyogi and Girosi (1994), a bound on generalization error is derived under the assumption that the training algorithm finds a globally optimal solution. Details of studies of RBFs from the perspective of the PAC framework can be found in Holden and Rayner (1995) and its references. These methods focus on a training scenario in which a model is trained on a fixed set of examples using a stochastic training method.

This article presents a method for analyzing the behavior of RBFs in an online learning scenario whereby network parameters are modified after each presentation of an example, which allows the calculation of generalization error as a function of a set of variables characterizing the properties of the adaptive parameters of the network. The dynamical evolution of these variables in the average case can be found, allowing not only the generalization ability but also the internal dynamics of the network, such as specialization of hidden units, to be analyzed. This tool has previously been applied to MLPs (Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995).

2 Training Paradigms for RBF Networks

RBF networks have been successfully employed over the years in many real-world tasks, providing a useful alternative to MLPs. Furthermore, the RBF is a universal approximator for continuous functions given a sufficient number of hidden units (Hartman, Keeler, & Kowalski, 1990). The RBF architecture consists of a two-layer fully connected network. The mapping performed by each hidden node represents a radially symmetric basis function; within this analysis, the basis functions are considered gaussian, and each is therefore parameterized by two quantities: a vector representing the position of the basis function center in input space and a scalar representing the width of the basis function. For simplicity, the output layer is taken to consist of a single node; this performs a linear combination of the hidden unit outputs.

There are two commonly utilized methods for training RBFs. One approach is to fix the parameters of the hidden layer (both the basis function centers and widths) using an unsupervised technique such as clustering, setting a center on each data point of the training set, or even picking random values (for a review, see Bishop, 1995). Only the hidden-to-output
weights are adaptable, which makes the problem linear in those weights. Although fast to train, this approach results in suboptimal networks, since the basis function centers are set to fixed, suboptimal values. The alternative is to adapt the hidden layer parameters—either just the center positions or both center positions and widths. This renders the problem nonlinear in the adaptable parameters, and hence requires an optimization technique, such as gradient descent, to estimate these parameters. The second approach is computationally more expensive but usually leads to greater accuracy of approximation. This article investigates the nonlinear approach, in which basis function centers are continuously modified to allow convergence to more optimal models.

There are two methods in use for gradient descent. In batch learning, one attempts to minimize the additive training error over the entire data set; adjustments to parameters are performed once the full training set has been presented. The alternative approach, examined here, is online learning, in which the adaptive parameters of the network are adjusted after each presentation of a new data point.¹ There has been a resurgence of interest analytically in the online method; technical difficulties caused by the variety of ways in which a training set of given size can be selected are avoided, so complicated techniques such as the replica method (Hertz, Krogh, & Palmer, 1989) are unnecessary.

3 Online Learning in RBF Networks

We examine a gradient descent online training scenario on a continuous error measure. The trained model (student) is an RBF network consisting of K basis functions. The center of student basis function (SBF) b is denoted by $m_b$, and the hidden-to-output weights of the student are represented by w. Training examples will consist of input-output pairs (ξ, ζ). The components of ξ are uncorrelated gaussian random variables of mean 0 and variance $\sigma_\xi^2$, while ζ is generated by applying ξ to a deterministic teacher RBF, but one in which the number M and the positions of the hidden units need not correspond to those of the student, which allows investigation of overrealizable and unrealizable cases.² The mapping implemented by the teacher is denoted by $f_T$ and that of the student by $f_S$. The hidden-to-output weights of the teacher are $w^0$, while the center of teacher basis function u is given by $n_u$. The vector of SBF responses to input vector ξ is represented by s(ξ), and those of the teacher are denoted by t(ξ). The overall functions computed by
¹ Obviously one may employ a method that is a compromise between the two extremes.
² This represents a general training scenario since, being universal approximators, RBF networks can approximate any continuous mapping to a desired degree.
the networks are therefore:³

$$ f_S(\xi) = \sum_{b=1}^{K} w_b \exp\left( -\frac{\|\xi - m_b\|^2}{2\sigma_B^2} \right) = w \cdot s(\xi) \tag{3.1} $$

$$ f_T(\xi) = \sum_{u=1}^{M} w^0_u \exp\left( -\frac{\|\xi - n_u\|^2}{2\sigma_B^2} \right) = w^0 \cdot t(\xi) \tag{3.2} $$
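In code, equations 3.1 and 3.2 are one short function. A minimal sketch (ours, not the paper's; the name rbf_forward is hypothetical, NumPy assumed) that also returns the hidden responses, since these are reused by the update rules of section 3.2:

    import numpy as np

    def rbf_forward(xi, centers, weights, sigma_B=1.0):
        # centers: (K, N) array of basis function centers; weights: (K,) vector.
        # Returns the network output w . s(xi) and the hidden responses s(xi).
        s = np.exp(-np.sum((xi - centers) ** 2, axis=1) / (2.0 * sigma_B ** 2))
        return weights @ s, s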
N will denote the dimensionality of input space and P the number of examples presented. The centers of the basis functions (input-to-hidden weights) and the hidden-to-output weights are considered adjustable; for simplicity, the widths of the basis functions are fixed to a common value $\sigma_B$. The evolution of the centers of the basis functions is described in terms of the overlaps $Q_{bc} \equiv m_b \cdot m_c$, $R_{bu} \equiv m_b \cdot n_u$, and $T_{uv} \equiv n_u \cdot n_v$, where $T_{uv}$ is constant and describes characteristics of the task to be learned.

Previous work in this area (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995) has relied on the thermodynamic limit.⁴ This limit allows one to ignore fluctuations in the updates of the means of the overlaps due to the randomness of the training examples, and permits the difference equations of gradient descent to be considered as differential equations. The thermodynamic limit is hugely artificial for local RBFs; as the activation is localized, the N → ∞ limit implies that a basis function responds only in the vanishingly unlikely event that an input point falls exactly on its center; there is no obvious reasonable rescaling of the basis functions.⁵ The price paid for not taking this limit is that one has no a priori justification for ignoring the fluctuations in the update of the adaptive parameters due to the randomness of the training example. In this work, we calculate both the means and variances of the adaptive parameters, showing that the fluctuations are practically negligible (see section 5).

3.1 Calculating the Generalization Error. Generalization error measures the average dissimilarity over input space between the desired mapping $f_T$
µ exp −
kξ − mb k2
¶
2NσB2
eliminates all directional information as the cross-term ξ · mb vanishes in the thermodynamic limit.
and that implemented by the learning model $f_S$. This dissimilarity is taken as quadratic deviation:

$$ E_G = \frac{1}{2} \left\langle \left[ f_S - f_T \right]^2 \right\rangle, \tag{3.3} $$

where ⟨···⟩ denotes an average over input space with respect to the measure p(ξ). Substituting the definitions of equations 3.1 and 3.2 into this leads to:

$$ E_G = \frac{1}{2} \left\{ \sum_{bc} w_b w_c \langle s_b s_c \rangle + \sum_{uv} w^0_u w^0_v \langle t_u t_v \rangle - 2 \sum_{bu} w_b w^0_u \langle s_b t_u \rangle \right\}. \tag{3.4} $$

Since the input distribution is gaussian, the averages are gaussian integrals and can be performed analytically; the resulting expression for generalization error is given in the appendix. Each average has dependence on combinations of Q, R, and T depending on whether the averaged basis functions belong to student or teacher.

3.2 System Dynamics. Expressions for the time evolution of the overlaps Q and R can be derived by employing the gradient descent rule,

$$ m_b^{p+1} = m_b^p + \frac{\eta}{N\sigma_B^2}\, \delta_b^p \left( \xi - m_b^p \right), $$

where $\delta_b = (f_T - f_S) w_b s_b$ and η is the learning rate, which is explicitly scaled with 1/N:

$$ \langle \Delta Q_{bc} \rangle = \frac{\eta}{N\sigma_B^2} \left\langle \delta_b^p \left( \xi - m_b^p \right) \cdot m_c^p + \delta_c^p \left( \xi - m_c^p \right) \cdot m_b^p \right\rangle + \left( \frac{\eta}{N\sigma_B^2} \right)^2 \left\langle \delta_b^p \delta_c^p \left( \xi - m_b^p \right) \cdot \left( \xi - m_c^p \right) \right\rangle \tag{3.5} $$

$$ \langle \Delta R_{bu} \rangle = \frac{\eta}{N\sigma_B^2} \left\langle \delta_b^p \left( \xi - m_b^p \right) \cdot n_u \right\rangle. \tag{3.6} $$
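The paper evaluates the gaussian averages in equations 3.3 and 3.4 analytically (see the appendix); as a numerical sanity check, $E_G$ can also be estimated by Monte Carlo sampling of gaussian inputs. A sketch (ours), reusing the rbf_forward helper above:

    import numpy as np

    def generalization_error_mc(student, teacher, n_samples=10000,
                                N=8, sigma_xi=1.0, seed=0):
        # Estimates E_G = 0.5 <(f_S - f_T)^2> of equation 3.3 by sampling
        # inputs with i.i.d. gaussian components (mean 0, variance sigma_xi^2).
        rng = np.random.default_rng(seed)
        xs = rng.normal(0.0, sigma_xi, size=(n_samples, N))
        sq = [(student(x)[0] - teacher(x)[0]) ** 2 for x in xs]
        return 0.5 * float(np.mean(sq))

    # e.g. student = lambda x: rbf_forward(x, m, w)
    #      teacher = lambda x: rbf_forward(x, n, w0)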
The hidden-to-output weights can be treated similarly, but here the learning rate is scaled with 1/K, yielding:⁶

$$ \langle \Delta w_b \rangle = \frac{\eta}{K} \left\langle (f_T - f_S)\, s_b \right\rangle. \tag{3.7} $$
These averages are again gaussian integrals, so they can be carried out analytically. The averaged expressions for ΔQ, ΔR, and Δw are given in the appendix.

⁶ For simplicity we use the same learning rate for both the centers and the hidden-to-output weights, although different learning rates may be employed.
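For comparison with the iterated mean equations, the online scenario itself is straightforward to simulate. A sketch of a single update step (ours) under the stated scalings (η/(Nσ_B²) for the centers, η/K for the weights), again reusing the rbf_forward helper:

    import numpy as np

    def online_step(m, w, n, w0, eta=0.9, sigma_B=1.0, sigma_xi=1.0, rng=None):
        # m, w: student centers (K, N) and hidden-to-output weights (K,).
        # n, w0: teacher centers (M, N) and weights (M,).
        rng = rng or np.random.default_rng()
        N = m.shape[1]
        xi = rng.normal(0.0, sigma_xi, size=N)       # fresh example each step
        fS, s = rbf_forward(xi, m, w, sigma_B)
        fT, _ = rbf_forward(xi, n, w0, sigma_B)
        delta = (fT - fS) * w * s                    # delta_b = (f_T - f_S) w_b s_b
        m += (eta / (N * sigma_B ** 2)) * delta[:, None] * (xi - m)
        w += (eta / len(w)) * (fT - fS) * s          # learning rate scaled by 1/K
        return m, w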
By iterating equations 3.5, 3.6, and 3.7, the evolution of the learning process can be tracked. This allows one to examine facets of learning such as specialization of the hidden units. Since generalization error depends on Q, R, and w, one can also use these equations with equation 3.4 to track the evolution of generalization error.

4 Analysis of Learning Scenarios

4.1 The Evolution of the Learning Process. Solving the difference equations 3.5, 3.6, and 3.7 iteratively, one obtains solutions to the mean behavior of the overlaps and the weights. There are four distinct phases in the learning process, which are described with reference to an example of learning an exactly realizable task. The task consists of three SBFs learning a graded teacher of three teacher basis functions (TBFs), where graded implies that the square norms of the TBFs (diagonals of T) differ from one another; for this task, $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. In this demonstration the teacher is chosen to be uncorrelated, so the off-diagonals of T are 0, and the teacher hidden-to-output weights $w^0$ are set to 1. The learning process is illustrated in Figure 1. Figure 1a (solid curve) shows the evolution of generalization error, calculated from equation 3.4, while Figures 1b–d show the evolution of the equations for the means of R, Q, and w, respectively, calculated by iterating equations 3.5, 3.6, and 3.7 from random initial conditions sampled from the following uniform distributions: $Q_{bb}$ and $w_b$ are sampled from U[0, 0.1], while $Q_{bc, b \ne c}$ and $R_{bc}$ from U[0, 10⁻⁶]. These initial conditions will be used throughout the article and reflect random correlations expected by arbitrary initialization of large systems. Input dimensionality N = 8, learning rate η = 0.9, input variance $\sigma_\xi^2 = 1$, and basis function width $\sigma_B^2 = 1$ will be employed unless stated otherwise.

Figure 1: The exactly realizable scenario with positive TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. All teacher hidden-to-output weights are set to 1. (a) The evolution of the generalization error as a function of the number of examples for several different learning rates (η = 0.1, 0.9, 5.0). (b, c) The evolution of overlaps between student and teacher center vectors and among student center vectors, respectively. (d) The evolution of the mean hidden-to-output weights.

Initially, there is a short transient phase in which the overlaps and hidden-to-output weights evolve from their initial conditions until they reach an approximately steady value (P = 0 to P = 1000). The symmetric phase then begins, which is characterized by a plateau in the evolution of the generalization error (see Figure 1a, solid curve; P = 1000 to P = 7000), corresponding to a lack of differentiation among the hidden units; they are unspecialized and learn an average of the hidden units of the teacher, so that the student center vectors and hidden-to-output weights are similar (see Figures 1b–d). The difference in value between the overlaps R of student center vectors with teacher center vectors (see Figure 1b) is only due to the difference in the lengths of various teacher center vectors; if the overlaps were normalized, they would be identical. The symmetric phase is followed by a symmetry-breaking phase in which the SBFs learn to specialize and become differentiated from one another (P = 7000 to P = 20,000). Finally there is a long convergence phase, as the overlaps and hidden-to-output weights reach their asymptotic values. Since the task is realizable,
this phase is characterized by $E_G \to 0$ (see Figure 1a, solid curve) and by the student center vectors and hidden-to-output weights approaching those of the teacher (i.e., $Q_{00} = R_{00} = 0.5$, $Q_{11} = R_{11} = 1.0$, $Q_{22} = R_{22} = 1.5$, with the off-diagonal elements of both Q and R being zero; ∀b, $w_b = 1$).⁷ These phases are generic in that they are observed, sometimes with some variation such as a series of symmetric and symmetry-breaking phases, in every online learning scenario for RBFs so far examined. They also correspond to the phases found for MLPs (Saad & Solla, 1995b; Riegler & Biehl, 1995).

⁷ The arbitrary labels of the SBFs were permuted to match those of the teacher.
The formalism describes the evolution of the means (and the variances) from certain initial conditions. Convergence of the dynamics to suboptimal attractive fixed points (local minima) may occur if the starting point is within the corresponding basin of attraction. No local minima have been observed in our solutions, which may be an artifact of the system dimensionality.

4.2 The Role of the Learning Rate. With all the TBFs positive, analysis of the time evolution of the generalization error, overlaps, and hidden-to-output weights for various settings of the learning rate reveals the existence of three distinct behaviors. If η is chosen to be too small (here, η = 0.1), there is a long period in which there is no specialization of the SBFs and no improvement in generalization ability. The process becomes trapped in a symmetric subspace of solutions; this is the symmetric phase. Given asymmetry in the student initial conditions (in R, Q, or w) or of the task itself, this subspace will always be escaped, but the time period required may be prohibitively large (see Figure 1a, dotted curve). The length of the symmetric phase increases with the symmetry of the initial conditions.

At the other extreme, if η is set too large, an initial transient takes place quickly, but there comes a point from which the student vector norms grow extremely rapidly, until the point where, due to the finite variance of the input distribution and the local nature of the basis functions, the SBFs are no longer activated during training (see Figure 1a, dashed curve, with η = 5.0). In this case, the generalization error approaches a finite value as P → ∞, and the task is not solved.

Between these extremes lies a region in which the symmetric subspace is escaped quickly, and $E_G \to 0$ as P → ∞ for the realizable case (see Figure 1a, solid curve, with η = 0.9). The SBFs become specialized and, asymptotically, the teacher is emulated exactly. These results for the learning rate are qualitatively similar to those found for SCMs and MLPs (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995).

4.3 Task Dependence. The symmetric phase depends on the symmetry of the task as well as that of the initial conditions. One would expect a shorter symmetric phase in inherently asymmetric tasks. To examine this, a task similar to that of section 4.1 was employed, with the single change being that the sign of one of the teacher hidden-to-output weights was flipped, thus providing two categories of targets: positive and negative. The initial conditions of the student remained the same as in the previous task, with η = 0.9.

Figure 2: The exactly realizable scenario defined by a teacher network with a mixture of positive and negative TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. $w^0_0 = 1$, $w^0_1 = -1$, $w^0_2 = 1$. (a) The evolution of the generalization error for this case and, for comparison, the evolution in the case of all positive TBFs. (b) The evolution of the overlaps between student and teacher centers R.

The evolution of generalization error and the overlaps for this task is shown in Figure 2. The dividing of the targets into two categories effectively eliminates the symmetric phase; this can be seen by comparing the evolution of the generalization error for this task (see Figure 2a, dashed curve) with that for the previous task (see Figure 2a, solid curve). There is no longer a plateau in the generalization error. Correspondingly, the symmetries be-
Online Learning
1609
0.0040
1.5
Positive Targets Pos/Neg Targets
0.0030
1.0
Eg
0.5
0.0020
R 0.0
0.0010
R00 R 01 R02
-0.5
0.0000
0
10000
20000
(a)
P
30000
40000
50000
-1.0
0
10000
20000
P
R10 R 11 R 12 30000
R20 R21 R22 40000
50000
(b)
Figure 2: The exactly realizable scenario defined by a teacher network with a mixture of positive and negative TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with T00 = 0.5, T11 = 1.0, and T22 = 1.5. w00 = 1, w01 = −1, w02 = 1. (a) The evolution of the generalization error for this case and, for comparison, the evolution in the case of all positive TBFs. (b) The evolution of the overlaps between student and teacher centers R.
tween SBFs break immediately, as can be seen by examining the overlaps between student and teacher center vectors (see Figure 2b); this should be compared with figure 1b, which denotes the evolution of the overlaps in the previous task. Note that the plateaus in the overlaps (see Figure 1b, P = 1000 to P = 7000) are not found for the antisymmetric task. The elimination of the symmetric phase is an extreme result caused by the extremely asymmetric teacher. For networks with many hidden units, one can find a cascade of subsymmetric phases, each shorter than the single symmetric phase in the corresponding task with only positive targets, in which there is one symmetry between the hidden units seeking positive targets and another between those seeking negative targets. This suggests a simple and easily implemented strategy for increasing the speed of learning when targets are predominantly positive (negative): eliminate the bias of the training set by subtracting (adding) the mean target from each target point. This corresponds to an old heuristic among RBF practitioners. It follows that the hidden-to-output weights should be initialized from a zero-mean distribution. 4.4 The Overrealizable Case. In real-world problems, the exact form of the data-generating mechanism is rarely known. This leads to the possibility that the student may be overly powerful, in that it is capable of fitting surfaces more complicated than that of the true teacher. It is important to gain insight into how architectures will respond given such a scenario
1610
Jason A. S. Freeman and David Saad
0.0020
2.0
Exactly Realizable, Task 1 Over-Realizable, Task 1 Exactly Realizable, Task 2 Over-realizable, Task 2
0.0015
1.5
Eg
R
0.0010
1.0
0.0005
0.5
0.0000
0
10000
20000
P
30000
40000
R01 R10 R22 R40
0.0
50000
0
10000
(a)
20000
P
30000
40000
50000
(b) 1.0
0.5
W 0.0
W0 W1 W2 W3 W4
-0.5
-1.0
0
10000
20000
P
30000
40000
50000
(c)
Figure 3: The overrealizable scenario. (a) The evolution of the generalization error in two tasks; each task is learned by a well-matched student (exactly realizable) and an overly powerful student (overrealizable). (b, c) The evolution of the overlaps R and the hidden-to-output weights w for the overrealizable case in the second task, in which the teacher RBF includes a mixture of positive and negative hidden-to-output weights. In this scenario, five SBFs learn a graded, uncorrelated teacher of three TBFs with T00 = 0.5, T11 = 1.0, and T22 = 1.5. w00 = 1, w01 = −1, w02 = 1.
in order to be confident that they can be used successfully when the true teacher is unknown. Intuitively, one might expect that a student that is well matched to the teacher will learn faster than one that is overly powerful. Figure 3a shows two tasks, each of which compares the overrealizable scenario with the well-matched case. The first task, consisting of three TBFs, is identical to that detailed in section 4.1, and hence has only positive targets. The performance of a well-matched student of three SBFs is compared with an overrealizable scenario in which five SBFs learn the three TBFs. Comparison of the evolution of generalization error between these learning scenarios
Online Learning
1611
is shown in Figure 3a; the solid curve represents the well-matched scenario, and the dot-dash curve illustrates the overrealizable scenario. The length of the symmetric phase is significantly increased with the overly powerful student. The length of the convergence phase is also increased. An analytical treatment of these effects as well as the overrealizable scenario generally is given elsewhere (Freeman & Saad, in press). The second task deals with the alternative scenario in which one TBF has a negative hidden-to-output weight; the task is identical to that defined in section 4.3, and the student initial conditions are again as specified in section 4.1. In Figure 3a the evolution of generalization error for both the overrealizable scenario (dashed curve) in which five SBFs learn three TBFs, and the corresponding well-matched case in which three SBFs learn three TBFs (dotted curve), is shown. There is no well-defined symmetric case due to the inherent asymmetry of the task. The convergence phase is again greatly increased in length; this appears to be a general feature of the overrealizable scenario. Given that the student is overly powerful, there appear to be, a priori, several remedies available to the student: eliminate the excess nodes, form cancellation pairs (in which two students exactly cancel one another), or devise more complicated fitting schemes. To examine the actual responses of the student, the evolution of the overlaps between student and teacher and of the hidden-to-output weights for the particular scenario described by the second trial detailed is presented in Figures 3b and 3c, respectively. Looking first at Figure 3c, it is apparent that w3 approaches zero (short-dashed curve), indicating that SBF 3 is entirely eliminated during training. Thus four SBFs remain to emulate three TBFs. The negative TBF 1 is exactly emulated by SBF 0, as T11 = 1, w01 = −1, and R01 = 1, w0 = −1 (solid curve on both Figures 3b and 3c), while, similarly, SBF 2 exactly emulates TBF 2 (long-dashed curve, both figures). This leaves SBF 1 and SBF 4 to emulate TBF 0. Looking at Figure 3c, dotted and dot-dash curves, both student hidden-to-output weights approach 0.5, exactly half that of the hidden-to-output weight of TBF 0; looking at Figure 3b, both SBFs have 0.5 overlap with TBF 0. This indicates that the sum of both students emulates TBF 0. Thus, elimination and fitting involving the noncancelling combination of nodes were found; in these trials and many others, no pairwise cancellation was found. One presumes that this could be induced by very careful selection of the initial conditions but that it is not found under normal circumstances. 4.5 Analysis of the Symmetric Phase. The symmetric phase, in which there is no specialization of the hidden units, can be analyzed in the realizable case by employing a few simplifying assumptions. It is a phenomenon that is predominantly associated with small η, so terms of η2 are neglected. The hidden-to-output weights are clamped to +1. The teacher is taken to be isotropic: TBF centers have identical norms of 1, each hav-
1612
Jason A. S. Freeman and David Saad
ing no overlap with the others; therefore Tuv = δuv . This has the result that the student norms Qbb are very similar in this phase, as are the studentstudent correlations, so Qbb ≡ Q and Qbc,b6=c ≡ C, where Q becomes the square norm of the SBFs, and C is the overlap between any two different SBFs. Following the geometric argument of Saad & Solla (1995b), in the symmetric phase, the SBF centers are confined to the subspace spanned by the TBF centers. Since Tuv = δuv , the SBF centers can be written in the orthonormal basis defined by the TBF centers, with the components being the overPM Rbu nu . Because the teacher is isotropic, the overlaps are laps R: mb = u=1 independent of both b and u and thus can be written in terms of a single parameter R. Further, this reduction to a single overlap parameter leads to Q = C = MR2 , so the evolution of the overlaps can be described as a single difference equation for R. The analytic solution of equations 3.5, 3.6, and 3.7 under these restrictions is still rather complicated. However, since we are primarily interested in large systems, that is, large K, we examine the dominant terms in the solution. Expanding in 1/K and discarding secondorder terms renders the system simple enough to solve analytically for the symmetric fixed point: R=
³
K 1+
1 σB2
−
σB2 exp
h³
1 2σB2
´
σB2 +1 σB2 +2
i´ .
(4.1)
The stability of the fixed point, and thus the breaking of the symmetric phase, can be examined by an eigenvalue analysis of the dynamics of the system near the fixed point. The method employed is similar to that detailed in Saad and Solla (1995b) and is presented elsewhere (Freeman & Saad, in press). The dominant eigenvalue (λ1 > 0) scales with K and represents a perturbation that breaks the symmetries between the hidden units; the remaining modes λi6=1 < 0, which also scale with K, are irrelevant because they preserve the symmetry. This result is in contrast to that for the SCM (Saad & Solla, 1995b), in which the dominant eigenvalue scales with 1/K. This implies that for RBFs, the more hidden units in the network, the faster the symmetric phase is escaped, resulting in negligible symmetric phases for large systems, while in SCMs the opposite is true. This difference is caused by the contrast between the localized nature of the basis function in the RBF network and the global nature of sigmoidal hidden nodes in SCM. In the SCM case, small perturbations around the symmetric fixed point result in relatively small changes in error since the sigmoidal response changes very slowly as one modifies the weight vectors. On the other hand, the gaussian response decays exponentially as one moves away from the center, so small perturbations around the symmetric fixed point result in massive changes that drive the symmetry breaking. When K increases, the
Online Learning
1613
error surface looks very rugged, emphasizing the peaks and increasing this effect, in contrast to the SCM case, where more sigmoids means a smoother error surface. This does not mean that the symmetric phase can be ignored for realistically sized networks, however. Even with a teacher that is not particularly symmetric, this phase can play a significant role in the learning dynamics. To demonstrate this, a teacher RBF of 10 hidden units with N = 5 was constructed with the teacher centers generated from a gaussian distribution N [0, 0.5]. Note that this teacher must be correlated because the number of centers is larger than the input dimension. A student network, also of 10 hidden units, was constructed with all weights initialized from N [0, 0.05]. The networks were then mapped into the corresponding overlaps, and the learning process was run with η = 0.1. The evolution of generalization error is shown in Figure 4d: the symmetric phase, extending here from P = 2000 to P = 15,000, is a prominent phenomenon of the learning dynamics. It is not merely an artifact of a highly symmetric teacher configuration (the teacher was random and correlated) or of a specially chosen set of initial conditions, as the student was initialized with realistic initial conditions before being mapped into overlaps.
4.6 Analysis of the Convergence Phase. To gain insight into the convergence of the online gradient descent process in a realizable scenario, a similar simplified learning scenario to that used in the symmetric phase analysis was employed. The hidden-to-output weights are again fixed to +1, and the teacher is defined by Tuv = δuv . The scenario can be extended to adaptable hidden-to-output weights (this is presented in Freeman & Saad, in press, along with more mathematical detail). As in the symmetric phase, the fact that Tuv = δuv allows the system to be reduced to four adaptive quantities: Q ≡ Qbb , C ≡ Qbc,b6=c , R ≡ Rbb , and S ≡ Rbc,b6=c . Linearizing this system about the known fixed point of the dynamics, Q = 1, C = 0, R = 1, S = 0, yields an equation of the form 1x = Ax, where x = {1 − R, 1 − Q, S, C} is the vector of deviations from the fixed point. The eigenvalues of the matrix A control the converging system; these are presented in Figure 4a for K = 10. In every case examined, there is a single critical eigenvalue λc that controls the stability and convergence rate of the system (shown in bold), a nonlinear subcritical eigenvalue, and two subcritical linear eigenvalues. The value of η at λc = 0 determines the maximum learning rate for convergence to occur; for λc > 0 the fixed point is unstable. The convergence of the overlaps is controlled by the critical eigenvalue; therefore, the value of η at the single minimum of λc determines the optimal learning rate (ηopt ) in terms of the fastest convergence of the system to the fixed point. Examining ηc and ηopt as a function of K (see Figure 4b), one finds that both quantities scale as 1/K; the maximum and optimal learning rates are
1614
Jason A. S. Freeman and David Saad
4.0
Maximum and Optimal Learning Rates
0.000
Eigenvalues for the Asymptotic Phase
-0.005
3.0
η
λ
ηc
2.0
-0.010
-0.020 0.0
η opt
1.0
-0.015
1.0
2.0
3.0
η
4.0
0.0 0.10
0.08
0.06
0.04
0.02
0.00
1/K
(a)
(b)
20.0
0.030
Maximum Learning Rate versus Basis Function Width
16.0
Generalization Error for a Realistic Learning Scenario
0.025 0.020
Eg
12.0
ηc
σξ2
0.015
8.0 0.010 4.0
0.0 0.0
0.005
1.0
2.0
σB2
(c)
3.0
4.0
5.0
0.000
0
20000
40000
60000
P
(d)
Figure 4: Convergence and symmetric phases. (a) The eigenvalues controlling the dynamics of the system for the convergence phase (detailed in section 4.6), linearized about the asymptotic fixed point in the realizable case, as a function of η. The critical eigenvalue is shown in bold. (b) The maximum and optimal convergence phase learning rates, found from the critical eigenvalue; these quantities scale as 1/K. (c) The maximum convergence phase learning rate as a function of basis function width. (d) The evolution of generalization error for a realistically sized learning scenario (described in section 4.5), demonstrating that the symmetric phase can play a significant role, even with a correlated, asymmetric teacher.
inversely proportional to the number of hidden units of the student. Numerically, the ratio of ηopt to ηc is approximately two-thirds. Finally, the relationship between basis function width and ηc is plotted in Figure 4c. When the widths are small, ηc is very large as it becomes unlikely that a training point will activate any of the basis functions. For σB2 > σξ2 , ηc ∼ 1/σB2 .
Online Learning
1615
5 Quantifying the Variances Because the thermodynamic limit is not employed, it is necessary to quantify the variances in the adaptive parameters to justify considering only the mean updates.8 By making assumptions as to the form of these variances, it is possible to derive equations describing their evolution. Specifically, it is assumed that each update function and parameter being updated can be written in terms of a mean and fluctuation; for instance, applying this to Qbc : q ηd [ (5.1) 1Qbc = 1Qbc + 1Q bc Qbc = Qbc + N Qbc , d where hQbc i denotes an average value and Q bc represents a fluctuation due to the randomness of the example. Combining these equations and averaging with respect to the input distribution results in a set of difference equations describing the evolution of the variances of the overlaps and hidden-to-output weights (similar to Riegler & Biehl, 1995) as training proceeds. Details of the method can be found in Heskes and Kappen (1991) and in Barber, Saad, and Sollich (1996) for the SCM. It has been shown that the variances vanish in the thermodynamic limit for realizable cases (Barber et al., 1996; Heskes & Kappen, 1991). (A detailed description of the calculation of the variances as applied to RBFs appears in Freeman & Saad, in press.) Figure 5 shows the evolution of the variances, as error bars on the mean, for the dominant overlaps and the hidden-to-output weights using η = 0.9, N = 10 on a task identical to that described in section 4.1. Examining the dominant overlaps R first (see Figure 5a), the variances follow the same pattern for each overlap but at different values of P. The variances begin at 0, then increase, peaking at the symmetry-breaking point at which the SBF begins to specialize on a particular TBF; then they decrease to 0 again as convergence occurs. Looking at each SBF in turn, for SBF 2 (dashed curve), the overlap begins to specialize at approximately P = 2000, where the variance peak occurs; for SBF 0 (solid curve), the symmetry lasts until P = 10,000, again where the variance peak occurs, and for SBF 1 (dotted curve), the symmetry breaks later at approximately P = 20,000, again where the peak of the variance occurs. The variances then dwindle to 0 for each SBF in the convergence phase. Essentially the same pattern occurs for the hidden-to-output weights (see Figure 5b). The variances increase rapidly until the hidden units begin 8 The hidden-to-output weights are not intrinsically self-averaging even in the thermodynamic limit, although they have been shown to be such for the MLP if the learning rate is scaled with N (Riegler & Biehl, 1995). If scaled differently, adiabatic elimination techniques may be employed to describe the evolution adequately (Riegler, personal communication, 1996).
1616
Jason A. S. Freeman and David Saad
1.1
2.0
R01 R22 R10
1.5
1.0
0.9
R
W
1.0
0.8
W0 W1
0.5 0.7
0.0
0
10000 20000 30000 40000 50000 60000
P
(a)
0.6
W2 0
10000 20000 30000 40000 50000 60000
P
(b)
Figure 5: Evolution of the variances of the overlaps R and hidden-to-output weights w are shown in (a) and (b), respectively. The curves denote the evolution of the means; the error bars show the evolution of the fluctuations about the mean. Input dimensionality N = 10, learning rate η = 0.9, input variance σξ2 = 1, and basis function width σB2 = 1.0.
to specialize, at which point the variances peak. This is followed by the variances’ decreasing to 0 as convergence occurs. For both overlaps and hidden-to-output weights, the mean is an order of magnitude larger than the standard deviation at the variance peak and is much more dominant elsewhere; the ratio becomes greater as N is increased. The magnitude of the variances is influenced by the degree of symmetry of the initial conditions of the student and the task in that the greater this symmetry is, the larger the variances. Discussion of this phenomenon can be found in Barber et al. (1996); it will be explored at greater length for RBFs in a future publication. 6 Simulations In order to confirm the validity of the analytic results, simulations were performed in which RBFs were trained using online gradient descent. The trajectories of the overlaps were calculated from the trajectories of the weight vectors of the network, and generalization error was estimated by finding the average error on a 1000-point test set. The procedure was performed 50 times and the results averaged, subject to permutation of the labels of the SBFs to ensure the average was meaningful. Typical results are shown in Figure 6. The example shown is for an exactly realizable system of three SBFs and three TBFs at N = 5, η = 0.9. Figure 6a shows the correspondence between empirical test error and theoretical generalization error. At all times, the theoretical result is within
Online Learning
1617
0.0020
1.5
Theoretical Empirical
0.0015
1.0
R
Eg 0.0010
0.5
0.0005
0.0
0.0000
-0.5
Empirical
Theoretical 0
2000
4000
P
6000
8000
10000
0
2000
(a)
P
6000
8000
10000
(b)
1.5
1.2
1.0
1.0
Q
W
0.5
0.8
0.0
0.6
0
2000
4000
(c)
P
Theoretical
Empirical
Theoretical -0.5
4000
6000
8000
10000
0.4
0
2000
4000
P
6000
Empirical 8000
10000
(d)
Figure 6: Comparison of theoretical results with simulations. The simulation results are averaged over 50 trials; the labels of the student hidden units were permuted where necessary to make the averages meaningful. Empirical generalization error was approximated with the test error on a 1000-point test set (a). Error bars on the simulations are at most the size of the larger asterisks for the overlaps (b, c), and at most twice this size for the hidden-to-output weights (d). Input dimensionality N = 5, learning rate η = 0.9, input variance σξ2 = 1, and basis function width σB2 = 1.
one standard deviation of the empirical result. Figures 6b–d show the excellent correspondence between the trajectories of the theoretical overlaps and hidden-to-output weights and their empirical counterparts; the error bars on the simulation distributions are not shown because they are approximately the size of the symbols. The simulations demonstrate the validity of the theoretical results. In addition, we have found excellent correlation between the analytically calculated variances and those obtained from the simulations (this is explored further in Freeman & Saad, in press).
1618
Jason A. S. Freeman and David Saad
7 Conclusion Online learning, in which the adaptive parameters of the network are updated at each presentation of a data point, was examined for the RBF using gradient descent learning. The analytic method presented allows the calculation of the evolution of generalization error and the specialization of the hidden units. This method was used to elucidate the stages of training and the role of the learning rate. There are four stages of training: a short transitory phase in which the adaptive parameters move from the initial conditions to the symmetric phase; the symmetric phase itself, characterized by lack of differentiation among hidden units; a symmetry-breaking phase in which the hidden units become specialized; and a convergence phase in which the adaptive parameters reach their final values asymptotically. Three regimes were found for the learning rate: small, giving unnecessarily slow learning; intermediate, leading to fast escape from the symmetric phase and convergence to the correct target; and too large, which results in a divergence of SBF norms and failure to converge to the correct target. Examining the exactly realizable scenario, it was shown that employing both positive and negative targets leads to much faster symmetry breaking; this appears to be the underlying reason behind the neural network folklore that targets should be given zero mean. The overrealizable case was also studied, showing that overrealizability extends both the length of the symmetric phase and that of the convergence phase. The symmetric phase for realizable scenarios was analyzed and the value of the overlaps at the symmetric fixed point found. It was discovered that there is a significant difference between the behaviors of the RBF and SCM, in that increasing K speeds up the symmetry-breaking in RBFs, while it slows the process for SCMs. The convergence phase was also studied; both maximum and optimal learning rates were calculated and shown to scale as 1/K. The dependence of the maximum learning rate on the width of the basis functions was also examined, and, for σB2 > σξ2 , the maximum learning rate scales approximately as 1/σB2 . Finally, simulations were performed that strongly confirm the theoretical results. Future work includes the study of unrealizable cases, in which the learning rate must decay over time in order to find a stable solution, the study of the effects of noise and regularizers, the extension of the analysis of the convergence phase to fully adaptable hidden-to-output weights, and the use of the theory to aid real-world learning tasks, by, for instance, deliberately breaking the symmetries between SBFs in order to reduce drastically or even eliminate the symmetric phase.
Online Learning
1619
Appendix Generalization Error ( ) X X 1 X 0 0 0 EG = wb wc I2 (b, c) + wu wv I2 (u, v) − 2 wb wu I2 (b, u) (A.1) 2 bc uv bu 1Q, 1R, and 1w
i io h η n h J (b; c) − Q I (b) + w J (c; b) − Q I (c) w 2 c 2 b bc bc 2 2 NσB2 !2 Ã n η wb wc K4 (b, c) + Qbc I4 (b, c) + 2 NσB o (A.2) − J4 (b, c; b) − J4 (b, c; c)
h1Qbc i =
o n η wb J2 (b; u) − Rbu I2 (b) 2 NσB
h1Rbu i =
η I2 (b) K
h1wb i = I, J, and K I2 (b) =
(A.3)
X
(A.4)
w0u I2 (b, u) −
u
w0u J2 (b, u; c) −
u
I4 (b, c) =
wd I2 (b, d)
(A.5)
d
X
J2 (b; c) =
X
X
wd J2 (b, d; c)
(A.6)
d
X
wd we I4 (b, c, d, e) +
de
−2
X
X
w0u w0v I4 (b, c, u, v)
uv
wd w0u I4 (b, c, d, u)
(A.7)
du
J4 (b, c; f ) =
X
wd we J4 (b, c, d, e; f ) +
de
−2
X
X
w0u w0v J4 (b, c, u, v; f )
uv
wd w0u J4 (b, c, d, u; f )
(A.8)
du
K4 (b, c) =
X
wd we K4 (b, c, d, e) +
de
−2
X du
X
w0u w0v K4 (b, c, u, v)
uv
wd w0u K4 (b, c, d, u)
(A.9)
1620
Jason A. S. Freeman and David Saad
I, J, and K. In each case, only the quantity corresponding to averaging over SBFs is presented. Each quantity has similar counterparts in which TBFs are substituted for SBFs. For instance, I2 (b, c) = hsb sc i is presented, and I2 (u, v) = htu tv i and I2 (b, u) = hsb tu i are omitted. I2 (b, c) = (2l2 σξ2 )−N/2 " # −Qbb − Qcc + (Qbb + Qcc + 2Qbc )/2σB2 l2 × exp 2σB2 Ã
!
J2 (b, c; d) =
Qbd + Qcd 2l2 σB2
I4 (b, c, d, e) =
(2l4 σξ2 )−N/2 exp
I2 (b, c) "
"
(A.10)
(A.11)
−Qbb − Qcc − Qdd − Qee 2σB2
#
Qbb +Qcc +Qdd +Qee +2(Qbc +Qbd +Qbe +Qcd +Qce +Qde ) × exp 4l4 σB4
#
(A.12) Ã J4 (b, c, d, e; f ) = Ã K4 (b, c, d, e) =
Qbf + Qcf + Qdf + Qef 2l4 σB2
! I4 (b, c, d, e)
(A.13)
2Nl4 σB4 + Qbb + Qcc + Qdd + Qee 4l4 σB4
! 2(Qbc + Qbd + Qbe + Qcd + Qce + Qde ) + I4 (b, c, d, e) 4l24 σB4 (A.14) Other Quantities l2 = l4 =
2σξ2 + σB2 2σB2 σξ2 4σξ2 + σB2 2σB2 σξ2
(A.15) (A.16)
Acknowledgments We thank Ansgar West and David Barber for useful discussions, and the anonymous referees for their comments. D.S. thanks the Leverhulme Trust for its support (F/250/K).
Online Learning
1621
References Amari, S. (1993). Backpropagation and stochastic gradient descent learning. Neurocomputing, 5, 185–196. Barber, D., Saad, D., & Sollich, P. (1996). Finite size effects in online learning of multilayer neural networks. Euro. Phys. Lett., 34, 151–156. Biehl, M., & Schwarze, H. (1995). Learning by online gradient descent. J. Phys. A: Math. Gen., 28, 643. Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press. Freeman, J., & Saad, D. (1995a). Learning and generalisation in radial basis function networks. Neural Computation, 7, 1000–1020. Freeman, J., & Saad, D. (1995b). Radial basis function networks: Generalization in overrealizable and unrealizable scenarios. Neural Networks, 9, 1521– 1529. Freeman, J., & Saad, D. (in press). The dynamics of on-line learning in radial basis function networks. Phys. Rev. A. Hartman, E., Keeler, J., & Kowalski, J. (1990). Layered neural networks with gaussian hidden units as universal approximators. Neural Computation, 2, 210–215. Haussler, D. (1994). The probably approximately correct (PAC) and other learning models. In A. Meyrowitz & S. Chipman (Eds.), Foundations of knowledge acquisition: Machine learning (Chap. 9). Boston: Kluwer. Hertz, J., Krogh, A., & Palmer, R. (1989). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley. Heskes, T., & Kappen, B. (1991). Learning processes in neural networks. Phys. Rev. A., 44, 2718–2726. Holden, S., & Rayner, P. (1995). Generalization and PAC learning: Some new results for the class of generalized single-layer networks. IEEE Trans. on Neural Networks, 6(2), 368–380. Leen, T., & Orr, G. (1994). Optimal stochastic search and adaptive momentum. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, (6: 477–484). San Mateo, CA: Morgan Kaufmann. MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4, 415–447. Niyogi, P., & Girosi, F. (1994). On the relationship between generalization error, hypothesis complexity and sample complexity for radial basis functions (Tech. Rep.). Cambridge, MA: AI Laboratory, MIT. Riegler, P., & Biehl, M. (1995). On-line backpropagation in two-layered neural networks. J. Phys. A: Math. Gen., 28, L507–513. Saad, D., & Solla, S. (1995a). Exact solution for online learning in multilayer neural networks. Phys. Rev. Lett., 74, 4337–4340. Saad, D., & Solla, S. (1995b). On-line learning in soft committee machines. Phys. Rev. E., 52, 4225–4243.
1622
Jason A. S. Freeman and David Saad
Schwarze, H. (1993). Learning a rule in a multilayer neural network. J. Phys. A: Math. Gen., 26, 5781–5794. Watkin, T., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Reviews of Modern Physics, 65, 499–556.
Received March 15, 1996; accepted January 3, 1997.
ARTICLE
Communicated by Pietro Perona
Minimax Entropy Principle and Its Application to Texture Modeling Song Chun Zhu Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
Ying Nian Wu Department of Statistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
David Mumford Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
This article proposes a general theory and methodology, called the minimax entropy principle, for building statistical models for images (or signals) in a variety of applications. This principle consists of two parts. The first is the maximum entropy principle for feature binding (or fusion): for a given set of observed feature statistics, a distribution can be built to bind these feature statistics together by maximizing the entropy over all distributions that reproduce them. The second part is the minimum entropy principle for feature selection: among all plausible sets of feature statistics, we choose the set whose maximum entropy distribution has the minimum entropy. Computational and inferential issues in both parts are addressed; in particular, a feature pursuit procedure is proposed for approximately selecting the optimal set of features. The minimax entropy principle is then corrected by considering the sample variation in the observed feature statistics, and an information criterion for feature pursuit is derived. The minimax entropy principle is applied to texture modeling, where a novel Markov random field (MRF) model, called FRAME (filter, random field, and minimax entropy), is derived, and encouraging results are obtained in experiments on a variety of texture images. The relationship between our theory and the mechanisms of neural computation is also discussed.
1 Introduction This article proposes a general theory and methodology, the minimax entropy principle, for statistical modeling in a variety of applications. This section introduces the basic concepts of the minimax entropy principle after a discussion of the motivation of our theory and a brief review of some relevant theories and methods previously studied in the literature. Neural Computation 9, 1627–1660 (1997)
c 1997 Massachusetts Institute of Technology °
1628
Song Chun Zhu, Ying Nian Wu, and David Mumford
1.1 Motivation and Goal. In a variety of disciplines ranging from computational vision, pattern recognition, and image coding to psychophysics, an important theme is to pursue a probability model to characterize a set of images (or signals) I. This is often posed as a statistical inference problem: we assume that there exists a joint probability distribution (or density) f (I) over the image space; f (I) should concentrate on a subspace that corresponds to the ensemble of images in the application; and the objective is to estimate f (I) given a set of observed (or training) images. f (I) plays significant roles in the following areas: 1. Visual coding, where the goal is to take advantage of the regularity or redundancy in the input images to produce a compact coding scheme. This involves measuring the efficiency of coding schemes in terms of entropy (Watson, 1987; Barlow, Kaushal, & Mitchison, 1989), where the computation of the entropy and thus the choice of the optimal coding schemes depend on the estimation of f (I). For example, two kinds of coding schemes are compared in the recent work of Field (1994): the compact coding and the sparse coding. The former assumes gaussian distributions for f (I), whereas the latter assumes nongaussian ones. 2. Pattern recognition, neural networks, and statistical decision theory, where one often needs to find a probability model f (I) for each category of images of similar patterns. Thus, an accurate estimation of f (I) is a key factor for successful classification and recognition. 3. Computational vision, where f (I) is often adopted as a prior model in terms of Bayesian theory, and it provides a language for visual computation ranging from images segmentation to scene understanding (Zhu, 1996). 4. Texture modeling, where the objective is to estimate f (I) by a probability model p(I) for each set of texture images that have perceptually similar texture appearances. p(I) is important not only for texture analysis such as texture segmentation and classification, but also plays a role in texture synthesis since texture images can be synthesized by sampling p(I). Furthermore, the nature of the texture model helps us understand the mechanisms of human texture perception (Julesz, 1995). However, making inferences about f (I) is much more challenging than many of the learning problems in neural modeling (Dayan, Hinton, Neal, & Zernel, 1995; Xu, 1995) for the following reasons. First, the dimension of the image space is overwhelmingly large compared with the number of available training examples. In texture modeling, for instance, the size of images is often about 200 × 200 pixels, and thus f (I) is a function of 40, 000 variables, whereas we have access to only one or a few training images. This make it inappropriate to use nonparametric inference methods, such
Minimax Entropy Principle
1629
as kernel methods, radial basis functions (Ripley, 1996), and mixture of gaussian models (Jordan & Jacobs, 1994). Second, f (I) is often far from being gaussian; therefore some popular dimension-reduction techniques, such as the principal component analysis (Jolliffe, 1986) and spectral analysis (Priestley, 1981), do not appear to be directly applicable. As an illustration of the nongaussian property, Figure 1a shows the empirical marginal distribution (or histogram) of the intensity differences of horizontally adjacent pixels of some natural images (Zhu & Mumford, 1997). As a comparison, the gaussian distribution with the same mean and variance is plotted as a dashed curve in Figure 1a. Similar nongaussian properties are also observed in Field (1994). Another example is shown in Figure 1b, where the solid curve is the histogram of F ∗ I, with I being a texton image shown in Figure 8a, and F is a filter with the same texton (see section 4.5 for details). It is clear that the solid curve is far from being gaussian, and as a comparison, the dotted curve in Figure 1b is the histogram of F ∗ I, with I being a white noise image. The outliers in the histogram are perceptual features, not noise! 1.2 Previous Methods. A key issue in building a statistical model is the balance between generality and simplicity. The model should include rich structures to describe real-world images adequately and should be capable of modeling complexities due to high dimensionality and nongaussian property, and at the same time, it should be simple enough to be computationally feasible and give simple explanation to what we observe. To reduce complexity, it is often necessary to impose structures on the distribution. In the past, two main methods have been adopted in applications. The first method adopts some parametric Markov random field (MRF) models in the forms of Gibbs distributions—for example, the general smoothness models in image restoration (Geman & Geman, 1984; Mumford & Shah, 1989) and the conditional autoregression models in texture modeling (Besag, 1973; Cross & Jain, 1983). This method involves only a small number of parameters and thus constructs concise distributions for images. However, they do not achieve adequate generality for the following reasons. First, these MRF models can afford only small cliques; otherwise the number of parameters will explode. But these small cliques can hardly capture image features at relatively large scales. Second, the potential functions are of very limited and prespecified forms, whereas in practice it is often desirable for the forms of the distributions to be determined or learned from the observed images. The second method is widely used in visual coding and image reconstruction, where the high-dimensionality problem is avoided by representing the images with a relatively small set of feature statistics, and the latter are usually extracted by a set of well-selected filters. Examples of filters include the frequency and orientation selective Gabor filters (Daugman, 1985) and some wavelet pyramids based on various coding criteria (Mallat, 1989;
1630
Song Chun Zhu, Ying Nian Wu, and David Mumford
(a) 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 −15
−10
−5
0
5
10
15
(b) 0.12
0.1
Histogram H(F*I)
0.08
0.06
0.04
0.02
0
−200
−150
−100 −50 filter response F*I
0
50
Figure 1: (a) The histogram of intensity difference at adjacent pixels and gaussian curve (dashed) of same mean and variance in domain [−15, 15]. (b) Histogram of the filtered texton image (solid curve) and a filtered noise image (dotted curve).
Minimax Entropy Principle
1631
Simoncelli, Freeman, Adelson, & Weeger, 1992; Coifman & Wickerhauser, 1992; Donoho & Johnstone, 1994). The feature statistics extracted by a certain filter is usually the overall histogram of filtered images. These histograms are used for pattern classification, recognition, and visual coding (Watson, 1987; Donoho & Johnstone, 1994). Despite the excellent performances of this method, there are two major problems yet to be solved. The first is the feature binding or feature fusion problem: given a set of filters and their histograms, how to integrate them into a single probability distribution. This problem becomes much more difficult if the filters used are not all linear and are not independent of each other. The second problem is feature selection: for a given model complexity, how to choose a set of filters or features to characterize best the images being modeled. 1.3 Our Theory and Methodology. In this article, a minimax entropy principle is proposed for building statistical models, and it provides a new strategy to balance between model generality and model simplicity by two seemingly contrary criteria: maximizing entropy and minimizing entropy. (I). The Maximum Entropy Principle (Jaynes 1957). Without loss of generality, any image feature can be expressed as φ (α) (I), where φ (α) ( ) can be a vector-valued functions of the image intensities and α is the index of the features. The statistic of the feature φ (α) (I) is E f [φ (α) (I)], which is the expectation of φ (α) (I) with respect to f (I) and is estimated by the sample mean computed from the training images. Then, given a set of features S = {φ (α) , α = 1, 2, . . . , K}, a model p(I) is constructed such that it reproduces the feature statistics as observed; that is, Ep [φ (α) (I)] = E f [φ (α) (I)], for α = 1, 2, . . . , K. Among all model p(I) satisfying such constraints, the maximum entropy principle favors the simplest one in the sense that it has the maximum entropy. Since entropy is a measure of randomness, a maximum entropy (ME) model p(I) is considered as the simplest fusion or binding of the features and their statistics. (II). The Minimum Entropy Principle. The goodness of p(I) constructed in (I) is measured by KL( f, p), that is, the Kullback-Leibler divergence from f (I) to p(I) (Kullback & Leibler, 1951), and it depends on the feature set S that we selected. As we will show in the next section, KL( f, p) is, up to a constant, equal to the entropy of p(I). Thus, to estimate f (I) closely, we need to minimize the entropy of the ME distribution p(I) with respect to S, which often means that we should use as many features as possible to specify p(I). In this sense, a minimum entropy principle favors model generality. When the model complexity or the number of features K is limited, the minimum entropy principle provides a criterion for selecting the feature set S that best characterizes f (I). Computational procedures are proposed for parameter estimation and
1632
Song Chun Zhu, Ying Nian Wu, and David Mumford
feature selection. The minimax entropy principle is further studied in the presence of sample variation of feature statistics. As an example of application, the minimax entropy principle is applied to texture modeling, where the features are extracted by filters that are selected from a general filter bank, and the feature statistics are the empirical marginal distributions (usually further reduced to the histograms) of the filtered images. The resulting model, called FRAME (filters, random fields, and minimax entropy), is a new class of MRF model. Compared with previous MRF models, the FRAME model employs a much more enriched vocabulary and hence enjoys a much stronger descriptive ability, and at the same time, the model complexity is still under check. Texture images are synthesized by sampling the estimated models, and the correctness of estimated models is thus verified by checking whether the synthesized texture images have similar visual appearances to the observed images. The rest of the article is arranged as follows. Section 2 studies the minimax entropy principle, where algorithms are proposed for parameters estimation and feature selection. In Section 3 we study the minimax entropy principle in depth by correcting it in the presence of estimation error and addressing the issue of variance estimation in homogeneous random fields. Section 4 applies the minimax entropy principle to texture modeling. Section 5 concludes with a discussion of the texture model and the relationship between minimax entropy and neural modeling. 2 The Minimax Entropy Principle To fix notation, let I be an image defined on a domain D; for example, D can be an N × N lattice, for each point vE ∈ D, I(E v) ∈ L, and L is the range of image intensities. For a given application, we assume that there exists an underlying probability distribution (or density) f (I) defined on the image space L|D| , where |D| is the size of the image domain. Then the objective is to estimate f (I) based on a set of observed images {Iobs i , i = 1, . . . , M} sampled from f (I). 2.1 The Maximum Entropy Principle. At the initial stage of studying the regularity and variability of the observed images Iobs i , i = 1, 2, . . . , M, one often starts from exploring a finite set of essential features that are characteristic of the observations. Without loss of generality, such features are extracted by S = {φ (α) ( ), α = 1, 2, . . . , K}, where φ (α) (I) can be a vectorvalued function of the image intensities. The statistics of these features are estimated by the sample means,
µ(α) obs =
M 1 X φ (α) (Iobs i ), M i=1
for α = 1, . . . , K.
Minimax Entropy Principle
1633
If the large sample effect takes place (usually a necessary condition for modeling), then the sample averages {µ(α) obs , α = 1, . . . , K} make reasonable estimates for the expectations {E f [φ (α) (I)], α = 1, . . . , K}, where E f denotes the expectation with respect to f (I). We call {µ(α) obs , α = 1, . . . , K} the observed statistics and {E f [φ (α) (I)], α = 1, . . . , K} the expected statistics of f (I). To approximate f (I), a probability model p(I) is restricted to reproduce the observed statistics, that is, Ep [φ (α) (I)] = µ(α) obs for α = 1, . . . , K. Let (α) ÄS = {p(I): Ep [φ (α) (I)] = µ(α) ∈ S} obs , ∀φ
(2.1)
be the set of distributions that reproduce the observed statistics of feature set S; then we need to select a p(I) ∈ ÄS provided that ÄS 6= ∅. As far as the observed feature statistics {µ(α) obs , α = 1, . . . , K} are concerned, all the distributions in ÄS explain them equally well, and they are not distinguishable from f (I). The ME principle (Jaynes, 1957) suggests that we should choose p(I) that achieves the maximum entropy to obtain the purest and simplest fusion of the observed features and their statistics. The underlying philosophy is that while p(I) satisfies the constraints along some dimensions, it should be made as random (or smooth) as possible in other unconstrained dimensions, that is, p(I) should represent information no more than that is available and in this sense, the ME principle is often called the minimum prejudice principle. Thus we have the following constrained optimization problem, ½ Z ¾ p(I) = arg max − p(I) log p(I)dI , (2.2) subject to Ep [φ (α) (I)] = and
Z
φ (α) (I)p(I)dI = µ(α) obs ,
α = 1, . . . , K,
Z p(I)dI = 1.
By an application of the Lagrange multipliers, it is well known that the solution for p(I) has the following Gibbs distribution form: ( ) K X 1 exp − hλ(α) , φ (α) (I)i , (2.3) p(I; 3, S) = Z(3) α=1 where 3 = (λ(1) , λ(2) , . . . , λ(K) ) is the parameter, λ(α) is a vector of the same dimension as φ (α) (I), h· , ·i denotes inner product, and ) ( Z K X (α) (α) hλ , φ (I)i dI Z(3) = exp − α=1
1634
Song Chun Zhu, Ying Nian Wu, and David Mumford
is the partition function, which normalizes p(I; 3) into a probability distribution. 2.2 Estimation and Computation. Equation 2.3 specifies an exponential family of distributions (Brown, 1986), 2S = {p(I; 3, S) : 3 ∈ Rd },
(2.4)
ˆ which where d is the total number of parameters, and 3 is solved at 3, ˆ satisfies the constraints p(I; 3, S) ∈ ÄS , that is, [φ (α) (I)] = µ(α) Ep(I;3,S) ˆ obs . α = 1, . . . , K.
(2.5)
However, analytical solution of equation 2.5 is in general unavailable; inˆ S) iteratively from 2S by maximum likelihood stead, we solve for p(I; 3, estimator. 1 PM obs Let L(3, S) = M i=1 log p(Ii ; 3, S) be the log-likelihood function for any p(I; 3, S) ∈ 2S , and it has the following properties: 1 ∂Z ∂L(3, S)) (α) (α) =− − µ(α) ∀α, obs = Ep(I;3,S) [φ ] − µobs , ∂λ(α) Z ∂λ(α) ∂ 2 L(3, S) (β) 0 (α) (β) (I) − µ(α) (I) − µobs ) ], ∀α, β. 0 = Ep(I;3,S) [(φ obs )(φ ∂λ(α) λ(β)
(2.6) (2.7)
Following equation 2.6, maximizing the log likelihood by gradient ascent gives the following equation for solving 3 iteratively: dλ(α) = Ep(I;3,S) [φ (α) (I)] − µ(α) obs , α = 1, . . . , K. dt
(2.8)
ˆ Moreover, equation 2.7 means Obviously equation 2.8 converges to 3 = 3. that the Hessian matrix of L(3, S) is the covariance matrix (φ (1) P(I), . . . , φ (K) (I)) and thus is positive definite under the condition that a(0) + Kα=1 a(α) φ (α) (I) ≡ 0 H⇒ a(α) = 0 for α = 0, . . . , K, which is usually satisfied. So L(3, S) is strictly concave with respect to 3, and the solution for 3 uniquely exists. Following equations 2.5, 2.6, and 2.7, we have the following conclusion. ˆ S)} where 3 ˆ is both Proposition 1. Given a feature set S, ÄS ∩ 2S = {p(I; 3, the maximum entropy estimator and the maximum likelihood estimator. At each step t of equation 2.8, the computation of Ep(I;3,S) [φ (α) (I)] is in general difficult, and we adopt the stochastic gradient method (Younes 1988) for approximation. For a fixed 3, we synthesize some typical images
Minimax Entropy Principle
1635
syn
{Ii , i = 1, . . . , M0 } by sampling p(I; 3, S) with the Gibbs sampler (Geman & Geman, 1984) or other Markov chain Monte Carlo (MCMC) methods (Winkler, 1995), and approximate Ep(I;3,S) [φ (α) (I)] by the sample means; that is, Ep(I;3,S) [φ
(α)
(I)] ≈
µ(α) obs (3)
M0 1 X syn = 0 φ (α) (Ii ), α = 1, . . . , K. M i=1
(2.9)
Therefore the iterative equation for computing 3 becomes dλ(α) (α) = 1(α) (3) = µ(α) syn (3) − µobs , α = 1, . . . , K. dt
(2.10)
For the accuracy of the approximation in equation 2.9, the sample size M0 should be large enough. The data flow for parameter estimation is shown in Figure 2, and the details of the algorithm can be found in (Zhu, Wu, & Mumford, 1996). 2.3 The Minimum Entropy Principle. For now, suppose that the sample size M is large enough so that the expected feature statistics {E f [φ (α) (I)], α = 1, . . . , K} can be estimated exactly by neglecting the estimation errors in the ? observed statistics {µ(α) obs , α = 1, . . . , K}. Then an ME distribution p(I; 3 , S) is computed so that it reproduces the expected statistics of a feature set S = {φ (α) , α = 1, 2, . . . , K}; that is, Ep(I;3? ,S) [φ (α) (I)] = E f [φ (α) (I)], α = 1, . . . , K. Since our goal is to make an inference about the underlying distribution f (I), the goodness of this model can be measured by the Kullback-Leibler (Kullback & Leibler, 1951) divergence from f (I) to p(I; 3? , S), Z f (I) KL( f, p(I; 3? , S)) = dI f (I) log p(I; 3? , S) = E f [log f (I)] − E f [log p(I; 3? , S)]. For KL( f, p(I; 3? , S)), we have the following conclusion: Theorem 1. entropy( f ).
In the above notation, KL( f, p(I; 3? , S)) = entropy(p(I; 3? , S)) −
See the appendix for a proof. In the above result, entropy( f ) is fixed, and entropy(p(I; 3? , S)) depends on the set of features S included in the distribution p(I; 3? , S). Thus minimizing KL( f, p(I; 3? , S)) is equivalent to minimizing the entropy of p(I; 3? , S). We call this the minimum entropy principle, and it has the following intuitive interpretations. First, in information theory, p(I; 3? , S) defines a coding scheme with each I assigned a coding length − log p(I; 3? , S) (Shannon, 1948), and entropy (p(I; 3? , S)) = Ep [− log p(I; 3? , S)] stands for the
1636
Song Chun Zhu, Ying Nian Wu, and David Mumford obs
Ii
9 B
=S
=
f
( ) (I) g
f
?
Ef [( ) (I)]
d(( ) )
6
Ep [( ) (I)]
6
B
=S
=
y
f
=
S
?
k) (obs
?
?
?
D(1)
D(2)
6
6
:::
image analysis
-L
D(k)
= (
(1)
6 image synthesis
; (2) ; : : : (k) )
Ep [(2) (I)] : : : Ep [(k) (I)]
6
(2)
6 estimate
(k )
syn
syn
:::
syn
(1) (I)
(2) (I)
:::
(k) (I) g
6
( ) (I) g
:::
(k) (I) g
? estimate
( 1)
6
:::
?
6
( )
the true distribution
Ef [(2) (I)] : : : Ef [(k) (I)]
Ep [(1) (I)]
syn
f
?
2) (obs
?
f ( I)
sampling
z
(2) (I)
Ef [(1) (I)]
?
(1) (I) 1) (obs
?
R
?
) (obs
i=1;2;::: M
6
I
syn
Ii
i=1;2;:::M 0
6
:
=
S
?
p(I; L ) sampling by MCMC
the model
Figure 2: Data flow of the algorithm for model estimation and feature selection.
expected coding length. Therefore, a minimum entropy principle chooses the coding system with the shortest average coding length. The shortest ˆ S) is minus average coding length in the actually estimated model p(I; 3, its log likelihood in view of Proposition 2; hence minimizing entropy is the same as maximizing the likelihood of the data: ˆ S), then L(3, ˆ S) = −entropy Proposition 2. Given a feature set S and p(I; 3, ˆ ˆ (p(I; 3, S)) where 3 is the ML estimator. Proof. Since (α) [φ (α) (I)] = µ(α) ∈ S, Ep(I;3,S) ˆ obs , ∀φ
) ( M K X 1 X (α) (α) obs ˆ − ˆ S) = hλˆ , φ (Ii )i − log Z(3) L(3, M i=1 α=1
Minimax Entropy Principle
ˆ − = − log Z(3)
1637 K X α=1
ˆ − = − log Z(3)
K X α=1
hλˆ (α) , µ(α) obs i (α) hλˆ (α) , Ep(I;3) ˆ [φ (I)]i
ˆ S)). = −entropy(p(I; 3, However, to keep the model complexity under check, one often needs to fix the number of features K. To be precise, let B be the set of all possible features and S ⊂ B an arbitrary set of K features. Therefore entropy minimization provides a criterion for choosing the optimal set of features; that is, S∗ = arg min entropy(p(I; 3? , S)). |S|=K
(2.11)
According to the maximum entropy principle, p(I; 3? , S) = arg max entropy(p). p∈ÄS
(2.12)
Combining equations 2.11 and 2.12, we have S∗ = arg min {max entropy(p)}. |S|=K p∈ÄS
(2.13)
We call equation 2.13 the minimax entropy principle. We have demonstrated that this principle is consistent with the goal of modeling: Finding the best estimate for the underlying distribution f (I), and the relationship between minimax entropy and maximum likelihood estimator is addressed by Propositions 1 and 2. 2.4 Feature Pursuit. Enumerating all possible sets of features S ⊂ B and comparing their entropies is certainly impractical. Instead, we propose a greedy procedure to pursue the features in the following way.1 Start from an empty feature set ∅ and p(I) a uniform distribution, add to the model one feature at a time such that the added feature leads to the maximum decrease in the entropy of ME model p(I; 3? , S), and keep doing this until the entropy decrease is smaller than a certain value. To be precise, let S = {φ (α) , α = 1, . . . , K} be the currently selected set of features, and let ( ) K X 1 (α) (α) exp − hλ , φ (I)i p = p(I; 3, S) = Z(3) α=1
(2.14)
1 We use the word pursuit to represent the stepwise method and distinguish it from selection.
1638
Song Chun Zhu, Ying Nian Wu, and David Mumford
be the ME distribution fitted to f (I) (we omit ? from 3 for notational simplicity in this subsection). For any new feature φ (β) ∈ B /S, let S+ = S∪{φ (β) } be a new feature set. The new ME distribution is p+ = p(I; 3+ , S+ ) ( ) K X 1 (β) (α) (α) (β) = exp − hλ+ , φ (I)i − hλ+ , φ (I)i . Z(3+ ) α=1
(2.15)
(α) Ep+ [φ (α) (I)] = E f [φ (α) (I)] for α = 1, 2, . . . , K, β, and in general, λ(α) + 6= λ for α = 1, . . . , K. According to the above discussion, we choose feature φ (K+1) to maximize the entropy decrease over the remaining features; that is,
φ (K+1) = arg max d(φ (β) ), φ (β) ∈B/S
where d(φ (β) ) = KL( f, p) − KL( f, p+ ) = entropy(p) − entropy(p+ ) = KL(p+ , p) is the entropy decrease. Let 8(I) = (φ (1) (I), . . . , φ (K) (I)), since Ep+ [8(I)] = Ep [8(I)] = E f [8(I)], d(φ (β) ) is a function of the difference between p+ and p on feature φ (β) . By second-order Taylor expansion, d(φ (β) ) can be expressed in a quadratic form. Proposition 3. d(φ (β) ) =
In the above notation, 1 0 (Ep [φ (β) (I)] − E f [φ (β) (I)]) 2 (β) × Vp−1 (I)] − E f [φ (β) (I)]), 0 (Ep [φ
(2.16)
where p0 is a distribution such that Ep0 [8(I)] = E f [8(I)], and Ep0 [φ (β) (I)] lies −1 V12 , where V11 = between Ep [φ (β) (I)] and E f [φ (β) (I)]. Vp0 = V22 − V21 V11 0
Varp0 [8(I)], V22 = Varp0 [φ (β) (I)], V12 = Covp0 [8(I), φ (β) (I)], and V21 = V12 . See the appendix for proof.
(β) −1 , and let φ⊥ (I) = The Vp0 can be interpreted as follows. Let C = −V12 V11 (β) (β) φ (I)+C8(I) be the linear combination of φ (I) and 8(I); then under p0 , it (β)
(β)
can be shown that φ⊥ (I) is uncorrelated with 8(I), thus Vp0 = Varp0 [φ⊥ (I)] is the variance of φ (β) (I) with its dependence on 8(I) being eliminated. (β) In practice, E f [φ (β) (I)] is estimated by the observed statistic µobs , and (β)
Ep [φ (β) (I)] by µsyn —the sample mean computed from synthesized images
Minimax Entropy Principle
1639
sampled from the current model p. If the intermediate distribution p0 is approximated using the current distribution p, the distance d(φ (β) ) is approximated by d(φ (β) ) ≈
´0 ³ ´ 1 ³ (β) (β) −1 (β) µobs − µ(β) V − µ µ syn p syn , obs 2
(2.17)
(β)
where Vp = Varp (φ⊥ ) is the variance estimated from the synthesized images. We will further study the estimation of Vp in section 3.2. The feature pursuit procedure governed by equation 2.17 has the following intuitive interpretation. Under the current model p, for any new feature (β) φ (β) , µsyn is the statistic we observe from the image samples following p. If (β) (β) µsyn is close to µobs , then adding this new feature to p(I; 3) leads to little improvement in estimating f (I). So we should look for the most salient new (β) (β) feature φ (β) such that µsyn is very different from µobs . The saliency of the (β) new feature is measured by d(φ (β) ), which is the discrepancy between µsyn (β) and µobs scaled by Vp , where Vp is the variance of the new feature compensated for dependence of the new feature on the old ones under the current model. As a summary, Figure 2 illustrates the data flow for both the computation of the model and the pursuit of features. 3 More on Minimax Entropy 3.1 Correcting the Minimum Entropy Principle. In previous sections, for a set of features S = {φ (α) , α = 1, . . . , K}, we have studied two ME distriˆ S), which reproduces the observed feature statistics, butions. One is p(I; 3, that is, [φ (α) (I)] = µ(α) Ep(I;3,S) ˆ obs ,
for α = 1, . . . , K,
and the other is p(I; 3? , S), which reproduces the expected feature statistics, that is, Ep(I;3? ,S) [φ (α) (I)] = E f [φ (α) (I)],
for α = 1, . . . , K.
In the previous derivations, we assume that {E f [φ (α) (I)], α = 1, . . . , K} can be estimated exactly by the observed statistics {µ(α) obs , α = 1, . . . , K}, which is not true in practice since only a finite sample is observed. Taking the estimation errors into account, we need to correct the minimum entropy principle and the feature pursuit procedure. First, let us consider the minimum entropy principle, which relates the Kullback-Leibler divergence KL( f, p(I; 3, S)) to the entropy of the model ˆ the goodp(I; 3, S) for 3 = 3? . Since in practice 3 is estimated at 3, ˆ S)) instead ness of the actual model should be measured by KL( f, p(I; 3, of KL( f, p(I; 3? , S)), for which we have:
1640
Song Chun Zhu, Ying Nian Wu, and David Mumford
Proposition 4.
In the above notation,
ˆ S)) = KL( f, p(I; 3? , S)) + KL(p(I; 3? , S), p(I; 3, ˆ S)). (3.1) KL( f, p(I; 3, See the appendix for proof. ˆ S) does not come as close That is, because of the estimation error, p(I; 3, to f (I) as p(I; 3? , S) does, and the extra noise is measured by KL(p(I; 3? , S), ˆ S)). In fact, 3 ˆ in model p(I; 3, ˆ S) is a random variable depending p(I; 3, obs ˆ S)). Let Eobs on the random sample {Ii , i = 1, . . . , M}; so is KL( f, p(I; 3, stands for the expectation with respect to the training images. Applying Eobs to both sides of equation 3.1, we have, ˆ S))] Eobs [KL( f, p(I; 3, ˆ S))] = KL( f, p(I; 3? , S)) + Eobs [KL(p(I; 3? , S), p(I; 3, = entropy(p(I; 3? , S)) − entropy( f ) ˆ S))]. + Eobs [KL(p(I; 3? , S), p(I; 3,
(3.2)
The following proposition relates entropy(p(I; Λ*, S)) to entropy(p(I; Λ̂, S)).

Proposition 5.
In the above notation,
entropy(p(I; Λ*, S)) = E_obs[entropy(p(I; Λ̂, S))] + E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))].
(3.3)
See the appendix for proof.

According to Proposition 5, the entropy of p(I; Λ̂, S) is on average smaller than the entropy of p(I; Λ*, S); this is because Λ̂ is estimated from specific training data, and hence p(I; Λ̂, S) does a better job than p(I; Λ*, S) in fitting the training data. Combining equations 3.2 and 3.3, we have

E_obs[KL(f, p(I; Λ̂, S))] = E_obs[entropy(p(I; Λ̂, S))] − entropy(f) + C1 + C2,
(3.4)
where the two correction terms are

C1 = E_obs[KL(p(I; Λ*, S), p(I; Λ̂, S))],
C2 = E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))].

Following Ripley (1996, sec. 2.2), both C1 and C2 can be approximated by

(1/2M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) + O(M^{−3/2}),
where tr( ) is the trace of a matrix. Therefore, we arrive at the following form of the Akaike information criterion (Akaike, 1977):

E_obs[KL(f, p(I; Λ̂, S))] ≈ E_obs[entropy(p(I; Λ̂, S))] − entropy(f) + (1/M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]),

where we drop the higher-order term O(M^{−3/2}). The optimal set of features should be chosen to minimize E_obs[KL(f, p(I; Λ̂, S))], which leads to the following correction of the minimum entropy principle:

S* = arg min_{|S|=K} { entropy(p(I; Λ̂, S)) + (1/M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) }.   (3.5)
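The trace penalty in equation 3.5 is directly computable from samples. A minimal sketch, assuming Var_f and Var_{p*} are replaced by the sample covariances of observed and synthesized feature vectors (the approximation the text introduces next):

```python
import numpy as np

def complexity_penalty(phi_obs, phi_syn):
    """(1/M) tr(Var_f[Phi] Var_p*^{-1}[Phi]) estimated from samples.
    phi_obs: (M, K) observed feature vectors; phi_syn: (M', K) synthesized.
    If the two covariances agree, the trace is roughly K, the number of
    free parameters, matching the reading given in the text."""
    M = phi_obs.shape[0]
    var_f = np.atleast_2d(np.cov(phi_obs, rowvar=False))
    var_p = np.atleast_2d(np.cov(phi_syn, rowvar=False))
    return np.trace(var_f @ np.linalg.inv(var_p)) / M
```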
In practice, Var_f[Φ(I)] and Var_{p*}[Φ(I)] can be estimated from the observed images and synthesized images, respectively. If Var_f[Φ(I)] ≈ Var_{p*}[Φ(I)], then tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) is approximately the number of free parameters in the model. This provides another reason for restricting the model complexity, besides scientific parsimony and computational efficiency. Another perspective on this issue is the minimum description length (MDL) principle (Rissanen, 1989).

Now let us consider correcting the feature pursuit procedure. Following the notation in section 2.4, at each step K + 1, suppose we choose a new feature φ^(β), and let Φ+(I) = (Φ(I), φ^(β)(I)); the decrease of the expected Kullback-Leibler divergence is:

E_obs[KL(f, p)] − E_obs[KL(f, p+)] = d(φ^(β)) − (1/M)[tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) − tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)])].

By linear algebra, we can show that

tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) − tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) = tr(Var_f[φ⊥^(β)(I)] Var_{p*+}^{−1}[φ⊥^(β)(I)]).   (3.6)
See the appendix for the proof of equation 3.6. Therefore, at every step of the (corrected) feature pursuit procedure, we should choose φ^(β) to maximize

d′(φ^(β)) = d(φ^(β)) − (1/M) tr(Var_f[φ⊥^(β)] Var_{p*+}^{−1}[φ⊥^(β)]).

In practice, we approximate Var_{p*+}[φ⊥^(β)] by Var_p[φ⊥^(β)], and estimate the variances from the observed and synthesized images. Let µ_⊥obs^(β) and V̂_obs be
the sample mean and variance of {φ⊥^(β)(I_i^obs), i = 1, 2, . . . , M}, and let V̂_syn be the sample variance of {φ⊥^(β)(I_i^syn), i = 1, 2, . . . , M′}. Thus we have

d′(φ^(β)) ≈ (1/2) (µ_syn^(β) − µ_obs^(β))′ V̂_syn^{−1} (µ_syn^(β) − µ_obs^(β)) − (1/M) tr(V̂_obs V̂_syn^{−1}).   (3.7)
We note that in equation 3.7,

tr(V̂_obs V̂_syn^{−1}) = tr( (1/M) Σ_{i=1}^M (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β)) (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))′ V̂_syn^{−1} )
= (1/M) Σ_{i=1}^M (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))′ V̂_syn^{−1} (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))

is a measure of fluctuation in the observed images.

The intuitive meaning of equation 3.7 is the following. The first term is the distance between µ_syn^(β) and µ_obs^(β), and we call it the gain from introducing a new feature φ^(β). The second term measures the uncertainty in estimating E_f[φ^(β)(I)], and we call it the loss from adding φ^(β). If the loss term is large, it means the feature is less common to the observed images; thus d′(φ^(β)) is small. When µ_syn^(β) comes very close to µ_obs^(β), d′(φ^(β)) becomes negative, which provides a criterion for stopping the iteration in computing Λ in equation 2.10.

3.2 Variance Estimation in Homogeneous Random Fields. In previous sections, we assume that we have M independent observations I_i^obs, i = 1, 2, . . . , M, each of the same size as the image domain D (N × N pixels). A feature φ^(β)(I_i^obs) is computed based on the intensities of an entire image, and the sample mean and variance are then computed from φ^(β)(I_i^obs), i = 1, 2, . . . , M. The same is true for synthesized images. However, in many applications, such as the texture modeling in the next section, it is assumed that the underlying distribution f(I) is ergodic and images I are homogeneous; thus, φ^(β)(I) is often expressed as the average of local features ψ^(β)( ):
φ^(β)(I) = (1/|D|) Σ_{v∈D} ψ^(β)(I|_{W+v}),
where ψ^(β) is a function defined on locally supported (filter) windows W centered at v ∈ D. Therefore, by ergodicity, we can still estimate E_f[φ^(β)(I)] and Var_f[φ^(β)(I)] reasonably well even though only a single image is observed, provided that the observed image is large enough compared to the strength of autocorrelation.

In particular, we adopt the method recently proposed by Sherman (1996) for the estimation of Var_f[φ^(β)(I)]. To fix notation, suppose we observe one
image I^obs on an N_o × N_o lattice D^obs, and we define subdomains D_m^j ⊂ D^obs, j = 1, 2, . . . , ℓ. For simplicity, we assume all subdomains are square image patches of the same size of m × m pixels; m is usually chosen at the scale of √N_o, and the subdomains may overlap each other. Then for each subdomain D_m^j, we compute φ^(β)(D_m^j) = (1/m²) Σ_{v∈D_m^j} ψ^(β)(I|_{W+v}), and the sample mean and variance are computed over the subdomains:

µ_obs^(β)(D_m) = (1/ℓ) Σ_{j=1}^ℓ φ^(β)(D_m^j),

Var_obs^(β)(D_m) = (1/ℓ) Σ_{j=1}^ℓ (φ^(β)(D_m^j) − µ_obs^(β)(D_m)) (φ^(β)(D_m^j) − µ_obs^(β)(D_m))′.
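These subdomain statistics are simple to compute; the sketch below uses a placeholder local feature ψ (a gradient-energy average) and possibly overlapping square patches, both illustrative choices rather than the paper's.

```python
import numpy as np

def subdomain_stats(image, m, stride):
    """Sample mean and variance of phi(D_m^j) over square m-x-m subdomains
    of one observed image, as in the two displays above."""
    def patch_feature(p):                      # stands in for (1/m^2) sum psi
        return np.array([np.mean(np.abs(np.diff(p, axis=0)))])
    H, W = image.shape
    feats = np.array([patch_feature(image[r:r + m, c:c + m])
                      for r in range(0, H - m + 1, stride)
                      for c in range(0, W - m + 1, stride)])   # (l, dim)
    mu = feats.mean(axis=0)
    var = np.atleast_2d(np.cov(feats, rowvar=False))
    return mu, var
```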
Then, according to Sherman (1996), Var_f[φ^(β)(I)] can be estimated by (m²/N²) Var_obs^(β)(D_m).

Now let us consider the feature pursuit criterion in equation 3.7. For feature φ⊥^(β) we define the variance V̂_obs(D₁) = m² V̂_obs(D_m), where V̂_obs(D_m) is the sample variance of φ⊥^(β)(D_m^j), j = 1, 2, . . . , ℓ. Then from the above result, we can approximate V̂_obs in equation 3.7 by V̂_obs(D₁)/N². Similarly, V̂_syn in equation 3.7 is replaced by V̂_syn(D₁)/N². Thus we have
d′(φ^(β)) ≈ N² · [ (1/2) (µ_syn^(β) − µ_obs^(β))′ V̂_syn^{−1}(D₁) (µ_syn^(β) − µ_obs^(β)) − (1/c) tr(V̂_obs(D₁) V̂_syn^{−1}(D₁)) ],   (3.8)
where c is |D^obs| minus the number of pixels around the boundary. From equation 3.8, we notice that d′(φ^(β)) is proportional to N², the size of the domain D. A more rigorous study is often complicated by phase transition, and we shall not pursue it in this article.

4 Application to Texture Modeling

This section applies the minimax entropy principle to texture modeling.

4.1 The General Problem. Texture is an important characteristic of surface properties in visual scenes and a powerful cue in visual perception. A general model for textures has long been sought in both computational vision and psychology, but such a model is still far from being achieved because
of the vast diversity of the physical and chemical processes that generate textures and the large number of attributes that need to be considered. As an illustration of this diversity, Figure 3 displays some typical texture images. Existing models for textures can be roughly classified into three categories: (1) dynamic equations or replacement rules, which simulate specific physical and chemical processes to generate textures (Witkin & Kass, 1991; Picard, 1996); (2) the kth-order statistics model for texture perception, that is, the famous Julesz conjecture (Julesz, 1962); and (3) MRF models. (For a discussion of previous models and methods, see Zhu et al., 1996.)

In our method, a texture is considered an ensemble of images of similar texture appearance governed by a probability distribution f(I). As discussed in section 2, we seek a model p(I; Λ, S) given a set of observed images. p(I; Λ, S) should be consistent with human texture perception in the sense that if p(I; Λ, S) estimates f(I) closely, the images sampled from p(I; Λ, S) should be perceptually similar to the training images.

4.2 Choosing Features and Their Statistics. As the first step of applying the minimax entropy principle, we need to choose image features and their statistics, that is, φ^(α)(I) and µ_obs^(α), α = 1, 2, . . . , K.

First, we limit our model to homogeneous textures; thus f(I) is stationary with respect to location v. We assume that features of texture images can be extracted by "filters" F^(α), α = 1, 2, . . . , K, where F^(α) can be a linear or nonlinear function of the intensities of the image I. Let I^(α)(v) denote the filter response at point v ∈ D; that is, I^(α)(v) = F^(α)(I|_{W+v}) is a function depending on the intensities inside the window W centered at v.

Second, recent psychophysical research on human texture perception suggests that two homogeneous textures are often difficult to discriminate when they produce similar marginal distributions (histograms) of responses from a bank of filters (Bergen & Adelson, 1991; Chubb & Landy, 1991). Motivated by this research, we make the following assumptions to limit the number of filters and the window size of each filter for computational reasons, though these assumptions are not necessary conditions for our theory to hold true:

1. All features that concern texture perception can be captured by "locally" supported filters. By "locally" we mean that the sizes of the filters should be much smaller than the size of the image. For example, if the image is 256 × 256 pixels, the window sizes of the filters are limited to be less than 33 × 33 pixels.

2. Only a finite set of filters is used.

Given a filter F^(α), we compute the histogram of the filtered image I^(α) as the features of I. Therefore, in texture modeling, the notation φ^(α)(I) is
Figure 3: Some typical texture images.
replaced by

H^(α)(I, z) = (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)),   α = 1, 2, . . . , K,  z ∈ R,

where δ( ) is the Dirac point mass function concentrated at 0. Correspondingly, the observed statistics µ_obs^(α) are defined as

µ_obs^(α)(z) = (1/M) Σ_{i=1}^M H^(α)(I_i^obs, z),   α = 1, 2, . . . , K.
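In code, these features amount to filtering, binning, and averaging. A minimal sketch (L bins over a fixed response range, which in practice would be set from the data; `scipy.signal.convolve2d` performs the linear filtering):

```python
import numpy as np
from scipy.signal import convolve2d

def filter_histogram(image, kernel, L=32, lo=-1.0, hi=1.0):
    """H^(alpha)(I): normalized L-bin histogram of the filtered image,
    the piecewise-constant stand-in for the continuous marginal."""
    resp = convolve2d(image, kernel, mode="same", boundary="symm")
    hist, _ = np.histogram(resp, bins=L, range=(lo, hi))
    return hist / resp.size

def observed_statistic(images, kernel, L=32):
    """mu_obs^(alpha): the histogram averaged over the M observed images."""
    return np.mean([filter_histogram(im, kernel, L) for im in images], axis=0)
```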
(Footnote 2: H^(α)(I, z) and µ_obs^(α)(z) are, in theory, continuous functions of z. In practice, they are approximated by piecewise constant functions over a finite number L of bins, and are denoted by H^(α)(I) and µ_obs^(α) as L-dimensional (e.g., L = 32) vectors in the rest of the article.)

As the sample size M becomes large, or the images I_i^obs are large so that the large-sample effect takes place by ergodicity, µ_obs^(α)(z) will be a close estimate of the marginal distribution of f(I):
f^(α)(z) = E_f[H^(α)(I, z)].

Another motivation for choosing µ_obs^(α)(z) as the feature statistics comes from a mathematical theorem, which states that f(I) is determined by all its marginal distributions f^(α)(z). Thus, if a model p(I) reproduces f^(α)(z) for all α, then p(I) = f(I) (Zhu et al., 1996).

Substituting H^(α)(I) for φ^(α)(I) in equation 2.3, we obtain

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K ⟨λ^(α), H^(α)(I)⟩ },
(4.1)
which we call the FRAME model. Here the angle brackets indicate a sum over bins z: that is, ⟨λ^(α), H^(α)(I)⟩ = Σ_z λ_z^(α) H^(α)(I, z). The computation of the parameters Λ and the selection of the filters F^(α) proceed as described in the last section. For a detailed analysis of the texture modeling algorithm, see Zhu et al. (1996).

4.3 FRAME: A New Class of MRF Models. In this section, we derive a continuous form for the FRAME model in equation 4.1 and compare it with existing MRF models. Compared with the definitions of φ^(α)(I) and µ_obs^(α), H^(α)(I, z) and µ_obs^(α)(z) are considered vectors of infinite dimensions.
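Before passing to the continuous form, note that up to the intractable log Z(Λ) the discrete model in equation 4.1 is log-linear in the histograms, so its unnormalized log-probability is a sum of inner products. A sketch, reusing `filter_histogram` from the earlier sketch:

```python
import numpy as np

def frame_log_prob_unnorm(image, kernels, lambdas, L=32):
    """Unnormalized log p(I; Lambda, S) = -sum_alpha <lambda^(alpha), H^(alpha)(I)>.
    log Z(Lambda) is omitted; it is intractable and cancels in the
    acceptance ratios of the MCMC samplers used to draw from the model."""
    return -sum(np.dot(lam, filter_histogram(image, k, L))
                for k, lam in zip(kernels, lambdas))
```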
Since the histograms of an image are continuous functions, the constraint in the ME optimization problem is the following:

E_p[ (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)) ] = µ_obs^(α)(z),   ∀z ∈ R, ∀v ∈ D, ∀α.   (4.2)

By an application of Lagrange multipliers, maximizing the entropy of p(I) under the above constraints gives

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K ∫ λ^(α)(z) (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)) dz }
= (1/Z(Λ)) exp{ −Σ_{α=1}^K Σ_{v∈D} λ^(α)(I^(α)(v)) }.   (4.3)

Since z is a continuous variable, there is an infinite number of constraints. The Lagrange multipliers Λ = (λ^(1)( ), . . . , λ^(K)( )) take the form of one-dimensional potential functions. More specifically, when the filters are linear, I^(α)(v) = F^(α) * I(v), and we can rewrite equation 4.3 as

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K Σ_v λ^(α)(F^(α) * I(v)) }.   (4.4)
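For linear filters, equation 4.4 gives an energy that is a learned potential applied pointwise to each filter response and summed over locations; a minimal sketch (potentials passed as elementwise callables, an assumption for illustration):

```python
import numpy as np
from scipy.signal import convolve2d

def gibbs_energy(image, kernels, potentials):
    """Energy of equation 4.4: sum_alpha sum_v lambda^(alpha)(F^(alpha) * I(v)),
    so that p(I) ∝ exp(-energy).  Each potential lambda^(alpha) is a callable
    applied elementwise to the filtered image."""
    return sum(float(np.sum(pot(convolve2d(image, k, mode="same",
                                           boundary="symm"))))
               for k, pot in zip(kernels, potentials))
```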
Clearly, equations 4.3 and 4.4 are MRF models or, equivalently, Gibbs distributions. But unlike previous MRF models, the potentials are built directly on the filter responses instead of cliques, and the forms of the potential functions λ^(α)( ) are learned from the training images, so they can incorporate high-order statistics and thus model nongaussian properties of images. The FRAME model has much stronger expressive power than traditional clique-based MRF models. Every filter introduces the same number L of parameters regardless of its window size, which enables us to explore structures at large scales (e.g., the 33 × 33 pixel filters used in modeling the fabric texture in section 4.5). It is easy to show that existing MRF models for texture are special cases of the FRAME model with the filters and their potential functions specified. A detailed comparison between the FRAME model and the MRF models is given in Zhu et al. (1996).

4.4 Designing a Filter Bank. To describe a wide variety of textures, we need to specify a general filter bank, which serves as the "vocabulary" by analogy to language. We shall not discuss the rules for constructing an optimal filter bank; instead, we use the following five kinds of filters, motivated
by the multichannel filtering mechanism discovered and generally accepted in neurophysiology (Silverman, Grosof, De Valois, & Elfar, 1989).

1. The intensity filter, δ( ), for capturing the DC component.

2. The Laplacian of gaussian filters, which are isotropic center-surround filters and are often used to model retinal ganglion cells. The impulse response functions are of the following form:

LG(x, y | T) = const · (x² + y² − T²) e^{−(x² + y²)/T²}.   (4.5)

We choose eight scales with T = √2/2, 1, 2, 3, 4, 5, and 6. The filter with scale T is denoted by LG(T).
3. The Gabor filters, which are models for the frequency- and orientation-sensitive simple cells. The impulse response functions are of the following form:

Gabor(x, y | T, θ) = const · e^{−(1/(2T²))(4(x cos θ + y sin θ)² + (−x sin θ + y cos θ)²)} · e^{−i(2π/T)(x cos θ + y sin θ)},   (4.6)
where T controls the scale and θ the orientation. We choose six scales, T = 2, 4, 6, 8, 10, and 12, and six orientations, θ = 0°, 30°, 60°, 90°, 120°, and 150°. Notice that these filters are not nearly orthogonal to each other, so there is overlap among the information they capture. The sine and cosine components are denoted by G sin(T, θ) and G cos(T, θ), respectively.

4. The nonlinear Gabor filters, which are models for the complex cells; their responses are the powers of the responses from a pair of Gabor filters, |Gabor(x, y | T, θ) * I|². This filter, denoted by SP(T, θ), is, in fact, the local spectrum of I at (x, y) smoothed by a gaussian function.

5. Some specially designed filters for texton primitives (see section 4.5).

4.5 Experiments of Texture Modeling. This section describes the modeling of natural textures using the algorithm studied in sections 2 and 3. The first texture image is described in detail to illustrate the filter pursuit procedure.

Suppose we are modeling f(I), where I is of 64 × 64 pixels. Figure 4a is an observed image of animal fur (128 × 128 pixels). We start from the filter set S = ∅ and p(I; Λ, S) a uniform distribution, from which a uniform white noise image is sampled and displayed in Figure 4b (128 × 128 pixels). The algorithm first computes d′(φ^(1)) according to equations 3.7 and 3.8 for each
Table 1: The Entropy Decrease d′(φ^(β)) for Filter Pursuit.

Filter           Size    d′(φ^(1))  d′(φ^(2))  d′(φ^(3))  d′(φ^(4))  d′(φ^(5))  d′(φ^(8))
δ                1×1       1018.2      42.2      50.8      20.0      26.4     *−1.8
LG(√2/2)         3×3       4205.9     466.0     107.4     172.9      41.6      22.6
LG(1)            5×5      [4492.3]       —         —         —         —      *−1.4
LG(2)            9×9         20.2     465.7     159.3      24.5       6.3      18.5
G cos(2, 0°)     5×5       3140.8     188.3     140.4     137.0    [135.4]    *−3.2
G cos(2, 30°)    5×5       4240.3     668.0     307.6    [317.8]       —      *−1.9
G cos(2, 60°)    5×5       3548.8     124.6      25.1      21.9      14.2       7.5
G cos(2, 90°)    5×5       1063.3      62.1      38.1      90.3      40.7       1.1
G cos(2, 120°)   5×5       1910.7      26.2       2.0       2.5      47.6      16.4
G cos(2, 150°)   5×5       3717.2     220.7     189.2     161.7       9.3      −0.8
G cos(4, 0°)     7×7        958.2      25.7      17.9       5.3       8.2       6.4
G cos(4, 30°)    7×7       2205.8     125.5      61.0      75.2      35.0       0.9
G cos(4, 60°)    7×7       1199.5      32.7      35.4      12.2      10.9       6.9
G cos(4, 90°)    7×7        108.8     229.6     130.6      20.2      31.9      30.2
G cos(4, 120°)   7×7         19.2   [1146.4]       —         —         —      *−2.7
G cos(4, 150°)   7×7        157.5     247.1      10.4     101.9      56.0       3.9
G cos(6, 0°)     11×11      102.1      12.8       4.3      −1.2      19.0       1.8
G cos(6, 30°)    11×11      217.3      54.8       8.4      32.9      11.5      −1.7
G cos(6, 60°)    11×11       85.6       4.7       0.1       4.5       3.8       6.0
G cos(6, 90°)    11×11       13.6     134.8     192.4      −0.4       7.9       1.6
G cos(6, 120°)   11×11      321.7     706.8    [640.3]       —         —      *−2.8
G cos(6, 150°)   11×11        3.8     100.1      12.7      98.6      75.1     *−1.4
G cos(8, 0°)     15×15       −1.6      11.0      −0.2       4.6       9.7      14.3
G cos(8, 30°)    15×15        2.4      33.0       2.1      13.8      12.7      −0.1
G cos(8, 60°)    15×15       10.7       5.5      −1.2       4.1       6.8       1.3
G cos(8, 90°)    15×15      203.0      51.9      71.7       3.9      12.3       6.8
G cos(8, 120°)   15×15      586.8     276.6     361.8      58.2      58.2       3.7
G cos(8, 150°)   15×15      140.1      44.6       1.3      45.5      42.5      38.0

Notes: * This filter has already been chosen; the value is computed using feature φ^(β), not φ⊥^(β). Bracketed numbers (boldface in the original) are the largest in each column and are thus the filters chosen by the algorithm.
filter, and d′(φ^(1)) for some filters is listed in Table 1. Filter LG(1) has the largest entropy decrease and thus is chosen as the first filter, S = {LG(1)}. Then a model p(I; Λ, S) is computed, and a synthesized image is shown in Figure 4c. Comparing Figure 4c with Figure 4b, it is evident that this filter captures the local smoothness of the observed texture image. Continuing the algorithm, six more filters are sequentially added: (2) G cos(4, 120°); (3) G cos(6, 120°); (4) G cos(2, 30°); (5) G cos(2, 0°); (6) G cos(6, 150°); and (7) the intensity filter δ( ). The texture images synthesized using 3, 4, and 7 filters are displayed in Figures 4d-f. Clearly, as more filters are added, the synthesized texture image gets closer to the observed one.

After choosing seven filters, the entropy decrease for all filters becomes very small; some are negative. Similar results are observed for the filters not listed in Table 1. This confirms our earlier assumption that the marginal distributions of a small
Figure 4: Synthesis of the fur texture. (a) The observed image. (b-f) The synthesized images using 0, 1, 3, 4, and 7 filters, respectively.
Figure 5: (a) The observed texture: mud. (b) The synthesized one using five filters.
Figure 6: (a) The observed texture image: cheetah blob. (b) The synthesized one using six filters.
number of filtered images should be adequate for capturing the essential features of the underlying probability distribution f(I). (Footnote 3: The synthetic fur texture in these figures is better than that in Zhu et al. (1996), since the L1 criterion used for filter pursuit there has been replaced by the criterion of equation 3.7.)

Figure 5a is a scene of mud ground with scattered animal footprints, which are filled with water and thus appear brighter. This texture image shows sparse features. Figure 5b is the synthesized texture image using five filters.

Figure 6a is an image taken from the skin of a cheetah, and Figure 6b displays the synthesized texture using six filters. Notice that the original
Figure 7: (a) The input image of fabric. (b) The synthesized image with two pairs of Gabor filters plus the Laplacian of gaussian filter. (c, d) Two more images sampled at different steps of the Gibbs sampler.
observed texture image is not homogeneous: the shapes of the blobs vary systematically with spatial location, and the upper left corner is darker than the lower right. The synthesized texture, shown in Figure 6b, also has elongated blobs introduced by the different filters, but the bright pixels seem to spread uniformly across the image due to the effect of entropy maximization.

Figure 7a shows a texture of fabric that has clear periods along both the horizontal and vertical directions. We choose two nonlinear filters: the spectrum analyzers SP(17, 0°) and SP(17, 90°), with their periods T tuned to the periods of the texture; the window sizes of the filters are 33 × 33 pixels. We also use the intensity filter δ( ) and the filter LG(√2/2) to take care of the
Figure 8: Two typical texton images of 256 × 256 pixels: (a) circle and (b) cross. (c, d) The two synthesized images of 128 × 128 pixels.
intensity histogram and the smoothness features. Three synthesized texture images are displayed in Figures 7b-d, taken at different sampling steps. This experiment shows that once the Markov chain becomes stationary, or gets close to stationary, the images sampled from p(I) always have perceptually similar appearances but different details.

Figures 8a and 8b show two special binary texture images formed from identical textons (circles and crosses), which have been studied extensively by psychologists for the purpose of understanding human texture perception. Our interest here is to see whether this class of textures can still be modeled by FRAME. We use the linear filter whose impulse response function is a mask with the corresponding texton at the center. With this filter selected, Figure 1b plots the histograms of the filtered image F * I, with I being the texton image observed in Figure 8a (solid curve) and a uniform noise image (dotted curve). Observe that there are many isolated peaks in the observed histogram, which stand for important image features. The computation of the model is complicated by the nature of such isolated peaks, and we proposed an annealing approach for computing Λ (for details see Zhu et al.,
1996). Figures 8c and 8d show two synthesized images.

5 Discussion

This article proposes a minimax entropy principle for building probability models in a variety of applications. Our theory answers two major questions. The first is feature binding or feature fusion: how to integrate image features and their statistics into a single joint probability distribution without limiting the forms of the features. The second is feature selection: how to choose a set of features that best characterizes the observed images. Algorithms are proposed for parameter estimation and stochastic simulation. A greedy algorithm is developed for feature pursuit, and the minimax entropy principle is corrected for the presence of sample variations.

As an example of its applications, we apply the minimax entropy principle to modeling textures. There are various artificial categories for textures with respect to various attributes, such as Fourier and non-Fourier, deterministic and stochastic, and macro- and microtextures. FRAME erases these artificial boundaries and characterizes all of them in a unified model with different filters and parameter values. It has been well recognized that the traditional MRF models, as special cases of FRAME, can be used to model stochastic, non-Fourier microtextures. From the textures we synthesized, it is evident that FRAME is also capable of modeling periodic and deterministic textures (fabric), textures with large-scale elements (fur and cheetah blobs), and textures with distinguishable textons (circles and cross bars).

Our method for texture modeling was inspired by and bears some similarities to the recent work by Heeger and Bergen (1995) on texture synthesis, where many natural-looking texture images are successfully synthesized by matching the histograms of filter responses organized in the form of a pyramid. Compared with Heeger and Bergen's algorithm, the FRAME model is distinctive in the following aspects. First, we obtain a probability model p(I; Λ, S) instead of merely synthesizing texture images. Second, the Markov chain Monte Carlo procedure for model estimation and texture sampling is guaranteed to converge to a stationary process that follows the estimated distribution p(I; Λ, S) (Geman & Geman, 1984), and the observed histograms can be matched closely. However, the FRAME model is computationally expensive, and approaches for further facilitating the computation are yet to be developed. For more discussion on this aspect, see Zhu et al. (1996).

Many textures are still difficult to model, such as the two human-made cloth textures shown in Figure 9. It appears that synthesizing such textures requires far more sophisticated features than those we have used in the texture modeling experiments, and these features may correspond to a high-level visual process, such as the geometric properties of object shape. In this article, we choose filters from a fixed set, but in general it is not understood how to design such a set of features or structures for arbitrary applications.
Figure 9: Two challenging texture images.
An important issue is whether the minimax entropy principle for model inference is "biologically plausible" and might be considered a model for the method used by natural intelligences in constructing models of classes of images. From a computational standpoint, the maximum entropy phase of the algorithm consists mainly of approximating the values of the Lagrange multipliers, which we have done by hill climbing with respect to the log likelihood. Specifically, we have used Monte Carlo methods to sample our distributions and plugged the sampled statistics into the gradient of the log likelihood. One of the authors has conjectured that feedback pathways in the cortex may serve the function of forming mental images on the basis of learned models of the distribution on images (Mumford, 1992). Such a mechanism might well sample by Monte Carlo as in the algorithm in this article. That theory further postulated that the cortex seeks out the "residuals," the features of the observed image that differ from those of the mental image. The algorithm shows how such residuals can be used to drive a learning process in which the Lagrange multipliers are gradually improved to increase the log likelihood. We would conjecture that these Lagrange multipliers are stored as suitable synaptic weights in the higher visual areas or in the top-down pathway. Given the massively parallel architecture, the apparent stochastic component in neural firing, and the huge number of observed images processed every day, the computational load of our algorithm may not be excessive for cortical implementation.

The minimum entropy phase of our algorithm has some direct experimental evidence in its favor. There has been extensive psychophysical experimentation on the phenomenon of preattentive texture discrimination.
We propose that textures that can be preattentively discriminated are exactly those for which suitable filters have been incorporated into a minimum entropy cortical model, and that the process by which subjects can train themselves to discriminate new sets of textures preattentively is exactly that of incorporating a new filter feature into the model. Evidence that texture pairs that are not preattentively segmentable by naive subjects become segmentable after practice has been reported by many groups, most notably by Karni and Sagi (1991). The remarkable specificity of the reported texture discrimination learning suggests that very specific new filters are incorporated into the cortical texture model, as in our theory.

Appendix: Mathematical Details

Proof of Theorem 1. Let Λ* = (λ*^(1), λ*^(2), . . . , λ*^(K)) be the parameter. By definition we have E_{p(I;Λ*,S)}[φ^(α)(I)] = E_f[φ^(α)(I)], α = 1, . . . , K. Then

E_f[log p(I; Λ*, S)] = −E_f[log Z(Λ*)] − Σ_{α=1}^K E_f[⟨λ*^(α), φ^(α)(I)⟩]
= −log Z(Λ*) − Σ_{α=1}^K ⟨λ*^(α), E_f[φ^(α)(I)]⟩
= −log Z(Λ*) − Σ_{α=1}^K ⟨λ*^(α), E_{p(I;Λ*,S)}[φ^(α)(I)]⟩
= E_{p(I;Λ*,S)}[log p(I; Λ*, S)] = −entropy(p(I; Λ*, S)),

and the result follows.

Proof of Proposition 3. Let Φ(I) = (φ^(1)(I), . . . , φ^(K)(I)) and Φ+(I) = (Φ(I), φ^(β)(I)). We have the entropy decrease

d(φ^(β)) = KL(p+; p)
= (1/2) (E_p[Φ+(I)] − E_{p+}[Φ+(I)])′ Var_{p′}[Φ+(I)]^{−1} (E_p[Φ+(I)] − E_{p+}[Φ+(I)])   (A.1)
= (1/2) (E_p[φ^(β)(I)] − E_f[φ^(β)(I)])′ V_{p′}^{−1} (E_p[φ^(β)(I)] − E_f[φ^(β)(I)]).   (A.2)
Equation A.1 follows from a second-order Taylor expansion argument (corollary 4.4 of Kullback, 1959, p. 48), where p′ is a distribution whose expected feature statistics are between those of p and p+, and

Var_{p′}[Φ+(I)] = ( Var_{p′}[Φ(I)]            Cov_{p′}[Φ(I), φ^(β)(I)]
                    Cov_{p′}[φ^(β)(I), Φ(I)]  Var_{p′}[φ^(β)(I)] )
               = ( V11  V12
                    V21  V22 ).

Equation A.2 results from the fact that E_{p+}[Φ(I)] = E_p[Φ(I)], and by the Schur formula, it is well known that V_{p′} = V22 − V21 V11^{−1} V12.

Proof of Proposition 4. From the proof of Theorem 1, we know E_f[log p(I; Λ*, S)] = E_{p(I;Λ*,S)}[log p(I; Λ*, S)], and by a similar derivation we have E_f[log p(I; Λ, S)] = E_{p(I;Λ*,S)}[log p(I; Λ, S)] for any Λ. Then

KL(f, p(I; Λ, S)) = E_f[log f(I)] − E_f[log p(I; Λ, S)]
= E_f[log f(I)] − E_{p(I;Λ*,S)}[log p(I; Λ, S)]
= E_f[log f(I)] − E_f[log p(I; Λ*, S)] + E_{p(I;Λ*,S)}[log p(I; Λ*, S)] − E_{p(I;Λ*,S)}[log p(I; Λ, S)]
= KL(f, p(I; Λ*, S)) + KL(p(I; Λ*, S), p(I; Λ, S)).

The result follows by setting Λ = Λ̂.

Proof of Proposition 5. By Proposition 2 we have L(Λ̂, S) = −entropy(p(I; Λ̂, S)). By a similar derivation, we can prove that E_obs[L(Λ*, S)] = −entropy(p(I; Λ*, S)) and

L(Λ̂, S) − L(Λ*, S) = KL(p(I; Λ̂, S), p(I; Λ*, S)).
(A.3)
Applying E_obs to both sides of equation A.3, we have

−E_obs[entropy(p(I; Λ̂, S))] + entropy(p(I; Λ*, S)) = E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))],

and the result follows.

Proof of Equation 3.6.
To simplify the notation, we denote

Var_{p*+}[Φ+(I)] = ( Var_{p*+}[Φ(I)]            Cov_{p*+}[Φ(I), φ^(β)(I)]
                      Cov_{p*+}[φ^(β)(I), Φ(I)]  Var_{p*+}[φ^(β)(I)] )
                 = ( X11  X12
                      X21  X22 ),

A = ( I1              0
       −X21 X11^{−1}   I2 ),

and

B = ( Var_{p*+}[Φ(I)]   0
       0                 Var_{p*+}[φ⊥^(β)(I)] ),

where I1, I2 are identity matrices, and φ⊥^(β)(I) is uncorrelated with Φ(I) under p*+. So we have A Φ+(I) = (Φ(I), φ⊥^(β)(I))′ and A Var_{p*+}[Φ+(I)] A′ = B. Thus Var_{p*+}^{−1}[Φ+(I)] = A′ B^{−1} A, and since Var_{p*+}[Φ(I)] = Var_{p*}[Φ(I)], therefore

tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) = tr(Var_f[Φ+(I)] A′ B^{−1} A)
= tr((A Var_f[Φ+(I)] A′) B^{−1})
= tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) + tr(Var_f[φ⊥^(β)(I)] Var_{p*+}^{−1}[φ⊥^(β)(I)]),
and equation 3.6 follows.

Acknowledgments

We are very grateful to two anonymous referees, whose insightful comments greatly improved the presentation of this article. This work is supported by NSF grant DMS-91-21266 to David Mumford. The second author is supported by a grant to Donald B. Rubin.

References

Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of statistics (pp. 27–42). Amsterdam: North-Holland.
Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Computation, 1, 412–423.
Bergen, J. R., & Adelson, E. H. (1991). Theories of visual texture perception. In D. Regan (Ed.), Spatial vision. Boca Raton, FL: CRC Press.
Besag, J. (1973). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Royal Stat. Soc., B, 36, 192–236.
Brown, L. D. (1986). Fundamentals of statistical exponential families: With applications in statistical decision theory. Hayward, CA: Institute of Mathematical Statistics.
Chubb, C., & Landy, M. S. (1991). Orthogonal distribution analysis: A new approach to the study of texture perception. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing. Cambridge, MA: MIT Press.
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy based algorithms for best basis selection. IEEE Trans. on Information Theory, 38, 713–718.
Cross, G. R., & Jain, A. K. (1983). Markov random field texture models. IEEE Trans. PAMI, 5, 25–39.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of Optical Soc. Amer., 2(7).
Dayan, P., Hinton, G. E., Neal, R. N., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–905.
Donoho, D. L., & Johnstone, I. M. (1994). Ideal de-noising in an orthonormal basis chosen from a library of bases. Acad. Sci. Paris, Ser. I, 319, 1317–1322.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. PAMI, 6, 721–741.
Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Computer Graphics Proceedings (pp. 229–238).
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630.
Jolliffe, I. T. (1986). Principal components analysis. New York: Springer-Verlag.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Julesz, B. (1962). Visual pattern discrimination. IRE Transactions of Information Theory, IT-8, 84–92.
Julesz, B. (1995). Dialogues on perception. Cambridge, MA: MIT Press.
Karni, A., & Sagi, D. (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proc. Nat. Acad. Sci. U.S.A., 88, 4966–4970.
Kullback, S. (1959). Information theory and statistics. New York: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annual Math. Stat., 22, 79–86.
Mallat, S. (1989). Multi-resolution approximations and wavelet orthonormal bases of L2(R). Trans. Amer. Math. Soc., 315, 69–87.
Mumford, D. B. (1992). On the computational architecture of the neocortex II: The role of cortico-cortical loops. Biological Cybernetics, 66, 241–251.
Mumford, D. B., & Shah, J. (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math., 42, 577–684.
Picard, R. W. (1996). A society of models for video and image libraries (Technical Rep. No. 360). Cambridge, MA: MIT Media Lab.
Priestley, M. B. (1981). Spectral analysis and time series. San Diego: Academic Press.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Sherman, M. (1996). Variance estimation for statistics computed from spatial lattice data. J. R. Statistics Soc., B, 58, 509–523.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27.
Silverman, M. S., Grosof, D. H., De Valois, R. L., & Elfar, S. D. (1989). Spatial-frequency organization in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 86.
Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multi-scale transforms. IEEE Trans. on Information Theory, 38, 587–607.
Watson, A. (1987). Efficiency of a model human image code. J. Opt. Soc. Am. A, 4(12), 2401–2417.
Winkler, G. (1995). Image analysis, random fields and dynamic Monte Carlo methods. Berlin: Springer-Verlag.
Witkin, A., & Kass, M. (1991). Reaction-diffusion textures. Computer Graphics, 25, 299–308.
Xu, L. (1995). Ying-Yang machine: A Bayesian-Kullback scheme for unified learnings and new results on vector quantization. Proc. Int'l Conf. on Neural Info. Proc. Hong Kong.
Younes, L. (1988). Estimation and annealing for Gibbsian fields. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, 24, 269–294.
Zhu, S. C. (1996). Statistical and computational theories for image segmentation, texture modeling and object recognition. Unpublished Ph.D. dissertation, Harvard University.
Zhu, S. C., & Mumford, D. B. (1997). Learning generic prior models for visual computation. In Proc. of Int'l Conf. on Computer Vision and Pattern Recognition. Puerto Rico.
Zhu, S. C., Wu, Y. N., & Mumford, D. B. (1996). FRAME: Filters, random fields and maximum entropy, toward a unified theory for texture modeling. In Proc. of Int'l Conf. on Computer Vision and Pattern Recognition. San Francisco.

Received April 10, 1996; accepted March 27, 1997. Address correspondence to [email protected].
NOTES
Communicated by Shun-ichi Amari
A Local Learning Rule That Enables Information Maximization for Arbitrary Input Distributions

Ralph Linsker
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
This note presents a local learning rule that enables a network to maximize the mutual information between input and output vectors. The network's output units may be nonlinear, and the distribution of input vectors is arbitrary. The local algorithm also serves to compute the inverse C^{−1} of an arbitrary square connection weight matrix.

1 Introduction

Unsupervised learning algorithms that maximize the information transmitted through a network (the infomax principle; Linsker, 1988) typically require the computation of a function that is, at first appearance, highly nonlocal. This occurs because the units perform gradient ascent in a quantity that is essentially the joint entropy of the output values. If the input distribution is multivariate gaussian and the network mapping is linear, then (Linsker, 1992) the output entropy is proportional to ln det Q, where Q is the covariance matrix of the output values. If the input distribution is arbitrary and the network mapping is nonlinear but constrained to be invertible (Bell & Sejnowski, 1995), then the output entropy contains the term ln |det C|, where C is the (assumed square) matrix of feedforward connection strengths. In the latter case, the inverse of the connection strength matrix, (C^T)^{−1}, must be computed. Unless a local infomax learning rule exists, one that makes use only of quantities available at the connection being updated, it is implausible that a biological system could implement this optimization principle.

Linsker (1992) showed how to perform gradient ascent in ln det Q using a local algorithm. That result was embedded within the context of a linear network operating on gaussian-distributed input with additive noise. However, the essential features of that local algorithm can be applied to a broader class of information-maximizing networks, as shown here.

This note uses a familiar network architecture, comprising feedforward and recurrent lateral connections, in which the lateral connections change according to an anti-Hebbian rule. This network computes the gradient of ln det Q and, in so doing, also computes the inverse (C^T)^{−1} of an arbitrary square matrix of feedforward weights. The algorithm is entirely local; the
computation of each required quantity for each connection uses only information available at that connection. By using this local algorithm in conjunction with the method of Bell and Sejnowski (1995), one obtains a local learning rule for information maximization that applies to arbitrary input distributions, provided the network mapping is constrained to be invertible.

In an alternative method for entropy-gradient learning (Amari, Cichocki, & Yang, 1996), the need to compute the inverse of the connection weight matrix is avoided by defining the weight change to be proportional to the entropy gradient multiplied by C^T C. In that case, lateral connections are absent, but both a feedforward and a feedback pass through the connections C are required.

2 The Network

Consider a network with input vector x and feedforward connection matrix C that computes u ≡ Cx. Each component of u may be passed through a nonlinear function such as a sigmoid, but the nature of this "squashing" function is irrelevant here. See Bell and Sejnowski (1995) for a discussion of the role of this nonlinearity in information maximization.

Lateral connections, between each pair of output units, are described by a matrix F. They are used recursively to compute the auxiliary vector v(t) = Cx + Fv(t − 1) at time steps t = 1, 2, . . . , where v(0) = Cx. For fixed x, v(t) converges to v = Cx + Fv, that is,

v = (I − F)^{−1} Cx,
(2.1)
provided all eigenvalues of F have absolute value < 1.

3 Local Learning Rule

We use this network to compute (C^T)^{−1} for a square matrix C as follows. Define q = ⟨xx^T⟩ and Q = ⟨uu^T⟩ = CqC^T, where ⟨.⟩ denotes an ensemble average and superscript T denotes transpose. Assume that the ensemble of input vectors x spans the input space; then q is a positive definite matrix. (If x is defined so that ⟨x⟩ = 0, then q and Q are the covariance matrices of the input and output vectors, respectively.)

Suppose for the moment that the lateral weights F can be made to satisfy F = I − αQ (for given C). (Footnote 1: α is a positive number chosen to ensure that the eigenvalue condition on F (above) is satisfied. In order to ensure convergence of the recursively computed vector v, it is necessary and sufficient to choose 0 < α < 2/Q+, where Q+ is the largest eigenvalue of Q. See Linsker (1992) for more detail, including how to choose α to optimize the convergence rate of v.) Then Q^{−1} = α(I − F)^{−1}, whence I = α(I − F)^{−1} CqC^T,
yielding

(C^T)^{−1} = α(I − F)^{−1} Cq = α⟨vx^T⟩,
(3.1)
where equation 2.1 has been used at the last step. This is a local network computation, since the component of the matrix inverse at connection i → n, namely, (C^{−1})_{in} = ⟨v_n x_i⟩, is computed using only quantities available at the two ends of that connection.

The lateral weights F achieve the required values by means of either of the following local algorithms. (1) For given C, one presents a set of input vectors x, computes the corresponding u vectors, and sets F = I − α⟨uu^T⟩ (= I − αQ). However, since the C values will themselves be changing during a training process, it is computationally inefficient to hold C fixed while presenting an ensemble of input vectors at each C. (2) It is more practical to let C evolve slowly and incrementally update F according to ΔF = β(−αuu^T + I − F), where β is a learning rate parameter for the lateral connections. This is an anti-Hebbian learning rule. It is a local computation, since the change at each lateral connection m → n, namely, ΔF_{nm} = β(−αu_n u_m + δ_{nm} − F_{nm}), depends only on information available at that connection.

The network operation is summarized as follows: Vector x is presented and held at the inputs. The linear output u_n = Σ_i C_{ni} x_i is stored at each output unit n. For an incremental learning rule, each pair of outputs u_n and u_m is used to compute the change ΔF_{nm} in the weight of the lateral connection m → n. The outputs v_n(t) are iteratively computed until asymptotic values v_n are obtained. For each feedforward connection i → n, the input x_i and output v_n are used to compute the component [(C^T)^{−1}]_{ni} for that connection.

One combines this calculation with the method of Bell and Sejnowski (1995) to yield a completely local learning rule for arbitrary input distributions (assuming invertible C) as follows: The already-stored value of u_n and the input x_i are used to compute the remaining term h(u_n) x_i in Bell and Sejnowski's feedforward weight update expression, which has the form ΔC_{ni} ∝ [(C^T)^{−1}]_{ni} + h(u_n) x_i, where h is a nonlinear function of a unit's output value.

Following the application of the learning rule for C, whether that of Bell and Sejnowski (1995) or some other rule requiring the computation of the feedforward matrix inverse, a new input vector x is presented, and the previous steps are repeated until C converges.

Further comments relating the present method to that of Linsker (1992):

1. Relation between (C^T)^{−1} and the gradient ∂(ln det Q)/∂C: We have

(1/2) ∂(ln det Q)/∂C = Q^{−1} Cq = (CqC^T)^{−1} Cq = (C^T)^{−1}.   (3.2)

A derivation of the first equality is given in Linsker (1992); it requires that Q be positive definite, which is true for nonsingular C.
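For concreteness, the following sketch runs the complete loop described above on random nongaussian inputs. The learning rates, the inner-loop length, and the use of h(u) = −2 tanh(u) (the Bell-Sejnowski term for a tanh squashing function) are illustrative choices, not prescriptions from this note:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
C = np.eye(N) + 0.1 * rng.standard_normal((N, N))   # feedforward weights
F = np.zeros((N, N))                                # lateral weights
alpha, beta, eta = 0.1, 0.5, 0.01

for step in range(2000):
    x = rng.laplace(size=N)                    # nongaussian input vector
    u = C @ x                                  # linear outputs
    # anti-Hebbian update driving F toward I - alpha*Q
    F += beta * (-alpha * np.outer(u, u) + np.eye(N) - F)
    # recursive lateral loop: v(t) = Cx + F v(t-1) -> (I - F)^{-1} C x
    v = u.copy()
    for _ in range(50):
        v = u + F @ v
    inv_CT = alpha * np.outer(v, x)            # one-sample estimate of (C^T)^{-1}
    # feedforward update of the Bell-Sejnowski form with h(u) = -2 tanh(u)
    C += eta * (inv_CT + np.outer(-2.0 * np.tanh(u), x))
```

Every quantity used to change a given weight is available at that weight's two ends, which is the point of the construction.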
2. Relation between the gradients of ln det Q and of ln |det C|: We have ln det Q = ln det(CqC^T) = 2 ln |det C| + ln det q. The last term is fixed by the input distribution. Therefore, ∂(ln det Q)/∂C = 2 ∂(ln |det C|)/∂C.

3. For the noiseless case treated here, the value of ∂(ln det Q)/∂C is independent of the input statistics (i.e., the matrix q). In an information-maximizing process in which the function being maximized is ln det(CqC^T) + S, where S may incorporate other terms (e.g., related to a nonlinear "squashing" function or a linear gain-control term, discussed later), the input statistics enter solely through their effect on S. This statement is not true when noise is present, as in Linsker (1992); then Q = CqC^T + (noise variance term), and q does not vanish from the gradient term ∂(ln det Q)/∂C.

4 Comparison of Two Instantiations of the Infomax Principle

The principle of maximum information preservation (infomax) states that an input-output mapping should be chosen from a set of admissible mappings so as to maximize the amount of Shannon information that the outputs jointly convey about the inputs (Linsker, 1988). The principle has been applied to linear maps with a gaussian input distribution (Linsker, 1989a, 1992; Atick, 1992), to nonlinear maps with arbitrary input distributions in which a single output node "fires" (Linsker, 1989b), and to nonlinear invertible maps with arbitrary input distributions and an equal number of inputs and outputs (Bell & Sejnowski, 1995).

We have shown that the essential features of the local learning rule used in Linsker (1992) also provide a local rule for the Bell and Sejnowski (1995) model. What are the essential similarities and differences between the operation of the infomax principle in these two models?
The nonlinear function provides the important benefit of being responsive to higher-order (nongaussian) statistics. Two points emerge from this comparison. First, the quantities being optimized in the two models are quite similar, although noise and nonlinearity each contributes distinctive and important terms to the learning rules. Second, it is the limitation of the output units’ dynamic range, whether by linear gain control or a nonlinear function, that prevents the weights and the information capacity from increasing without bound. Neither the presence of noise in the Linsker (1992) model nor the fact that the squashing function is nonlinear in Bell and Sejnowski (1995) is needed for the output entropy maximization process to be well defined. Understanding how different model elements affect infomax learning will be of value as information-maximizing principles are applied to more complex and biologically realistic networks. References Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21 March, 105–117. Linsker, R. (1989a). An application of the principle of maximum information preservation to linear systems. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 186–194). San Mateo, CA: Morgan Kaufmann. Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp., 1, 402–411. Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp., 4, 691–702. Received October 14, 1996; accepted May 12, 1997.
Communicated by Helge Ritter
Convergence and Ordering of Kohonen's Batch Map

Yizong Cheng
Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, U.S.A.
The convergence and ordering of Kohonen's batch-mode self-organizing map with Heskes and Kappen's (1993) winner selection are proved. Selim and Ismail's (1984) objective function for k-means clustering is generalized in the convergence proof of the self-organizing map. It is shown that when the neighborhood relation is doubly decreasing, order in the map is preserved. An unordered map becomes ordered when a degenerate state of ordering is entered, where the number of distinct winners is one or two. One strategy for entering this state is to run the algorithm with a broad neighborhood relation.

1 Introduction

Kohonen's self-organizing map adds a neighborhood relation to the cluster centers in competitive learning, so the data space is mapped onto a regularly arranged grid of cells. The mapping may distort the data space, but much of the local order is preserved.

The original formulation of the map consists of random drawing of examples from the data sample and incremental adjustments to the map parameters after each example is shown. When this process is modeled as a stochastic approximation, the step size α_i for parameter adjustment is said to have to satisfy the conditions Σ_i α_i = ∞ and Σ_i α_i² < ∞. However, because the map is useful as an algorithm running on a computer, and algorithms must terminate in a finite number of iterations, these conditions are difficult to observe in practice.

When the sample is finite or changes slowly, the average move of the parameter adjustment has been used to study the behavior of the self-organizing map. Such a study has been done by Erwin, Obermayer, and Schulten (1992), who conclude that "no strict proof of convergence is likely to be possible." They also show that ordering is a consequence of the occurrence of a specific permutation of the sample in the drawing sequence. During random drawing, all permutations of the sample will eventually occur, and thus ordering is achieved with probability one. However, the average time for a particular permutation to occur from this random drawing is exponential, and no execution of the algorithm can support such a running time.
Kohonen (1995) eventually proposed the batch version of his self-organizing map, using the mean shift instead of the average move of the batch as the adjustment of the parameters. Some variations and generalizations of the batch map have been proposed, for instance, by Mulier and Cherkassky (1995). Heskes and Kappen (1993) suggest a less straightforward winner-take-all step so that an energy function can be proposed for the process.

In this article, two modifications are made to Kohonen's batch map: (1) an adoption of the winner-take-all step suggested by Heskes and Kappen and (2) the abolition of neighborhood narrowing before the termination of the algorithm. Instead, the algorithm is run several times, with different but fixed neighborhood functions. Some of the results presented here also apply to the original Kohonen batch map.

In section 2, an energy function similar to the objective function in Selim and Ismail's (1984) treatment of k-means clustering is proposed. The algorithm is treated as alternating partial minimization of the energy function, and this allows an easy proof of its convergence. The general definition of the algorithm invites variations, although only a specific version is fully discussed here. In section 3, a condition called doubly decreasing is defined for the neighborhood function, and it is shown to be a sufficient condition for one-dimensional order preservation by the batch map. Section 4 discusses initialization of ordering with an unfocused map and refinement of the map by narrowing the neighborhood function on successive runs of the algorithm.

2 The Algorithm

The batch map consists of two finite sets: S, the map, and D, the sample. The self-organizing process is an evolution of the assignment of S members to D members. The goal of k-means clustering, which is a special case of the batch map, is to assign the same S member to nearby D members. The goal of the self-organizing map is to assign nearby S members to nearby clusters of D.

The nearness of the members of D or S is represented by considering them as subsets of some Euclidean spaces. To measure the nearness of S members, a nonnegative neighborhood function h: S × S → [0, ∞) is defined. A distance measure v: X × X → [0, ∞) is also defined over the data space X that includes D. D, S, h, and v are given and fixed, and the parameters that evolve through self-organization are a membership function u: S × D → [0, 1], with the constraint that

Σ_{s∈S} u(s, x) = 1 for all x ∈ D,   (2.1)

and a mapping z: S → X. The self-organizing algorithm is the alternating
partial minimization of the energy function

f(u, z) = Σ_{s,t∈S} Σ_{x∈D} u(s, x) h(s, t) v(z(t), x).   (2.2)
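The energy is easy to evaluate directly, which is useful for checking that each step of the algorithm below indeed decreases it. A minimal sketch (the array layout and the squared-distance choice of v are illustrative assumptions):

```python
import numpy as np

def energy(u, z, D, h):
    """f(u, z) = sum_{s,t in S} sum_{x in D} u(s,x) h(s,t) v(z(t),x),
    with v(xi, x) = ||xi - x||^2.
    u: (|S|, |D|) membership; z: (|S|, d) map; D: (|D|, d) data;
    h: (|S|, |S|) neighborhood matrix."""
    dist = ((z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)  # (|S|, |D|)
    return float(np.sum(u * (h @ dist)))
```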
Given the current z, a u that minimizes f (u, z) is found, and based on this u, a new z that minimizes f (u, z) is found. This cycle is continued until f (u, z) is no longer decreasing. Algorithm 1.
Repeat steps 1 and 2 until f (u, z) stops decreasing.
Step 1. For each x ∈ D, select an s ∈ S that minimizes c(s, x) = Σ_{t∈S} h(s, t) v(z(t), x), and make u(s, x) = 1 and w(x) = s.
Step 2. For each s ∈ S, let z(s) = ξ, where ξ minimizes

G(ξ) = Σ_{t∈S} h(t, s) Σ_{w(x)=t} v(ξ, x).   (2.3)
When v(ξ, x) = ‖ξ − x‖², this step is

z(s) = Σ_{x∈D} h(w(x), s) x / Σ_{x∈D} h(w(x), s).
(2.4)
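A minimal sketch of one iteration of algorithm 1 with v(ξ, x) = ‖ξ − x‖², using the same array layout as the energy sketch above:

```python
import numpy as np

def batch_map_step(z, D, h):
    """One pass of algorithm 1: Heskes-Kappen winner selection minimizing
    c(s, x) = sum_t h(s,t) v(z(t), x), then the weighted-mean update of
    equation 2.4."""
    dist = ((z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)  # (|S|, |D|)
    w = np.argmin(h @ dist, axis=0)            # winner w(x) for each x
    hw = h[w, :]                               # (|D|, |S|): h(w(x), s)
    z_new = (hw.T @ D) / hw.sum(axis=0)[:, None]               # eq. 2.4
    return z_new, w
```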
There are two significant differences between algorithm 1 and the batch map proposed in Kohonen (1995). First, Kohonen's algorithm simply assigns the s ∈ S with z(s) nearest to x ∈ X as the "winner" w(x). Second, the neighborhood function h narrows during the execution of Kohonen's algorithm. The reason that Heskes and Kappen's winner-take-all step is used in algorithm 1 is the concern about convergence raised by Erwin et al. (1992). In section 4, it will be demonstrated that order can be initialized and refined using different neighborhood functions on different runs of algorithm 1; thus neighborhood narrowing is factored out of the algorithm itself, to make the convergence proof more straightforward.

Theorem 1.
Algorithm 1 terminates in finitely many steps.
Proof. First, we prove that step 1 is a partial minimization of the energy function f(u, z) with z fixed. The energy function is f(u, z) = Σ_{s∈S} Σ_{x∈D} u(s, x) c(s, x). Suppose s ∈ S minimizes c(s, x) for a particular x ∈ D, but u(s, x) < 1. Then there must be a t ≠ s such that u(t, x) > 0. Let u′ be the same as u except u′(s, x) = u(s, x) + u(t, x) and u′(t, x) = 0. Then f(u, z) − f(u′, z) = u(s, x)c(s, x) + u(t, x)c(t, x) − u′(s, x)c(s, x) = u(t, x)[c(t, x) − c(s, x)] ≥ 0. Hence u′ is a choice at least as good as u.

Once a u is chosen in step 1, the energy function becomes f(u, z) = Σ_{x∈D} Σ_{t∈S} h(w(x), t) v(z(t), x), and u is no longer a variable. We have to
determine z(t) for every t ∈ S so that f(u, z) is minimized. In f(u, z), the terms involving each z(t) can be separated and summed up to G(z(t)) = ∑_{x∈D} h(w(x), t) v(z(t), x). Hence for each t ∈ S, z(t) should be the ξ that minimizes G(ξ) = ∑_{x∈D} h(w(x), t) v(ξ, x), and this is the same as step 2.

During each iteration of algorithm 1, a winner assignment is chosen in step 1. This choice determines a z from step 2 and a lower value of f(u, z). Since there are no more than |S|^|D| different winner assignments, algorithm 1 terminates in finitely many steps.

3 Order Preservation

Algorithm 1 is the iteration of two partial minimization steps. The current z determines the next w, which in turn determines the next z. To show that once the map is ordered, the order is preserved, we must show that an ordered z leads to an ordered w, and an ordered w leads to an ordered z. Ordering is well defined only in one-dimensional cases. It is often assumed that the topology preservation of a multidimensional map on multidimensional data is a natural consequence of one-dimensional order preservation.

In the discussion of ordering, we assume that S is one-dimensional, with si < sj when i < j. We assume that D members are aligned in X, a Euclidean space. We assume that equation 2.4 is the adjustment formula for z in step 2 of algorithm 1. Because each z(si) is a weighted average of D members, the z(si) must be aligned too. We say that w is ordered if, in one of the two directions in the one-dimensional subspace spanned by D, x < y implies w(x) ≤ w(y). We say that z is ordered if i < j implies z(si) ≤ z(sj), in a direction in X. The proofs of the order preservation results in this section require a property of the neighborhood function, defined below.

Definition 1. A function h: [0, ∞) → [0, ∞) is said to be doubly decreasing if h(a) is decreasing and, for any a, b, c ≥ 0 with h(a + b + c) > 0,

h(a)/h(a + b) ≤ h(a + c)/h(a + b + c).    (3.1)
When h is differentiable, the condition in equation 3.1 is equivalent to the condition that both h(a) and d log h(a)/da are decreasing. With 0 < r < 1 and 0 < p, a group of exponential-decay-type functions is doubly decreasing:

h(a) = r^(a^p).    (3.2)
A special case of this group is the gaussian function,

h(a) = e^(−a²/σ²).    (3.3)
Also, the triangle function,

h(a) = 1 − a/σ for a < σ, and h(a) = 0 for a ≥ σ,    (3.4)

for σ > 0, is doubly decreasing.
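The condition in equation 3.1 is easy to probe numerically. The following sketch brute-forces it, in cross-multiplied form, on a grid of nonnegative a, b, c for the gaussian and triangle families; the grid resolution and tolerance are arbitrary choices.

import numpy as np

def is_doubly_decreasing(h, grid, tol=1e-12):
    """Check h(a) h(a+b+c) <= h(a+c) h(a+b), the cross-multiplied form of
    equation 3.1, over all a, b, c in the grid with h(a+b+c) > 0."""
    for a in grid:
        for b in grid:
            for c in grid:
                if h(a + b + c) > 0:
                    if h(a) * h(a + b + c) > h(a + c) * h(a + b) + tol:
                        return False
    return True

grid = np.linspace(0.0, 3.0, 13)
gauss = lambda a, sigma=1.0: np.exp(-a**2 / sigma**2)       # equation 3.3
triangle = lambda a, sigma=2.0: max(0.0, 1.0 - a / sigma)   # equation 3.4
print(is_doubly_decreasing(gauss, grid))     # True
print(is_doubly_decreasing(triangle, grid))  # True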
Lemma 1. Let si < sj when i < j. Let h be doubly decreasing. Then, for each q,

ri = h(|si − sq|) / h(|si − sq+1|)    (3.5)

is a decreasing sequence.

Proof. Let i < j, a = sj − si, and c = sq+1 − sq. We show that ri ≥ rj in three cases—i < j ≤ q, q + 1 ≤ i < j, and i = q and j = q + 1—and thus show that ri is decreasing. Case 1: Let b = sq − sj. Then ri ≥ rj is h(a + b)/h(a + b + c) ≥ h(b)/h(b + c). Case 2: Let b = si − sq+1, and ri ≥ rj is h(b + c)/h(b) ≥ h(a + b + c)/h(a + b). The inequalities in these two cases are rearrangements of the one in equation 3.1. Case 3: ri ≥ rj is h(0)/h(c) ≥ h(c)/h(0), which is true because h is decreasing.

Lemma 2. Let ai ≥ 0 and bi ≥ 0 for i = 1, . . . , n, and ∑_{i=1}^n ai = ∑_{i=1}^n bi. Let c1 ≤ c2 ≤ · · · ≤ cn. If there is a k such that ai ≥ bi when i ≤ k and ai ≤ bi when i > k, then ∑_{i=1}^n ai ci ≤ ∑_{i=1}^n bi ci.
Proof. We must have ∑_{i=1}^k (ai − bi) = ∑_{i=k+1}^n (bi − ai). Therefore,

∑_{i=1}^n bi ci − ∑_{i=1}^n ai ci = ∑_{i=k+1}^n (bi − ai) ci − ∑_{i=1}^k (ai − bi) ci
≥ ∑_{i=k+1}^n (bi − ai) ck − ∑_{i=1}^k (ai − bi) ck = 0.    (3.6)
Theorem 2. Let the neighborhood function h(s, t) = h(|s − t|) be doubly decreasing. If, prior to step 2 of algorithm 1, w is ordered, then after step 2, z is ordered.

Proof. Let D be x1 < x2 < · · · < xn and w(xi) ≤ w(xj) when i < j. Let z(sq) = ∑_{i=1}^n ai xi and z(sq+1) = ∑_{i=1}^n bi xi. When bi > 0, ai ≥ bi is equivalent to

ri = h(w(xi), sq) / h(w(xi), sq+1) ≥ ∑_{i=1}^n h(w(xi), sq) / ∑_{i=1}^n h(w(xi), sq+1) = K.    (3.7)
If we can show that ri ≥ rj when i < j, then there must be a k such that ri ≥ K when i ≤ k and ri ≤ K when i > k, which is the condition of lemma 2. Using lemma 2, we have z(sq) ≤ z(sq+1) for all q, and the theorem is proved. To show that ri is monotone decreasing, first we notice that because w is ordered, the problem is equivalent to showing that r*i = h(si, sq)/h(si, sq+1) is decreasing, which is a consequence of lemma 1.

Theorem 3. Let the neighborhood function h(s, t) = h(|s − t|) be doubly decreasing. If, prior to step 1 of algorithm 1, z is ordered, then after step 1, the minima mq of cq(x) = ∑_i h(sq, si)(x − z(si))² are ordered, in the sense that mp ≤ mq when p < q.

Proof. The minimum of cq(x) is mq = ∑_i h(sq, si) z(si) / ∑_i h(sq, si). Let mq = ∑_i ai z(si) and mq+1 = ∑_i bi z(si) with ∑_i ai = ∑_i bi = 1, and ai, bi ≥ 0. Then ai ≥ bi is equivalent to

ri = h(sq, si) / h(sq+1, si) ≥ ∑_i h(sq, si) / ∑_i h(sq+1, si) = K.    (3.8)
Lemma 1 says that ri is decreasing. This means that ai ≥ bi when i ≤ k and ai ≤ bi for i > k, for some k, and lemma 2 says that mq ≤ mq+1.

Theorems 2 and 3 show that algorithm 1 preserves order in the self-organizing map under a doubly decreasing neighborhood function, except for a gap between the ordered minima of the cp(x)'s and the ordered w. First, this gap would not be there if the original Kohonen batch map were the algorithm. In that case, Voronoi regions based on the distance in space X are used to determine the winner, and an ordered z naturally implies an ordered w. Second, under some mild additional assumptions, the gap can be closed. One of these assumptions is that ∑_{t∈S} h(s, t) be independent of s. In this case, each pair of cp(x)'s has exactly one intersection, and it is always the case that when x is larger than the intersection, the cp(x) with the larger mp has the lower value and may win the competition. This assumption is satisfied when the map is periodic, for example, as a ring of evenly spaced points.

4 Order Initialization and Fine Tuning

The preceding section shows that algorithm 1 preserves existing order. The main effect of the algorithm is the improvement of the distribution of the map images. Using different neighborhood functions, an ordered map can be made more uniformly distributed or more focused. A more critical problem is how the order is initialized in the first place.

The strategies discussed in this section rely on two degenerate cases of order, in which the set {w(x); x ∈ D} has only one or two elements. Some of
Figure 1: cp(x) parabolas with z(si)'s as labeled boxes and D points as circles at their lowest cp(x) values. The upper row of labeled boxes indicates the positions of the initial, unordered z(si)'s, with i as the labels in the boxes. The lower row of boxes indicates the ordered z(si)'s after just one iteration of algorithm 1. This next z is also the final configuration when algorithm 1 terminates.
the strategies are immediately applicable to the original Kohonen batch map. One strategy is to make all initial z(s)'s less than all x ∈ D. In this case, there is one winner for all D members, and z is considered ordered. Another strategy is to place the initial z(s)'s between two neighboring x ∈ D, so that there are only two winners, which makes w ordered.

The situation becomes more complex when algorithm 1 is used. The two strategies in the preceding paragraph still work to a certain extent. But a third strategy is more general and explains how an unordered map turns into an ordered one. This strategy is to run algorithm 1 with a broad neighborhood function. This basically means choosing a large σ in the gaussian function (see equation 3.3) or the triangle function (see equation 3.4). Indeed, a σ can be found such that the difference between all applicable h values is as small as desired. In this case, all of the parabolas cp(x) = c(sp, x) have their symmetry centers as close to each other as desired. Those with smaller ∑_i h(sp, si) will grow more slowly and have lower minima. When the neighborhood function is broad enough, in the interval spanned by D, only one or two of these parabolas will be the lowest ones, and only one or two winners will be generated.
Figure 2: Algorithm 1 on a one-dimensional ring over a randomly generated D in R², using the gaussian neighborhood function of equation 3.3. (a) X is the unit square. D is 20 randomly generated points, denoted by squares. S is a ring of 30 points. The distance between two S points is the minimum number of hops one must cross on the ring from one to the other. Initially, the z(s) were randomly generated and are indicated with small circles connected into a ring, reflecting the ring topology of S. (b) Algorithm 1 was applied with σ = 8 in equation 3.3. It took six iterations to reach this fixed point. (c) Then algorithm 1 was applied to the resulting map from (b), with a narrower σ = 1. Two iterations were needed before this fixed point was reached. (d) Finally, σ = 0.1 was used, and two more iterations led to this configuration.
These winners often are the extreme members of S, because their corresponding ∑_i h(sp, si) are the smallest. When this happens, w is ordered. Figure 1 shows an example when z is unordered but w becomes
ordered, because only two of these parabolas have the lowest value in the D range. There is a significant difference between the order initialization in one iteration of algorithm 1 and the one proved in Erwin et al. (1992), which takes exponential time to accomplish.

To make the map more focused and more uniformly distributed, a few successive runs of algorithm 1 with narrower neighborhood functions are enough. This is demonstrated using Figure 2. Figure 2 shows the order initialization and fine tuning with a periodic one-dimensional S of 30 members and a two-dimensional D with 20 members. Starting with a randomly generated z (circles connected with line segments indicating their neighboring relationship in Figure 2a), a broad gaussian neighborhood function is used to initialize the order. The result is a miniature copy of the S structure at the centroid of D (see Figure 2b). Then two runs of the algorithm with narrower neighborhood functions fine-tune the map so it is focused. The total number of iterations in all three runs of algorithm 1 is 10.

5 Concluding Remarks

The convergence and ordering of Kohonen's batch map with Heskes and Kappen's winner selection are discussed and proved under some additional conditions, including the doubly decreasing property of the neighborhood function. The ordering results are applicable to the original Kohonen batch map. The general form of algorithm 1 also invites speculation on other variations of the batch map. The constraint on the membership function u can be replaced, and fuzzy winners or multiple winners will emerge. When v(x, y) = ‖x − y‖, a median-shift instead of mean-shift batch map is created, and a more robust and more uniformly distributed map may be obtained. (A separate article will deal with the case when the data space is a sphere.)

Ordering is only the one-dimensional case of topology preservation and does not account for everything that makes a self-organizing map attractive and useful. Some measures designed for multidimensional topology preservation have been proposed, and topology preservation of algorithm 1 under measures other than one-dimensional order has yet to be studied.

References

Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Heskes, T. M., & Kappen, B. (1993). Error potentials for self-organization. In Proceedings of ICNN'93 (pp. 1219–1233). San Francisco.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7, 1165–1177.
Selim, S. Z., & Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Analysis and Machine Intelligence, 6, 81–86.

Received August 26, 1996; accepted March 14, 1997.
LETTERS
Communicated by Jack Cowan
Solitary Waves of Integrate-and-Fire Neural Fields David Horn Irit Opher School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
Arrays of interacting identical neurons can develop coherent firing patterns, such as moving stripes that have been suggested as possible explanations of hallucinatory phenomena. Other known formations include rotating spirals and expanding concentric rings. We obtain all of them using a novel two-variable description of integrate-and-fire neurons that allows for a continuum formulation of neural fields. One of these variables distinguishes between the two different states of refractoriness and depolarization and acquires topological meaning when it is turned into a field. Hence, it leads to a topologic characterization of the ensuing solitary waves, or excitons. They are limited to pointlike excitations on a line and linear excitations, including all the examples noted above, on a two-dimensional surface. A moving patch of firing activity is not an allowed solitary wave on our neural surface. Only the presence of strong inhomogeneity that destroys the neural field continuity allows for the appearance of patchy incoherent firing patterns driven by excitatory interactions.

1 Introduction

Can one construct a self-consistent description of neuronal tissue on the mm scale? If so, can one use for this description the same variables that characterize a single neuron? The answers to these questions are not obvious. Assuming they are affirmative, we are led to a theory of neural fields, describing the characteristic features of neurons at a given location and specific time. This necessitates continuity of neural variables, which may be quite natural in view of the large overlap between neighboring neurons and of the functional maps observed in different cortical areas, which show that nearby stimuli affect neighboring neurons.

A model of neuronal tissue composed of two aggregates (excitatory and inhibitory) was proposed long ago by Wilson and Cowan (1973). They described three different dynamical modes that correspond to different forms of connectivity. One of their interesting observations is the formation of pairs of unattenuated traveling waves that move in opposite directions as a result of a localized input. The propagation velocity depends on the interactions and the amount of disinhibition in the network. Building on this approach,
Neural Computation 9, 1677–1690 (1997)
© 1997 Massachusetts Institute of Technology
Amari (1977) analyzed pattern formation in continuum distributions of neurons, or neural fields. He pointed out the possibility of local excitation solutions and derived the conditions for their formation in one dimension. Ermentrout and Cowan (1979) studied layers of excitatory and inhibitory neurons interacting with one another in two dimensions and obtained the formations of moving stripes, or rolls, hexagonal lattice patterns, and other forms. They pointed out that if their model is applied to V1, it can provide an explanation of drug-induced visual hallucinations (Klüver, 1967), relying on the retinocortical map interpretation of V1. A simpler derivation of this result is provided by Cowan (1985), who considered a single type of neural field on a two-dimensional manifold using a difference-of-gaussians (DOG) (or "Mexican hat") interaction with close-by strong excitation surrounded by weak inhibition. Similar questions were recently investigated by Fohlmeister, Gerstner, Ritz, and van Hemmen (1995), who have structured their neural layer according to the spike response model of Gerstner and van Hemmen (1992). Their simulations exhibit stripe formations, rotating spirals, expanding concentric rings, and collective bursts.

The simulations of Fohlmeister et al. (1995) are an example of how the advent of large-scale computations allows one to attack elaborate neuronal models in the simulation of cortical tissues. This point was made by Hill and Villa (1994), who have studied two-dimensional arrays of 10,000 neurons and obtained firing patterns in the form of clusters or patches. Moving patches were also observed by Usher, Stemmler, Koch, and Olami (1994), who investigated a plane of integrate-and-fire neurons with DOG interactions.

This brings us to an interesting question. To what extent can one expect calculations on a grid of 100 × 100 neurons to reflect the behavior of a neural tissue that contains several orders of magnitude more neurons? Clearly, if the two questions raised at the beginning of this article are answered in the affirmative, there is a good chance of obtaining meaningful results. In this article, we will characterize allowed excitation formations that obey continuity requirements and show what happens when continuity is broken.

Some of the structures found in the different investigations move in a fashion that conserves their general form unless they hit and merge with or destroy one another. This is reminiscent of the particle property of solitons, which are known to arise in nonlinear systems (see, e.g., Newell, 1985). Yet some differences exist. In particular, the coherent firing structures annihilate during collision rather than staying intact. Therefore, an appropriate characterization is that of solitary waves (Meron, 1992). We propose using the term excitons to describe moving solitary waves in an excitable medium.

Our study is geared toward investigating these structures. We will introduce a topological description that will help us identify the type of coherent structures that are allowed under our assumptions. In particular, we will see that an object with the topological characteristics of a moving circle (patch of firing activity) is not an allowed solitary wave on a two-dimensional manifold.
2 Integrate-and-Fire Neurons

Integrate-and-fire (I & F) neurons have been chosen for simulating large systems of interacting neurons by many authors (e.g., Usher et al., 1994). For our purpose we need a formulation of this system in which the basic variables are continuous and differentiable. We use two such variables: v, which is a subthreshold potential, and m, which distinguishes between two different modes in the dynamics of the single neuron, the active depolarization mode and the inactive refractory period:

v̇ = −kv + α + cmv + mI    (2.1)
ṁ = −m + Θ(m − v).    (2.2)
Θ(x) is the Heaviside step function. The neuron is influenced by a constant external input I, which is absent in the absolute refractory period, when m = 0. Starting out with m = 1, the total time derivative of v is positive, and v follows the dynamics of a charging capacitor. Hence, this represents the depolarization period of v. During all this time, since v < m, m stays unchanged. The dynamics change when v reaches the threshold, which is arbitrarily set to 1. Then m decreases rapidly to zero, causing the time derivative of v to be negative, and v follows the dynamics of a discharging capacitor. Parameters are chosen such that the time constants of the charging and discharging periods are different. To complete this description of an I & F neuron, we need a quantity that represents the firing of the neuron. We introduce for this purpose

f = m(1 − m)v,    (2.3)

which vanishes at almost all times except when v arrives at the threshold and m changes from 1 to 0.¹ This can serve therefore as a description of the action potential. An example of the dynamics of v and m is shown in Figure 1. In a third frame we plot v + af, with a = 8, representing the total soma potential. The value of a is of no consequence in our work. It is used here for illustration purposes only.

Our description is different from the FitzHugh-Nagumo model (FitzHugh, 1961), which is also a two-variable system. Their variables correspond to a two-dimensional projection of the Hodgkin-Huxley equations, and their nonlinearity is of third power rather than a step function. We believe that for the purpose of a description that leads to an intuitive insight, it is important to use variables whose meaning is clear, as long as they can lead to a coarse reconstruction of the neurons' behavior.

¹ f gets a small contribution also when m changes from 0 to 1, but since it is several orders of magnitude smaller, it does not have any computational consequences. An alternative choice of f can be proportional to −(dm/dt) θ(−dm/dt).
Figure 1: The dynamics of the single I & F neuron. The upper frame displays v, the subthreshold membrane potential, as a function of time. The second frame shows m, the variable that distinguishes between the depolarization state, m = 1, and refractoriness, m = 0. In the third frame we plot v + 8 f , where f is our spike profile, to give a schematic presentation of the total cell potential. Parameters for this figure, as well as all one-dimensional simulations, are: k = 0.015, α = −0.02, c = 0.0135, and I = 0.05.
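A minimal forward-Euler rendering of equations 2.1 through 2.3, using the parameter values quoted in the caption of Figure 1, could read as follows; the step size dt and the duration are assumptions, since the discretization is not specified in the text.

import numpy as np

def simulate_neuron(T=250.0, dt=0.05, k=0.015, alpha=-0.02, c=0.0135, I=0.05):
    """Forward-Euler integration of the two-variable I & F unit:
    v' = -k v + alpha + c m v + m I,   m' = -m + Theta(m - v),
    with the firing profile f = m (1 - m) v (equations 2.1-2.3)."""
    n = int(T / dt)
    v = np.zeros(n)
    m = np.zeros(n)
    f = np.zeros(n)
    m[0] = 1.0                                   # start in depolarization mode
    for t in range(n - 1):
        theta = 1.0 if m[t] - v[t] >= 0 else 0.0          # Heaviside step
        v[t + 1] = v[t] + dt * (-k * v[t] + alpha + c * m[t] * v[t] + m[t] * I)
        m[t + 1] = m[t] + dt * (-m[t] + theta)
        f[t + 1] = m[t + 1] * (1 - m[t + 1]) * v[t + 1]   # spikes near m = 1/2
    return v, m, f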
3 I & F Neurons on a Line

At this point we introduce interactions between the neurons and see what kind of activity patterns emerge. As we are dealing with pulse coupling, an interaction is evoked by spiking of other neurons, with or without delays. Thus, equation 2.1 is replaced by

v̇i = −kvi + α + cmivi + mi(I + ∑_j Wij fj),    (3.1)
where i = 1, . . . , N represents the location of the neuron. Clearly the behavior of such a system of interacting neurons is strongly dependent on the properties of the interaction matrix W. There are indications in V1 that excitation occurs between neurons that lie close by (Marlin, Douglas, & Cynader, 1991) and that inhibition is the rule further out (Wörgötter, Niebur, & Koch, 1991), leading to a type of center-surround pattern of interactions. To be even closer to biological reality, we should introduce different descriptions for excitatory (pyramidal) neurons and inhibitory interneurons. For simplicity of presentation, we consider only one set of neurons undergoing both types of interactions. We may think of the neurons as pyramidal cells, with inhibition being an effective interaction.
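Extending the single-unit sketch above to the coupled system of equation 3.1 mainly adds the interaction term ∑_j Wij fj. In the sketch below, the center-surround matrix W is an illustrative stand-in built from two gaussians with invented amplitudes and widths; the text does not give the one-dimensional coupling parameters.

import numpy as np

def simulate_line(N=100, T=250.0, dt=0.05, k=0.015, alpha=-0.02,
                  c=0.0135, I=0.05, seed=0):
    """Equation 3.1 on a line of N units:
    v_i' = -k v_i + alpha + c m_i v_i + m_i (I + sum_j W_ij f_j)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(N)
    d = np.abs(idx[:, None] - idx[None, :])
    # Center-surround coupling: short-range excitation, broader inhibition.
    W = 0.4 * np.exp(-d**2 / 5.0) - 0.1 * np.exp(-d**2 / 40.0)
    v = rng.uniform(0.0, 1.0, N)                # random initial conditions
    m = np.ones(N)                              # all units start depolarizing
    f = np.zeros(N)
    for _ in range(int(T / dt)):
        theta = (m - v >= 0).astype(float)      # elementwise Heaviside
        v = v + dt * (-k * v + alpha + c * m * v + m * (I + W @ f))
        m = m + dt * (-m + theta)
        f = m * (1 - m) * v
    return v, m, f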
Figure 2: Creation, annihilation, and motion of excitons in the x − t plane. For fixed x, the excitons (black lines, left frame) appear on the border between the m = 1 (black) and m = 0 (white) areas shown in the right frame.
In Figure 2 we show a spatiotemporal pattern that emerges from such a system. We assume here that the neurons are located on a line with open boundary conditions. The interaction is assumed to be simultaneous and to have a finite width on this line. In all our simulations, we start with random initial conditions for vi while all mi = 1. I has the same constant value for all neurons. After some transitional period, the system settles into a spatiotemporal pattern that is a limit cycle.

As the interaction strength is increased, the behavior of vi and mi becomes a smooth function of the location i of the neuron. When this happens, we may replace vi and mi by continuous fields v(x, t) and m(x, t). Our problem may then be redefined by a set of equations for continuous variables. For example, equation 3.1 becomes

dv/dt = −kv + α + cmv + m(I + ∫ w(x, y) f(y) dy),    (3.2)

and equations 2.2 and 2.3 remain valid for the continuous fields. Once we are in the continuum regime, the number of neuronal units that we use in our simulations is immaterial. They represent a discretized approximation to a continuum system. We then have a global description of a neural tissue in terms of two continuous variables. f(x, t) represents a volley of action potentials originated at time t by somas of neurons located at x. This is usually regarded as the interesting object to look at. As we will see, it forms one of the borderlines of regions of m in (x, t) space-time. We propose looking at m(x, t) in order to understand the emergent behavior of this system.

Looking again at Figure 2, we see characteristic solitary wave behavior. From time to time, pairs of excitons are created. In each pair, one moves to the right and the other to the left. The right moving exciton of the left pair
Figure 3: Two excitons propagating along a ring. The left frame shows the firing in the x − t plane. The right frame shows m(x) (solid line) and f (x) (dash-dotted line) at a fixed time step ( f was rescaled by a factor of 3 for the purpose of illustration). The arrows designate the direction of propagation. Note that for fixed x, firing occurs when m = 1 turns into m = 0. For fixed t, the m = 0 region lies behind the moving exciton.
collides with the left moving exciton of the right pair, and they annihilate each other. That this has to be the case follows from the fact that firing occurs when an m = 1 neighborhood turns into an m = 0 one. In the continuum limit, m(x, t) forms continuous regions with values close to either 1 or 0, thus providing us with a natural topologic argument: the borderline in (x, t) can represent either a moving exciton or the creation or annihilation of a pair of excitons.

Figure 3 shows an example in which we close the line of neurons to form a ring; that is, we use periodic rather than open boundary conditions. In this example, we observe continuous motion of two excitons around the ring. The slope in the left frame is determined by the propagation velocity of the exciton. As in other models of neural excitation (e.g., Wilson & Cowan, 1973), the velocity depends on the interaction matrix Wij. In our model, stronger excitation leads to higher velocities (milder slopes), and the span of interactions determines the number of excitons that coexist.

4 Excitons in Two-Dimensional Space

Let us turn now to two-dimensional space and investigate a square grid of I & F neurons. We will use either DOG interactions, Wij = CE exp(−d²ij/dE) − CI exp(−d²ij/dI), where dij represents the Euclidean distance between two neurons, or short-range excitatory connections. As will be demonstrated below, we obtain solitary waves that are lines or curves. This is to be expected since they should occur at the boundaries of m = 0 and m = 1 regions, which, in two dimensions, are one-dimensional structures.

We find a variety of such firing patterns, depending on the span and strength of the interactions. Isolated arcs are a pattern typical of large-
Figure 4: Small arcs obtained in the case of strong inhibition. CE = 0.4, CI = 0.12, dE = 5, and dI = 40, restricted to a 20 × 20 area around each neuron on a 60 × 60 grid. The left frame shows the excitons in the form of small arcs that are fronts of moving m patches shown in the right frame. This figure, as well as all other two-dimensional simulations, uses the parameters k = 0.45, c = 0.35, α = −0.09, and I = 0.29.
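The DOG coupling quoted in the caption follows directly from the formula Wij = CE exp(−d²ij/dE) − CI exp(−d²ij/dI) given in section 4. The sketch below builds the dense matrix for a 60 × 60 grid and implements the 20 × 20 restriction as a hard cutoff, which is one plausible reading of the caption; it ignores memory efficiency.

import numpy as np

def dog_weights(n=60, CE=0.4, CI=0.12, dE=5.0, dI=40.0, window=10):
    """W_ij = CE exp(-d_ij^2 / dE) - CI exp(-d_ij^2 / dI), with d_ij the
    Euclidean distance between sites of an n x n grid, zeroed outside a
    (2*window) x (2*window) area around each neuron."""
    ys, xs = np.divmod(np.arange(n * n), n)     # grid coordinates of each unit
    dx = np.abs(xs[:, None] - xs[None, :])
    dy = np.abs(ys[:, None] - ys[None, :])
    d2 = dx**2 + dy**2
    W = CE * np.exp(-d2 / dE) - CI * np.exp(-d2 / dI)
    W[(dx > window) | (dy > window)] = 0.0      # restriction window
    return W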
Figure 5: A periodic pattern of activity obtained on a 90 × 90 spatial grid, using long-range interactions: CE = 0.2, CI = 0.02, dE = 15, and dI = 100.
range interactions with (relatively) strong inhibition. An example is shown in Figure 4. Less inhibition leads to larger structures, as we can see in Figure 5. The emerging pattern is periodic in time and consists of several propagating waves that interact. It is evident that the activity begins at the corners of the grid. This is due to the open boundary conditions. They cause the corner neurons to be the least inhibited, thus rising first and exciting their
Figure 6: Rotating spirals obtained on a 60 × 60 grid using DOG interactions: CE = 0.4, CI = 0.08, dE = 5, and dI = 40, restricted to a 20 × 20 area around each neuron. The left frame (a) appears prior to the right frame (b) in the simulation.
neighbors. Under similar interactions using periodic boundary conditions, we get propagating parallel stripes, the cortical patterns that Ermentrout and Cowan (1979) suggested as V1 excitations corresponding to the spiral retinal forms that appear in hallucinations.

Spirals and expanding rings are well-known solitary wave formations (Meron, 1992) that are also obtained here. The former are displayed in Figure 6 and the latter in Figure 7. Both formations are often encountered in two-dimensional arrays of I & F neurons (Jung & Mayer-Kress, 1995; Milton, Chu, & Cowan, 1993). The example of expanding rings is obtained by keeping only a few nearest neighbors in the interaction. The number of expanding rings is inversely related to the span of the interactions. In the example shown in Figure 7, two spontaneous foci of excitation form expanding rings, or target formations. They are displayed in three consecutive time steps. Note that firing exists only at the boundary between m = 0 and m = 1 areas. This property is responsible for the vanishing of two firing fronts that collide, because, after collision, there remains only a single m = 0 area, formed by the merger of the two former m = 0 areas.

5 The Topologic Constraint

The assumption that continuous neural fields can be employed leads to very strong constraints. m(x, t) is a continuous function on the D-dimensional manifold R ∋ x. Let us denote regions in which m ≥ 1/2 by S1 and regions where m < 1/2 by S0. Clearly S0 + S1 = R. We have seen in Figure 1 that, in practice, m is very close to 1 throughout S1, and very close to 0 throughout S0. We have discussed moving solutions of firing patterns, which we called excitons.
Figure 7: Expanding rings formed on a 60 × 60 grid. The top frames show the m fields at three consecutive time steps. The bottom frames show the corresponding coherent firing patterns. In this simulation we used excitatory interactions only, coupling each neuron with its eight neighbors with an amplitude of 0.3.
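The expanding-ring setting of Figure 7 needs only nearest-neighbor excitation, so a single update step of the grid is easy to sketch; the two-dimensional parameters are those quoted in the caption of Figure 4, the coupling amplitude is the 0.3 of Figure 7, and the Euler step size is an assumption.

import numpy as np

def step_grid(v, m, f, dt=0.05, k=0.45, alpha=-0.09, c=0.35, I=0.29, g=0.3):
    """One Euler step of the grid equations with each neuron excited by
    its eight neighbors with amplitude g (open boundary conditions)."""
    fp = np.pad(f, 1)                           # zero padding at the borders
    nbr = sum(np.roll(np.roll(fp, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)
              if (dy, dx) != (0, 0))[1:-1, 1:-1]
    theta = (m - v >= 0).astype(float)
    v = v + dt * (-k * v + alpha + c * m * v + m * (I + g * nbr))
    m = m + dt * (-m + theta)
    f = m * (1 - m) * v
    return v, m, f

# Random initial v, all m = 1, as in the text:
# rng = np.random.default_rng(0)
# v, m, f = rng.uniform(size=(60, 60)), np.ones((60, 60)), np.zeros((60, 60))
# for _ in range(5000): v, m, f = step_grid(v, m, f)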
In all these solutions, the regions S1 and S0 change continuously with time. Since firing can occur only when a neuron switches from a state in S1 to one in S0, excitons are restricted to domains σ that lie on boundaries between these two regions. Hence, the dimension of σ has to be D − 1.

The situation is different for standing-wave solutions. These are solutions that vary with time, sometimes in a quite complicated fashion, but do not move around in space. They are high-order limit cycles. One encounters, then, a situation where a whole S1 region can switch into S0. This will lead to a spontaneous excitation over a region of dimension D. Starting with random initial conditions, we do not generate such solutions in general. However, for regular initial conditions, such as a checkerboard pattern of v = 1 and 0, such solutions emerge. The typical behavior in this case shows separation of space into two parts defined by the checkerboard pattern of initial conditions. While there is firing in one part, there is none in the other, and vice versa. The dynamics within each part can be rather complex and need not retain all the symmetry properties that were present
in the initial conditions. Since such solutions are not generated by random initial conditions, we conclude that they occupy a negligible fraction of the space of dynamical attractors.

The stability of a solution can be tested by seeing how it behaves in the presence of noise. Adding fluctuations to I, in both space and time, we can test the stability of the different solutions mentioned so far. The general pattern of behavior that we find is that for small perturbations, excitons change to a small extent, while standing-wave solutions disappear. This is to be expected in view of the fact that the standing-wave solutions required especially regular initial conditions.

Continuity of v and m is an outcome of the excitatory interactions of neurons with their neighborhoods. We have seen it in our simulations, for both the one- and two-dimensional manifolds. It is a reflection of a well-known property of clusters of I & F units: excitation without delays leads to coherence of firing (Mirollo & Strogatz, 1990). We may then argue that the topologic rule will hold for all I & F systems, even ones where the field m does not appear but v is reset to 0 after firing. The interactions lead to continuity of v. Then, even in the absence of m, there always exists a relative refractory period, when v is small—for example, v < ε—in which excitations of neighboring neurons are insufficient to drive v over the threshold in the next time step. In that case we can classify the manifold into S1 and S0 regions according to whether v is larger or smaller than ε. This leads to the same topologic consequences as the ones described above for our system.

6 Discussion

Refractoriness is a basic property of neurons. We have embodied it in our formulation in an explicit fashion that turned into a topologic constraint for a theory of I & F neural fields. The important consequence is that coherent firing patterns that are obtained as solitary waves in our theory have a dimension that is smaller by one unit than that of the manifold to which they are attached. Thus, we obtain pointlike excitons for neural fields on a line or ring and linear firing fronts on a two-dimensional surface.

The system that we discussed was always under the influence of some (usually constant) input I. Therefore, it formed a coupled oscillatory system. Alternatively, one may study a dissipative system of neural fields. This is an excitable medium that, in the absence of any input, will not generate excitons. However, given some initial conditions or some local input (Chu, Milton, & Cowan, 1994), it can generate target formations and spiral waves. The latter are fed by the strong interactions between the neurons. If the interactions are not strong enough, these excitons will die out. In the case of weak interactions, it helps to use noisy inputs (Jung & Mayer-Kress, 1995). The latter lead to background random activity that helps to maintain the coherent formation of spirals. Sometimes there is an isotropy-breaking element in the network, such as random connections or noise, that is responsible for
the abundance of spiral solutions. However, spirals can be obtained also in the absence of such elements, as is the case in our work. The details of what types of dynamic attractors are dominant depend on the type of network that one studies. Nonetheless, the refractory nature of I & F neurons guarantees that the topologic rule holds for all coherent phenomena. Once one arranges identical I & F neurons on a manifold with appropriate interactions, the continuity property of neural fields follows. These continuous fields lead to coherent neural firing of the type characterized by the excitons that are described in this article. Coherence is also maintained when we add synaptic delays that are proportional to the distance between neurons.

One may now ask what happens if the continuity is explicitly broken, for example, by strong noise in the input or by randomness in the synaptic connections. What we would expect in this case is that the DOG interactions specify the resulting behavior of the system. This is indeed the case, as demonstrated in Figure 8. The resulting firing behavior is quite irregular, yet it has a patchy character with a typical length scale that is of the order of the range of excitatory interactions. We believe that this is the explanation for the moving patches of activity reported by other authors. They are incoherent phenomena, emerging in models with randomly distributed radial connections, as in Hill and Villa (1994) and Usher et al. (1994). A coherent moving patch of firing activity is forbidden on account of the continuity requirement and refractoriness.²

We learn therefore that our model embodies two competing factors. The DOG interactions tend to produce patchy firing patterns, but the ensuing coherence of identical neurons leads to the formation of one-dimensional excitons on a two-dimensional manifold. If, however, strong fluctuations exist—the neurons can no longer be described by homogeneous physiological and geometrical properties—the resulting patterns of firing activity will be determined by the form of the interactions.

Abbott and van Vreeswijk (1993) have studied the conditions under which an I & F model with all-to-all coupling allows for stable asynchronous solutions. They concluded that if the interactions are instantaneous, the asynchronous state is unstable in the absence of noise. In other words, under these conditions, the system is synchronous. This obviously agrees with our results. However, they found that when the synaptic interactions have various temporal structures, asynchronous states can be stable. We, on the other hand, continue to obtain coherent solutions when we introduce delays in the synaptic interactions where the delays are proportional to distance. Clearly these two I & F models differ in both their geometrical and temporal structure. Our conclusions are limited to homogeneous I & F structures with

² A possible exception is the case of bursting neurons. In this case, sustained activity can be maintained for a short while even when the potential v reaches threshold. Hence, the firing fronts can acquire some width. The effect depends on the relation between the duration of the burst and the velocity of the exciton.
Figure 8: Incoherent firing patterns for (a) high variability of synaptic connections or (b) noisy input. In (a) we multiplied 75 percent of all synapses by a random gaussian component (mean = 1., S.D. = 3.) that stays constant in time. In (b) we employed a noisy input that varies in space and time (mean = 0.29, S.D. = 0.25). In both frames, the firing patterns are no longer excitons. We can see the formation of small clusters of firing neurons. The typical length of these patches is of the order of the span of excitatory interactions. This is a manifestation of the dominance of interactions in determining the spatial behavior in the absence of continuity that imposes the topologic constraint.
suitable interactions that lead to coherent behavior. Are there situations where coherent firing activity exists in neuronal tissue? If the explanation of hallucinatory phenomena (Ermentrout & Cowan, 1979) is correct, then this is expected to be the case. It could be proved experimentally through optical imaging of V1 under appropriate pharmacological conditions. Other abnormal brain activities, such as epileptic seizures, could also fall into the category of coherent firing patterns.

Does coherence occur also under normal functioning conditions? Arieli, Sterkin, Grinvald, and Aertsen (1996) have reported interesting spatiotemporal evoked activity in areas 17 and 18 in the cat. Do the underlying neurons fire coherently? Presumably the thalamocortical spindle waves that might be generated by the reticular thalamic nucleus (Golomb, Wang, & Rinzel, 1994; Contreras & Steriade, 1996) can be a good example of coherent activity. Another example could be the synchronous bursts of activity that propagate as wave fronts in retinal ganglion cells of neonatal mammals (Meister, Wong, Denis, & Shatz, 1991; Wong, 1993). It has been suggested that these waves play an important role in the formation of ocular dominance layers in the lateral geniculate nucleus (Meister et al., 1991). It would be interesting to have a systematic study of neuronal tissue on the mm scale in different areas and under different conditions, to learn if and when nature makes use of continuous neural fields.
References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cyb., 27, 77–87.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked responses. Science, 273, 1868–1871.
Chu, P. S., Milton, J. G., & Cowan, J. D. (1994). Connectivity and the dynamics of integrate-and-fire neural networks. Int. J. of Bifurcation and Chaos, 4, 237–243.
Contreras, D., & Steriade, M. (1996). Spindle oscillation in cats: The role of corticothalamic feedback in a thalamically generated rhythm. J. of Physiology, 490, 159–179.
Cowan, J. D. (1985). What do drug induced visual hallucinations tell us about the brain? In W. B. Levy, J. A. Anderson, & S. Lehmkuhle (Eds.), Synaptic modification, neuron selectivity, and nervous system organization (pp. 223–241). Hillsdale, NJ: Lawrence Erlbaum.
Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cyb., 34, 136–150.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophy. J., 1, 445–466.
Fohlmeister, C., Gerstner, W., Ritz, R., & van Hemmen, J. L. (1995). Spontaneous excitation in the visual cortex: Stripes, spirals, rings, and collective bursts. Neural Comp., 7, 905–914.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of spiking neurons. Network, 3, 139–164.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. of Neurophys., 72, 1109–1126.
Hill, S. L., & Villa, A. E. P. (1994). Global spatio-temporal activity influenced by local kinetics in a simulated "cortical" neural network. In H. J. Herrmann, D. E. Wolf, & E. Pöppel (Eds.), Supercomputing in brain research: From tomography to neural networks. Singapore: World Scientific.
Jung, P., & Mayer-Kress, G. (1995). Noise controlled spiral growth in excitable media. Chaos, 5, 458–462.
Klüver, H. (1967). Mescal and mechanisms of hallucinations. Chicago: University of Chicago Press.
Marlin, S. G., Douglas, R. M., & Cynader, M. S. (1991). Position-specific adaptation in simple cell receptive fields of the cat striate cortex. Journal of Neurophysiology, 66(5), 1769–1784.
Meron, E. (1992). Pattern formation in excitable media. Physics Reports, 218, 1–66.
Meister, M., Wong, R. O. L., Denis, A. B., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Milton, J. G., Chu, P. H., & Cowan, J. D. (1993). Spiral waves in integrate-and-fire neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1001–1007). San Mateo, CA: Morgan Kaufmann.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM Jour. of App. Math., 50, 1645–1662.
Newell, A. C. (1985). Solitons in mathematics and physics. Philadelphia: Society for Industrial and Applied Mathematics.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Wong, R. O. L. (1993). The role of spatio-temporal firing patterns in neuronal development of sensory systems. Curr. Op. Neurobio., 3, 595–601.
Wörgötter, F., Niebur, E., & Koch, C. (1991). Isotropic connections generate functional asymmetrical behavior in visual cortical cells. J. Neurophysiol., 66(2), 444–459.

Received July 11, 1996; accepted February 14, 1997.
Communicated by Steven Nowlan
Time-Series Segmentation Using Predictive Modular Neural Networks Athanasios Kehagias Department of Electrical Engineering, Aristotle University of Thessaloniki, GR 54006, Thessaloniki, Greece, and Department of Mathematics, American College of Thessaloniki, GR 55510 Pylea, Thessaloniki, Greece
Vassilios Petridis Department of Electrical Engineering, Aristotle University of Thessaloniki, GR 54006, Thessaloniki, Greece
A predictive modular neural network method is applied to the problem of unsupervised time-series segmentation. The method consists of the concurrent application of two algorithms: one for source identification, the other for time-series classification. The source identification algorithm discovers the sources generating the time series, assigns data to each source, and trains one predictor for each source. The classification algorithm recursively computes a credit function for each source, based on the competition of the respective predictors according to their predictive accuracy; the credit function is used for classification of the time-series observation at each time step. The method is tested by numerical experiments.

1 Introduction

For a given time series, the segmentation process should result in a number of models, each of which best describes the observed behavior for a particular segment of the time series. This can be formulated in a more exact manner as follows. A time series yt, t = 1, 2, . . . is generated by a source S(zt), where zt is a time-varying source parameter, taking values in a finite parameter set Θ = {θ1, θ2, . . . , θK}. At time t, yt depends on the value of zt, as well as on yt−1, yt−2, . . . However, the nature of this dependence—the number K of parameter values as well as the actual values θ1, θ2, . . . , θK—is unknown. What is required is to find the value of zt for t = 1, 2, . . . or, in other words, to classify yt for t = 1, 2, . . . Typical examples of this problem include speech recognition (Rabiner, 1988), radar signal classification (Haykin & Cong, 1991), and electroencephalogram processing (Zetterberg et al., 1981); other applications are listed in Hertz, Krogh, and Palmer (1991).

Under the assumptions that (1) the parameter set Θ = {θ1, θ2, . . . , θK} is known in advance and (2) a predictor has been trained on source-specific
Neural Computation 9, 1691–1709 (1997)
© 1997 Massachusetts Institute of Technology
data before the classification process starts, we have presented a general framework for solving this type of problem (Petridis & Kehagias, 1996a, 1996b; Kehagias & Petridis, 1997), using predictive modular neural networks (PREMONNs). The basic feature of the PREMONN approach is the use of a decision module and a bank of prediction modules. Each prediction module is tuned to a particular source S(θk) and computes the respective prediction error; the decision module uses the prediction errors recursively to compute certain quantities (one for each source), which can be understood as Bayesian posterior probabilities (probabilistic interpretation) or simply as credit functions (phenomenological interpretation). Classification is performed by maximization of the credit function. Similar approaches have been proposed in the control and estimation literature (Hilborn & Lainiotis, 1969; Sims et al., 1969; Lainiotis, 1971) and in the context of predictive hidden Markov models (Kenny, Lennig, & Mermelstein, 1990). These approaches also assume the sources to be known in advance.

On the other hand, several recent articles (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994; Jordan & Xu, 1995; Xu & Jordan, 1996) use a similar approach but provide a mechanism for unsupervised source specification. The term mixtures of experts is used in these articles; each expert is a neural network specialized in learning the required input-output relationship for a particular region of the pattern space. There is also a gating network that combines the output of each expert to produce a better approximation of the desired input-output relationship. The activation pattern of the local experts is the output of the gating network, and it can be considered as a classifier output: when a particular expert is activated, this indicates that the current observation lies in a particular section of the pattern space.

The emphasis in the mixture-of-experts method is on static pattern classification rather than classification of dynamic patterns such as time series. However, there are some applications of the mixture-of-experts method to dynamic problems of time-series classification. For example, in Prank et al. (1996), the mixture-of-experts method is applied to a problem of classifying the "pulsatile" pattern of growth hormone in healthy subjects and patients. The goal is to discriminate the pulsatile pattern of healthy subjects from that of patients. In general, the time series corresponding to a healthy subject is more predictable than that corresponding to a patient. Classification is based on the level of prediction error for a particular time series (low error indicating a healthy subject and high error indicating a patient). The selection pattern of local experts (which expert is activated at any given time step) can also be used for classification, since the frequency distribution of most probable experts differs significantly between patients and healthy subjects. However, the activation of experts by the gating network is performed at every time step independent of previous activation patterns. This may result in the loss of important correlations across time steps, which characterize the dynamics of the time series.
This static approach to time-series classification may be problematic in cases where there is significant correlation between successive time steps of the time series. This has been recognized and discussed by Cacciatore and Nowlan (1994), who state that "the original formulation of this [local experts] architecture was based on an assumption of statistical independence of training pairs. . . . This assumption is inappropriate for modelling the causal dependencies in control tasks." Cacciatore and Nowlan proceed to develop the mixture-of-controllers architecture, which emphasizes "the importance of using recurrence in the gating network." The mixture of controllers is similar to the method we present in this article. However, the method presented by Cacciatore and Nowlan requires several passes through the training data; our method, on the other hand, can be operated online because it does not require multiple passes. In addition, in some cases, Cacciatore and Nowlan need to provide the active sources explicitly (for some portion of the training process) for the mixture-of-controllers network to accomplish the required control task; in our case, learning is completely unsupervised, and the active sources always remain hidden.

Pawelzik, Kohlmorgen, and Müller (1996) present an approach using annealed competition of experts. The method they present is similar to the work cited earlier; it is applied to time-series segmentation and does not assume the sources to be known in advance. This method is mainly suitable for offline implementation due to the heavy computational load incurred by the annealing process.

In this article we present an unsupervised time-series segmentation scheme, which combines two algorithms: (1) the source identification algorithm, which detects the number of active sources in the time series, performs assignment of data to the sources, and uses the data to train one predictor for each active source, and (2) the classification algorithm, which uses the sources and predictors specified by the identification algorithm and implements a PREMONN, which computes credit functions. A probabilistic (Bayesian) interpretation of this algorithm is possible but not necessary. The combination of the two algorithms—the source identification one and the time-series classification one—constitutes the combined time-series segmentation scheme; the two algorithms can be performed concurrently and online. A crucial assumption is that of slow switching between sources. Our approach is characterized by the use of a modular architecture, with predictive modules competing for data and credit assignment. A decision module is also used, to assign the credit in a recursive manner that takes into account previous classification decisions. A convergence analysis of the classification algorithm is available elsewhere (Kehagias & Petridis, 1997).

2 The Time-Series Segmentation Problem

Consider a time series zt, t = 1, 2, . . . , taking values in the unknown set Θ = {θ1, . . . , θK}. Further consider a time series yt, t = 1, 2, . . . that is generated
by some source S(zt): for every value of zt a source S(zt) is activated, which produces yt using yt−1, yt−2, . . . We call zt the source parameter and yt the observation. For instance, if zt = θ1, we could have yt = f(yt−1, yt−2, . . . ; θ1), where f(·) is an appropriate function with parameter θ1; θ1 can be a scalar or vector quantity or, simply, a label variable. The task of time-series segmentation consists in estimating the values z1, z2, . . . , based on the observations y1, y2, . . . The size K of the source set Θ and its members θ1, θ2, . . . , θK are initially unknown.

This original task can be separated into two subtasks: source identification and time-series classification. The task of source identification consists of determining the active sources or, equivalently, the source set Θ = {θ1, . . . , θK}. Having achieved this, the time-series classification task consists of assigning each zt (t = 1, 2, . . . ) to a value θk (k = 1, 2, . . . , K) or, equivalently, determining at each time step t the source S(θk) that has generated yt. The combination of the two subtasks yields the original time-series segmentation task.

Two algorithms are presented here, one to solve each subtask. The source identification algorithm distributes data y1, y2, . . . , yt among active sources and uses the data to train one predictor per source. For instance, a data segment y1, y2, . . . , yM may be assigned to source S(θ1) and used to approximate some postulated relationship yt = f(yt−1, yt−2, . . . ; θ1).¹ The time-series classification algorithm uses the predictors trained in the source identification phase to compute predictions of incoming observations. The respective prediction errors are used to compute a credit function for each predictor or source; an observation yt is classified to the source of maximum credit.

¹ yt = f(yt−1, yt−2, . . . ; zt) reflects the dependence of yt on past observations, as well as on the parameter zt. In general, the true input-output relationship will be unknown and f(·) will be an approximation, such as furnished by a sigmoid neural network. In fact, due to the competitive nature of the algorithms presented, even a crude approximation suffices, as will be explained later.

Each of the two algorithms has been developed using a different approach. The source identification algorithm has a heuristic motivation; its justification is experimental. On the other hand, the classification algorithm is motivated by probabilistic considerations. Although in this article its performance is evaluated only experimentally, a detailed theoretical analysis is available (Kehagias & Petridis, 1997).
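As a toy instance of this setup, one can generate observations from two hypothetical autoregressive sources under the slow-switching assumption; the AR coefficients, noise level, and switch period below are invented for illustration.

import numpy as np

def switching_series(T=3000, period=200, seed=0):
    """y_t generated by one of two AR(1) sources, with the source
    parameter z_t switching every `period` steps (slow switching)."""
    rng = np.random.default_rng(seed)
    coeff = {0: 0.9, 1: -0.7}                  # stand-ins for theta_1, theta_2
    y = np.zeros(T)
    z = np.zeros(T, dtype=int)
    for t in range(1, T):
        z[t] = (t // period) % 2               # active source at time t
        y[t] = coeff[z[t]] * y[t - 1] + 0.1 * rng.standard_normal()
    return y, z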
3 The Time-Series Segmentation Scheme

We first present the two algorithms that constitute the time-series segmentation scheme, then describe the manner in which they are combined.

3.1 The Source Identification Algorithm. Other authors have tackled the same problem using similar methods (Pawelzik et al., 1996). There are also connections to k-means (Duda & Hart, 1973) and Kohonen's learning vector quantizer (Kohonen, 1988); note, however, that these have been applied mainly to classification of static patterns. We now give a description of the algorithm. The following parameters will be used: L, the length of a data block; M, the length of a data segment, where M < L; K, the number of active predictors; and Kmax, the maximum allowed number of active predictors.

Source Identification Algorithm.

Initialization. Set K = 1. Use data block y1, y2, . . . , yL to train the first predictor, of the form yt1 = f(yt−1, yt−2, . . . ; θ1), where f(·) is a (sigmoid) neural network and θ1 is taken to be the connection weights.

Main Routine. Loop for n = 1, 2, . . .
1. Split data block yt, t = n · L + 1, n · L + 2, . . . , (n + 1) · L (that is, L time steps) into L/M data segments (of M time steps each, M < L).

2. Loop for k = 1, . . . , K.
   (a) Use each segment of yt's as input to the kth predictor to compute a prediction ytk = f(yt−1, . . . ; θk) for each time t.
   (b) For every segment, compare the predictions ytk with the actual observations yt (t = n · L + 1, n · L + 2, . . . , (n + 1) · L) to compute the segment-average square prediction error (henceforth SASPE).
   End of k loop.

3. Loop for every segment in the block (a total of L/M segments).
   (a) For the current segment, find the predictor that produces the minimum SASPE.
   (b) For this predictor, compare the minimum SASPE with a respective threshold.²
   (c) In case SASPE is above the threshold and K < Kmax, set K = K + 1 and assign the current segment to a new predictor; otherwise, assign the segment to the predictor of minimum SASPE.
   End of segment loop.

4. Loop for k = 1, . . . , K.
² This threshold is computed separately for each predictor.
   (a) Retrain the kth predictor for J iterations, using the L/M most current data segments assigned to it.
   End of k loop.
End of n loop.

It is assumed that the switching rate is slow enough so that the initial segment will mostly contain data from one source; however, some data from other sources may also be present. The modular and competitive nature of the algorithm is evident. A data segment is assigned to a predictor even if the respective prediction is poor; what matters is that it is better than the remaining ones. Hence, every predictor is assigned data segments on the assumption that they were generated by a particular source, and the predictor is trained (specialized) on this source. Training can be performed independently, in parallel, for each predictor. Numerical experimentation indicates that the values of several parameters may affect performance:

1. L is the length of a data block. If L is too large, then training for each data block requires a long time. If L is too small, then not enough data are available for the retraining of the predictors. For the experiments reported in section 4, we have found that values around 100 are sufficient to ensure good training of the predictors without incurring excessive computational load.

2. M is the length of a data segment; it is related to the switching rate of z_t. Let Ts denote the minimum number of time steps between two successive source switches. While Ts is unknown, we operate on the assumption of slow switching, which means that Ts will be relatively large. Since all the M data points included in a segment will be assigned to the same source (and used to train the same predictor), it is obviously desirable that they have been truly generated by the same source; otherwise the respective predictor will not specialize on one source. In reality, however, this is not guaranteed to happen. In general, a small value of M will increase the likelihood that most segments contain data from a single source. On the other hand, it has been found that small M leads to an essentially random assignment of data points to sources, especially in the initial stages of segmentation, when the predictors have not specialized sufficiently. The converse situation holds for large M. In practice, one needs to guess a value for Ts and then take M somewhere between 1 and Ts. This choice is consistent with the hypothesis of slow switching rate. The practical result is that most segments in a block contain data from exactly one source, and a few segments contain data from two sources. The exact value of Ts is not required to be known; a rough guess suffices.
3. J, the number of training iterations, should be taken relatively small, since the predictors must not be overspecialized in the early phases of the algorithm, when relatively few data points are available.

4. The error threshold is used to determine if and when a new predictor should be added. At every retraining epoch, the training prediction error, E_k, is computed for k = 1, 2, . . . , K. Then the threshold, D_k, is set equal to a·E_k, where a is a scale factor. A high a value makes the introduction of new predictors difficult, since large prediction errors are tolerated; conversely, a low a value facilitates the introduction of new predictors.

5. In practice, a value Kmax is set to be the maximum number of predictors used. As long as this is equal to or larger than the number of truly active sources, it does not affect the performance of the algorithm seriously.

A further quantity of interest is T, the length of the time series. Theoretically this is infinite, since the algorithm is online, with new data becoming continuously available. However, it is interesting to note the minimum number of data required for the algorithm to perform the source identification successfully. In the experiments presented in section 4, it is usually between 1000 and 3000. This is the information that the algorithm requires to discover the number of active sources and to train respective predictors. In addition, the time required to process the data is short, because only a few training iterations are performed per data segment; these computations are performed while waiting for the new data segment to be accumulated. The short training time and the small amount of data required make the source identification algorithm very suitable for online implementation.

3.2 The Classification Algorithm. The classification algorithm has already been presented in Petridis and Kehagias (1996b) and Kehagias and Petridis (1997) under the name of predictive modular neural network (PREMONN). A brief summary is given here. It is assumed that the source parameter time series z_t, t = 1, 2, . . . , takes values in the known set Θ = {θ_1, . . . , θ_K} (this has been specified by the source identification algorithm). Corresponding to each source S(θ_k), there is a predictor (e.g., a neural network trained by the source identification algorithm) characterized by the following equation (k = 1, 2, . . . , K): y_t^k = f(y_{t−1}, y_{t−2}, . . . ; θ_k).
(3.1)
At every time step t, the classification algorithm is required to identify the member of Θ that best matches the observation y_t. Using a probabilistic approach, one can define p_t^k = Pr(z_t = θ_k | y_1, y_2, . . . , y_t). In Petridis and
Kehagias (1996a) the following p_t^k update is obtained:

p_t^k = [ p_{t−1}^k · e^{−|y_t − y_t^k|²/σ²} ] / [ ∑_{l=1}^K p_{t−1}^l · e^{−|y_t − y_t^l|²/σ²} ].   (3.2)
Equation (3.2) is a recursive implementation of Bayes' rule. The recursion is started using p_0^k, the prior probability assigned to θ_k at time t = 0 (before any observations are available). In the absence of any prior information, all models are assumed equally credible:

p_0^k = 1/K for k = 1, 2, . . . , K.   (3.3)
Finally, ẑ_t (the estimate of z_t) is chosen as

ẑ_t = arg max_{θ_k ∈ Θ} p_t^k.   (3.4)
Hence the PREMONN classification algorithm is summarized by equations 3.1 through 3.4. This is a recursive implementation of Bayes' rule; the value p_t^k reflects our belief (at time t) that the observations are produced by a model with parameter value θ_k. The probabilistic interpretation of the algorithm, presented above, is not necessary. A different, phenomenological, interpretation goes as follows. p_t^k is a credit function that evaluates the performance of the kth source in predicting the observations y_1, y_2, . . . , y_t. This evaluation depends on two factors: e^{−|y_t − y_t^k|²/σ²} reflects how accurately the current observation was predicted, and p_{t−1}^k reflects past performance. Hence equation 3.2 updates the credit function p_t^k recursively. The update takes into account the previous value p_{t−1}^k; this makes p_t^k implicitly dependent on the entire time-series history. The final effect is that correlations between successive observations are captured. Note also the competitive nature of classification, which is evident in equation 3.2: even if a predictor performs poorly, it may still receive high credit as long as the remaining predictors perform even worse. Kehagias and Petridis (1997) proved that equation 3.2 converges to appropriate values that ensure classification to the most accurate predictor. In the case of slow switching (as assumed in this article), convergence to the active source is guaranteed for the time interval between two source switchings, provided that the convergence rate is fast, compared to the switching rate. Again, this requirement is in accordance with the hypothesis of slow switching. The performance of the classification algorithm is influenced by two parameters, σ and h.
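The update is simple to implement. The following is a minimal Python sketch of equations 3.2 through 3.4 (the article's experiments used MATLAB; the function name and the renormalization after flooring the credits at h are our own choices, and σ and h are the two parameters discussed next):

    import numpy as np

    def premonn_step(y_t, y_pred, p_prev, sigma, h=0.01):
        # y_pred[k] is the kth predictor's output y_t^k; p_prev is p_{t-1}.
        w = p_prev * np.exp(-np.abs(y_t - y_pred) ** 2 / sigma ** 2)
        p = w / w.sum()                 # equation 3.2 (recursive Bayes' rule)
        p = np.maximum(p, h)            # credit floor; see the threshold h below
        p = p / p.sum()                 # renormalize after flooring (our choice)
        return p, int(np.argmax(p))     # equation 3.4: estimated source index

The adaptive σ estimate given below can be computed at each step as sigma = np.sqrt(np.mean(np.abs(y_t - y_pred) ** 2)).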
σ is a smoothing parameter. Under the probabilistic interpretation, σ is the standard deviation of the prediction error. A phenomenological interpretation is also possible. For a fixed error |y_t − y_t^k|, note that a large σ results in a large value of e^{−|y_t − y_t^k|²/σ²}. This is of interest particularly in the case where a predictor performs better than the rest over a large time interval but occasionally generates a bad prediction (perhaps due to random fluctuation). In this case, a very small σ will result in a very small value of e^{−|y_t − y_t^k|²/σ²} in equation 3.2. This, in turn, will decrease p_t^k excessively. On the other hand, a very large σ results in sluggish performance (a long series of large prediction errors is required before p_t^k is sufficiently decreased). For the experiments reported in this article, σ is determined at every time step t by the following adaptive estimate:

σ = √( ∑_{k=1}^K |y_t − y_t^k|² / K ).
A threshold parameter h is also used in the classification algorithm. Suppose that z_t remains fixed for a long time at, say, θ_1; then at time t_s, a switch to θ_2 occurs. For t < t_s, e^{−|y_t − y_t^1|²/σ²} is large and e^{−|y_t − y_t^2|²/σ²} is small (much less than one). Considering equation 3.2, one sees that p_t^2 decreases exponentially; after a few steps it may become so small that computer underflow takes place, and p_t^2 is set to 0. It follows from equation 3.2 that after that point, p_t^2 will remain 0 for all subsequent time steps; in particular, this will also be true after t_s, despite the fact that now the second predictor performs well (e^{−|y_t − y_t^2|²/σ²} will be large, but multiplied by p_{t−1}^2 it will still yield p_t^2 = 0). Hence the currently best predictor receives zero credit and is excluded from the classification process. What is required is to ensure that p_t^2 will never become too small. Hence the following strategy is used: whenever p_t^k falls below h, it is reset to h; if h is chosen small (say around 0.01), then this resetting does not influence the classification process, but it also gives predictors with low credit the chance to recover quickly, once the respective source is activated.

3.3 The Time-Series Segmentation Scheme. The time-series segmentation scheme consists of the concurrent application of the two algorithms. The classification algorithm is run on the current data block, using the predictors discovered and trained by the source identification algorithm in the previous epoch³ (operating on the previous data block). Concurrently, the identification algorithm operates on the current block and provides an update of the predictors, but this update will be used only in the next epoch
³ An epoch is defined as the interval of L time steps that corresponds to a data block.
of the segmentation scheme. Hence, the classification algorithm uses information provided by the source identification algorithm, but not vice versa. The identification algorithm in effect provides a crude classification by assigning segments to predictors; the role of the classification algorithm is to provide segmentation at the level of individual time steps. Comparatively speaking, in the initial stages of the segmentation process, the identification algorithm is more important than the classification algorithm and the predictors change significantly at every epoch; at a later stage, the predictors reach a steady state, and the source identification algorithm becomes less important. In fact, a reduced version of the segmentation scheme could use only the identification algorithm initially (until the predictors are fairly well defined) and only the classification algorithm later (when the predictors are not expected to change significantly). However, for this reduced version to work, it is required that new sources are not activated at a later stage of the time-series evolution.

4 Experiments

Two sets of experiments are presented in this section to evaluate the segmentation scheme.

4.1 Four Chaotic Time Series. Consider four sources, each generating a chaotic time series:

1. For z_t = 1, a logistic time series of the form y_t = f_1(y_{t−1}), where f_1(x) = 4x(1 − x).
2. For z_t = 2, a tent map time series of the form y_t = f_2(y_{t−1}), where f_2(x) = 2x if x ∈ [0, 0.5) and f_2(x) = 2(1 − x) if x ∈ [0.5, 1].
3. For z_t = 3, a double logistic time series of the form y_t = f_3(y_{t−1}) = f_1(f_1(y_{t−1})).
4. For z_t = 4, a double tent map time series of the form y_t = f_4(y_{t−1}) = f_2(f_2(y_{t−1})).

The four sources are activated consecutively, each for 100 time steps, giving an overall period of 400 time steps. Ten such periods are used, resulting in a 4000-step time series. The task is to discover the four sources and the switching schedule by which they are activated. Six hundred time steps of the composite time series (encompassing the source transition 1 → 2 → 3 → 4 → 1 → 2) are presented in Figure 1. Hence the first hundred time steps correspond to z_t = 1; the second hundred time steps correspond to z_t = 2; and so on. A number of experiments are performed using several different combinations of block and segment lengths; also the time series is observed at various levels of noise—that is, at every step, y_t is mixed with additive white
Figure 1: Six hundred steps of a composite time series produced by four chaotic sources (logistic, tent map, double logistic, and double tent map) being activated alternately, for 100 time steps each: steps 1 to 100 correspond to source 1, steps 101 to 200 correspond to source 2, steps 201 to 300 correspond to source 3, steps 301 to 400 correspond to source 4, steps 401 to 500 correspond to source 1, steps 501 to 600 correspond to source 2.
noise uniformly distributed in the interval [−A/2, A/2]. Hence an experiment is determined by a combination of L, M, and A values. The predictors used are 1-5-1 sigmoid neural networks; the maximum number of predictors is Kmax = 6; the number of training iterations per time step is J = 5; the credit threshold is set at h = 0.01.⁴ In every experiment performed, all four sources are eventually identified. This takes place at some time Tc, which is different for every experiment. After time Tc, a classification figure of merit is computed. It is denoted by c = T_2/T_1, where T_1 is the total number of time steps after Tc, and T_2 is the number of correctly classified time steps after Tc. The results of the experiments (Tc and c) are listed in Tables 1 and 2. The evolution of a representative credit function, in this case p_t^3 (the remaining five credit functions evolve in a similar manner), is presented in Figure 2A. This figure was obtained from a noise-free experiment (A = 0.0), with L = 105, M = 15. Note that classification beyond time Tc is very stable. The classification decisions (namely the source and predictor selected at every time step) for the same experiment are presented in Figure 2B. It can be observed that

⁴ For training the sigmoid predictors, a Levenberg-Marquardt algorithm was used; this was implemented by Magnus Norgaard, Technical University of Denmark, and distributed as part of the Neural Network System Identification Toolbox for MATLAB.
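For concreteness, the composite chaotic series just described can be generated as in the following Python sketch. The article does not specify whether the noise feeds back into the chaotic iteration or how the state is carried across source switches; here the iteration runs on the clean state, the noise affects only the observations, and the state simply carries over at switches:

    import numpy as np

    def logistic(x):
        return 4.0 * x * (1.0 - x)

    def tent(x):
        return 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)

    SOURCES = [logistic, tent,
               lambda x: logistic(logistic(x)),   # double logistic
               lambda x: tent(tent(x))]           # double tent map

    def composite_series(n_steps=4000, period=100, noise_a=0.0, seed=0):
        rng = np.random.default_rng(seed)
        y = np.empty(n_steps)
        x = rng.uniform(0.1, 0.9)                 # state of the iteration
        for t in range(n_steps):
            z = (t // period) % len(SOURCES)      # active source, 100 steps each
            x = SOURCES[z](x)
            y[t] = x + rng.uniform(-noise_a / 2, noise_a / 2)
        return y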
Table 1: Four Chaotic Time Series, with Switching Period 100.

         L = 70, M = 10    L = 100, M = 10   L = 120, M = 10
  A      Tc       c        Tc       c        Tc       c
  0.00    500   0.982      1200   0.976      3000   0.963
  0.05   1600   0.969      1100   0.968      3200   0.899
  0.10   1800   0.939      1100   0.962      1300   0.898
  0.15    800   0.947      1600   0.934      1400   0.767
  0.20   2200   0.529      3900   0.899      2200   0.359

Notes: Time-series length, 4000 points. Segment length M = 10, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
Table 2: Four Chaotic Time Series, with Switching Period 100.

         L = 75, M = 15    L = 105, M = 15   L = 120, M = 15
  A      Tc       c        Tc       c        Tc       c
  0.00    900   0.976      1200   0.983      1600   0.980
  0.05   1600   0.971      1700   0.967      1600   0.963
  0.10   1600   0.954      2800   0.973      1800   0.819
  0.15   1700   0.943      2600   0.927      2200   0.613
  0.20   1600   0.130      2200   0.416      2800   0.708

Notes: Time-series length, 4000 points. Segment length M = 15, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
the four sources correspond to predictors 1, 2, 3, and 6. Predictors 4 and 5 have received relatively few data, have been undertrained, and do not correspond to any of the truly active sources; generally these predictors play no significant role in the segmentation process.

4.2 Mackey-Glass Time Series. Now consider a time series obtained from three sources of the Mackey-Glass type. The original data evolve in continuous time and satisfy the differential equation:

dy/dt = −0.1 y(t) + 0.2 y(t − t_d) / (1 + y(t − t_d)^10).   (4.1)
For each source a different value of the delay parameter t_d was used: t_d = 17, 23, and 30. The time series is sampled in discrete time, at a sampling rate τ = 6, with the three sources being activated alternately, for 100 time steps each. The final result is a time series with a switching period of 300
Figure 2: A time-series segmentation experiment with four chaotic time series, L = 105, M = 15, A = 0.00 (noise-free time series). (A.) Evolution of credit function p_t^3. (B.) Classification decisions. Note that all four sources have been identified at time Tc = 1200. The corresponding predictors are 1, 2, 3, and 6. After Tc = 1200, all classifications are made to these four predictors or sources; predictors 4 and 5 play practically no role in the classification process.
and a total length of 4000 time steps. Six hundred time steps of this final time series (encompassing the source transition 1 → 2 → 3 → 1 → 2 → 3) are presented in Figure 3. Hence, the first hundred time steps correspond to zt = 1; the second hundred time steps correspond to zt = 2; and so on.
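The Mackey-Glass segments can be generated numerically; the sketch below uses a simple Euler discretization of equation 4.1 (the integration step dt, the constant initial history y0, and the warm-up are our assumptions — the article specifies only t_d, the sampling rate τ = 6, and the switching schedule):

    import numpy as np

    def mackey_glass(n_samples, td=17, tau=6, dt=0.1, y0=1.2):
        """Euler integration of equation 4.1, sampled every tau time units."""
        steps_per_sample = int(round(tau / dt))
        delay = int(round(td / dt))
        n_steps = n_samples * steps_per_sample + delay
        y = np.full(n_steps, y0)                 # constant initial history
        for i in range(delay, n_steps - 1):
            y_del = y[i - delay]
            y[i + 1] = y[i] + dt * (-0.1 * y[i] + 0.2 * y_del / (1.0 + y_del ** 10))
        return y[delay::steps_per_sample][:n_samples]

Concatenating 100-sample segments with td = 17, 23, and 30 in rotation gives a composite series of the kind shown in Figure 3; the values land roughly in the [0.7, 1.3] range noted in the discussion below.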
Figure 3: Six hundred steps of a composite time series produced by three Mackey-Glass sources (with t_d equal to 17, 23, and 30, respectively) being activated alternately, for 100 time steps each: steps 1 to 100 correspond to source 1, steps 101 to 200 correspond to source 2, steps 201 to 300 correspond to source 3, steps 301 to 400 correspond to source 1, steps 401 to 500 correspond to source 2, steps 501 to 600 correspond to source 3.
Again, a particular experiment is determined by a combination of L, M, and A values. The predictors used are 5-5-1 sigmoid neural networks; the maximum number of predictors is Kmax = 6; the number of training iterations per time step is J = 5; the credit threshold is set at h = 0.01. Again, all three sources were correctly identified in every experiment. The results of the experiments, Tc and c (described in section 4.1), are listed in Tables 3 and 4. The evolution of three representative credit functions, p_t^1, p_t^2, p_t^3, is presented in Figures 4A-C, respectively. This figure was obtained from an experiment with A = 0.15, L = 105, M = 15. Note that classification beyond time Tc is quite stable. The classification decisions (namely the source and predictor selected at every time step) for the same experiment are presented in Figure 4D. It can be observed that the three sources correspond to predictors 1, 2, and 3. Predictors 4, 5, and 6 have received relatively few data, have been undertrained, and do not correspond to any of the truly active sources; generally these predictors play no significant role in the segmentation process.

4.3 Discussion. It can be observed from Tables 1 through 4 that the segmentation scheme takes between 1000 and 3000 steps to discover the active sources. After that point, classification is quite accurate for low to middle
Table 3: Three Mackey-Glass Time Series, with Switching Period 100.

         L = 70, M = 10    L = 100, M = 10   L = 120, M = 10
  A      Tc       c        Tc       c        Tc       c
  0.00   1700   0.978      1400   0.976      3000   0.925
  0.05   1100   0.977      1200   0.975      2200   0.977
  0.10   1300   0.853      2700   0.964       800   0.943
  0.15   3500   0.935      1300   0.796      1300   0.945
  0.20   1200   0.664      1000   0.637      1200   0.839

Notes: Time-series length, 4000 points. Segment length M = 10, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
Table 4: Three Mackey-Glass Time Series, with Switching Period 100.

         L = 75, M = 15    L = 105, M = 15   L = 120, M = 15
  A      Tc       c        Tc       c        Tc       c
  0.00   1400   0.966      1000   0.968      1700   0.974
  0.05   1200   0.974      1200   0.940      2000   0.954
  0.10   2000   0.947      1700   0.966      2900   0.965
  0.15   1600   0.946      1000   0.918      2000   0.902
  0.20   1200   0.618      1500   0.676      2100   0.782

Notes: Time-series length, 4000 points. Segment length M = 15, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
noise levels and gradually drops off as the noise level increases.⁵ The scheme is fast: source identification and classification of the 4000 time steps (run concurrently) require a total of approximately 4 minutes. The algorithms are implemented in MATLAB and run on a 486 120-MHz IBM-compatible machine; a significant speedup would result from a C implementation (consider that MATLAB is an interpreted language). The same segmentation tasks have been treated by the method of annealed competition of experts (ACE) in Pawelzik et al. (1996). Regarding accuracy of segmentation, both methods appear to achieve roughly the same accuracy; however, the ACE method was applied to either noise-free or low-noise data; we do not know how well the ACE method performs in the presence of relatively high noise. Execution time is not cited in Pawelzik

⁵ To evaluate the effect of noise, keep in mind that the chaotic time series take values in the [0, 1] interval and the Mackey-Glass time series in the [0.7, 1.3] interval.
et al. (1996) for the two tasks considered here. On a third, prediction, task, the ACE method requires 2.5 hours on a SPARC 10/20 GX. In general, we expect the ACE method to be rather time-consuming, since it depends on annealing, which requires many passes through the training data. Regarding the required number of time-series observations, the ACE method uses 1200 data points for the first task and 400 data points for the second; several annealing passes are required. Our method requires a single pass through at most 3000 data points. We do not consider this an important difference. Because all time series involved are ergodic, we
Figure 4: A time-series segmentation experiment with three Mackey-Glass time series: L = 105, M = 15, A = 0.15. (A.) Evolution of credit function p_t^1. (B.) Evolution of credit function p_t^2. (C.) Evolution of credit function p_t^3. (D.) Classification decisions for the credit functions of Figures 4A-C. Note that all three sources have been identified at time Tc = 1000. The corresponding predictors are 1, 2, and 3. After Tc = 1000, most classifications are made to these three predictors or sources; however, some misclassifications to predictor 4 occur due to the presence of noise. Predictors 5 and 6 play practically no role in the classification process.
conjecture that processing 3000 data points once should yield more or less the same information as cycling 30 times over 100 data points.⁶

5 Conclusions

The unsupervised time-series segmentation scheme presented combines a source identification and a time-series classification algorithm. The source identification algorithm determines the number of sources generating the observations, distributes data between the sources, and trains one predictor for each source. The predictors are then used by the classification algorithm as follows. For every incoming data segment, each predictor computes a prediction; comparing these to the actual observation, segment prediction errors are computed for each predictor. The prediction errors are used to update a credit function recursively and competitively for each predictor or source. Sources that correspond to more successful predictors receive higher credit. Each segment is classified to the source that currently has the highest credit. The combined scheme consists of the identification and classification algorithms running concurrently, in a competitive and recursive manner. The scheme is accurate and fast and requires few training data, as has been established by numerical experimentation; hence it is suitable for online implementation. It is also modular: training and prediction are performed by independent modules, each of which can be removed from the system without affecting the properties of the remaining modules. This results in short training and development times. Finally, a convergence analysis of the classification algorithm has been published elsewhere (Kehagias & Petridis, 1997); this guarantees correct classification. Although we have not made any assumptions about the switching behavior of the source parameter, a Markovian assumption can be easily incorporated in the update equation 3.2. A convergence analysis for this case is also available. This modification affects the classification algorithm only, leaving the source identification algorithm unaffected. Due to lack of space, these developments will be presented in the future.

References

Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and nonlinear plants. In Advances in neural information processing systems 6 (NIPS 93), J. D. Cowen, G. Tesauro, & J. Alspector (Eds.), (pp. 719-726). San Mateo, CA: Morgan Kaufmann.
⁶ Using the same data for training and testing makes the task considered in Pawelzik et al. (1996) somewhat easier than the one considered in this article, where segmentation is evaluated on previously unseen data. But we believe that the difference is not great, due to the ergodic nature of the time series.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Haykin, S., & Cong, D. (1991). Classification of radar clutter using neural networks. IEEE Trans. on Neural Networks, 2, 589-600.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hilborn, C. G., & Lainiotis, D. G. (1969). Unsupervised learning minimum risk pattern classification for dependent hypotheses and dependent measurements. IEEE Trans. on Systems Science and Cybernetics, 5, 109-115.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
Jordan, M. I., & Xu, L. (1995). Convergence results for the EM algorithm to mixtures of experts architectures. Neural Networks, 8, 1409-1431.
Kehagias, A., & Petridis, V. (1997). Predictive modular neural networks for time series classification. Neural Networks, 10, 31-49.
Kenny, P., Lennig, M., & Mermelstein, P. (1990). A linear predictive HMM for vector-valued observations with applications to speech recognition. IEEE Trans. on Acoustics, Speech and Signal Proc., 38, 220-225.
Kohonen, T. (1988). Self organization and associative memory. New York: Springer-Verlag.
Lainiotis, D. G. (1971). Optimal adaptive estimation: Structure and parameter adaptation. IEEE Trans. on Automatic Control, 16, 160-170.
Pawelzik, K., Kohlmorgen, J., & Muller, K. R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8, 340-356.
Petridis, V., & Kehagias, A. (1996a). A recurrent network implementation of time series classification. Neural Computation, 8, 357-372.
Petridis, V., & Kehagias, A. (1996b). Modular neural networks for MAP classification of time series and the partition algorithm. IEEE Trans. on Neural Networks, 7, 73-86.
Prank, K., Kloppstech, M., Nowlan, S. J., Sejnowski, T. J., & Brabant, G. (1996). Self-organized segmentation of time series: Separating growth hormone secretion in acromegaly from normal controls. Biophysical Journal, 70, 2540-2547.
Rabiner, L. R. (1988). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.
Sims, F. L., Lainiotis, D. G., & Magill, D. T. (1969). Recursive algorithm for the calculation of the adaptive Kalman filter coefficients. IEEE Trans. on Automatic Control, 14, 215-218.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129-151.
Zetterberg, L. H., Patric, M., & Scott, D. (1981). Computer analysis of EEG signals with parametric models. Proc. IEEE, 69, 451-461.
Received August 20, 1996, accepted January 29, 1997.
Communicated by Peter Dayan
Adaptive Mixtures of Probabilistic Transducers Yoram Singer AT&T Labs, Florham Park, NJ 07932, U.S.A.
We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best transducer from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.

1 Introduction

Supervised learning of probabilistic mappings between temporal sequences is an important goal of natural data analysis and classification with a broad range of applications, including handwriting and speech recognition, natural language processing, and biological sequence analysis. Research efforts in supervised learning of probabilistic mappings have focused almost exclusively on estimating the parameters of a predefined model. For example, Giles et al. (1992) used a second-order recurrent neural network to induce a finite-state automaton that classifies input sequences, and Bengio and Frasconi (1994) introduced an input-output hidden Markov model (HMM) architecture that was used for similar tasks. In this article, we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers, which we call suffix tree transducers. Mixture models, often referred to as mixtures of experts, are a powerful approach both theoretically and experimentally. See DeSantis, Markowski, and Wegman (1988); Jacobs, Jordan, Nowlan, and Hinton (1991); Haussler and Barron (1993); Littlestone and Warmuth (1994); Cesa-Bianchi et al. (1993); and Helmbold and Schapire (1995) for analyses and applications of mixture models, from different perspectives, such as connectionism, Bayesian inference, and computational learning theory. A key advantage of the mixture of experts architecture is the ability to attack problems by dividing them into simpler subproblems, whose solutions can be found efficiently and then combined to yield a solution to the original problem. Thus, the common approach used in learning complex mixture models is based on the principle of divide and conquer, which usually results in a greedy search for a good model (Breiman, Friedman, Olshen, & Stone,
1984; Quinlan, 1993) or in a hill-climbing algorithm that (locally) maximizes a utility function (Jacobs et al., 1991; Jordan & Jacobs, 1994). In this article we consider a restricted class of models, which enables us to devise an alternative learning approach. The alternative suggested is to calculate the evidence associated with each possible model in the mixture efficiently rather than searching for a single model. By combining techniques used for compression and unsupervised learning (Willems, Shtarkov, & Tjalkens, 1995; Ron, Singer, & Tishby, 1996), we obtain an online algorithm that efficiently estimates the weights of all the possible models from an arbitrarily large (possibly infinite) pool of suffix tree transducers.

The article starts by defining the class of models used in it for learning a mapping from input sequences to output sequences: suffix tree transducers. Informally, suffix tree transducers are prediction trees where the features used to determine the predicting node are suffixes of the input sequence. We then describe a mixture model of suffix tree transducers. The specific mixture model considered exploits the tree structure of the transducers by defining a hierarchy of trees where each tree in the mixture is a refinement of a smaller tree in the hierarchy. Although there might be a huge number of tree-based transducers in the pool of models, we show that by using a recursive prior, we can efficiently calculate the prediction of the entire mixture in time linear in the depth of the largest tree in the mixture. This prior can be viewed as a creation process where at a given node in the suffix tree, we choose either to step successively down the tree and create more nodes or stop and use the current node for predicting the outcome (the next symbol of the output sequence). Based on the recursive prior, we describe how to maintain, in a computationally efficient manner, posterior probabilities of the same recursive structure as the prior probability distribution. Furthermore, we show that this construction of posterior probabilities is not only efficient but also robust in the sense that the predictions of the mixture of suffix trees are almost as good as the predictions of the best model in the ensemble of models considered. Specifically, we prove that the extra loss incurred for the mixture depends only on the choice of the prior and does not grow with the number of examples. We also look at the problem of estimating the output probability functions at the nodes. We present two online parameter estimation schemes: a simple smoothing technique for small alphabets and a more sophisticated mixture-based parameter estimation technique, which is suited for large alphabets. Both schemes are adaptive, enabling us to output predictions while estimating the parameters of all the transducers in the mixture. We show for both schemes that the extra loss incurred by the online estimation (and prediction) algorithm grows sublinearly in the number of examples. Combining the mixture-weighing technique for all possible suffix trees with the mixture-based parameter estimation scheme yields a flexible and powerful learning algorithm. Put another way, no hard decisions are made by the learning algorithm, not even for the set of relevant parameters of each model. We conclude with simulations and experiments using natural data. The experimental results support the formal analysis and indicate that the learning algorithm is indeed able to track the best model in an arbitrarily large pool of models, yielding an accurate approximation of the source.

2 Mixtures of Suffix Tree Transducers

Let Σin and Σout be two finite alphabets. A suffix tree transducer T over (Σin, Σout) is a |Σin|-ary tree where every internal node of T has one child for each symbol in Σin. The nodes of the tree are associated with pairs (s, γ_s), where s is the string associated with the path (sequence of symbols in Σin) that leads from the root to that node, and γ_s : Σout → [0, 1] is an output probability function. A suffix tree transducer maps arbitrarily long input sequences over Σin to output sequences over Σout as follows. The probability that a suffix tree transducer T will output a symbol y ∈ Σout at time step n, denoted by P(y | x_1, . . . , x_n, T), is γ_{s_n}(y), where s_n is the string labeling the leaf reached by taking the path corresponding to x_n x_{n−1} x_{n−2} . . . starting at the root of T. We assume without loss of generality that the input sequence x_1, . . . , x_n was padded with a sufficiently long sequence . . . x_{−2} x_{−1} x_0 so that the leaf corresponding to s_n is well defined for any time step n ≥ 1. The probability that T will output a string y_1, . . . , y_n ∈ Σout^n given an input string x_1, . . . , x_n ∈ Σin^n, denoted by P(y_1, . . . , y_n | x_1, . . . , x_n, T), is therefore ∏_{k=1}^n γ_{s_k}(y_k). Note that only the leaves of T are used for prediction. A suffix tree transducer is thus a probabilistic mapping that induces a measure over the possible output strings given an input string. A submodel T′ of a suffix tree transducer T is obtained by removing some of the internal nodes and leaves of T and using the leaves of the pruned tree T′ for prediction. An example of a suffix tree transducer and two of its possible submodels is given in Figure 1. In this article we consider transducers based on complete suffix trees. That is, each node in a given suffix tree is either a leaf or all of its children are also in the tree. For a suffix tree transducer T and a node s ∈ T, we use the following notations:

• The set of all possible complete subtrees of T (including T itself) is denoted by Sub(T).
• The set of leaves of T is denoted by Leaves(T).
• The set of subtrees T′ ∈ Sub(T) such that s ∈ Leaves(T′) is denoted by L_s(T).
• The set of subtrees T′ ∈ Sub(T) such that s ∈ T′ and s ∉ Leaves(T′) is denoted by I_s(T).
the learning algorithm, not even for the set of relevant parameters of each model. We conclude with simulations and experiments using natural data. The experimental results support the formal analysis and indicate that the learning algorithm is indeed able to track the best model in an arbitrarily large pool of models, yielding an accurate approximation of the source. 2 Mixtures of Suffix Tree Transducers Let 6in and 6out be two finite alphabets. A suffix tree transducer T over (6in , 6out ) is a |6in |-ary tree where every internal node of T has one child for each symbol in 6in . The nodes of the tree are associated with pairs (s, γs ), where s is the string associated with the path (sequence of symbols in 6in ) that leads from the root to that node, and γs : 6out → [0, 1] is an output probability function. A suffix tree transducer maps arbitrarily long input sequences over 6in to output sequences over 6out as follows. The probability that a suffix tree transducer T will output a symbol y ∈ 6out at time step n, denoted by P(y | x1 , . . . , xn , T), is γsn (y), where sn is the string labeling the leaf reached by taking the path corresponding to xn xn−1 xn−2 . . . starting at the root of T. We assume without loss of generality that the input sequence x1 , . . . , xn was padded with a sufficiently long sequence . . . x−2 x−1 x0 so that the leaf corresponding to sn is well defined for any time step n ≥ 1. The n given an input probability that T will output a string y1 , . . . , yn in 6out n , . . . , x in 6 , denoted by P(y , . . . , y | x , . . . , x , T), is therefore string x 1 n 1 n 1 n in Qn k=1 γsk (yk ). Note that only the leaves of T are used for prediction. A suffix tree transducer is thus a probabilistic mapping that induces a measure over the possible output strings given an input string. A submodel T0 of a suffix tree transducer T is obtained by removing some of the internal nodes and leaves of T and using the leaves of the pruned tree T0 for prediction. An example of a suffix tree transducer and two of its possible submodels is given in Figure 1. In this article we consider transducers based on complete suffix trees. That is, each node in a given suffix tree is either a leaf or all of its children are also in the tree. For a suffix tree transducer T and a node s ∈ T, we use the following notations: • The set of all possible complete subtrees of T (including T itself) is denoted by Sub(T). • The set of leaves of T is denoted by Leaves(T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ Leaves(T0 ) is denoted by Ls (T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ T0 and s 6∈ Leaves(T0 ) is denoted by Is (T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ T0 is denoted by As (T).
[Figure 1 appears here: a binary suffix tree with nodes ε, 0, 1, 00, 01, 10, 11, 010, and 110, each annotated with an output probability function over the symbols a, b, c.]

Figure 1: A suffix tree transducer over (Σin, Σout) = ({0, 1}, {a, b, c}) and two of its possible submodels (subtrees). The strings labeling the nodes are the suffixes of the input string used to predict the output string. At each node, there is an output probability function defined for each of the possible output symbols. For instance, using the full suffix tree transducer, the probability of observing the symbol b given that the input sequence is . . . 010, is 0.1. The probability of the current output, when each transducer is associated with a weight (prior), is the weighted sum of the predictions of each transducer. For example, assume that the weights of the trees are 0.7 (full tree), 0.2 (large subtree), and 0.1 (small subtree); then the probability that the output y_n = a given that (x_{n−2}, x_{n−1}, x_n) = (0, 1, 0) is 0.7 · P(a | 010, T_1) + 0.2 · P(a | 10, T_2) + 0.1 · P(a | 0, T_3) = 0.7 · 0.8 + 0.2 · 0.7 + 0.1 · 0.5 = 0.75.
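As a minimal illustration of how such a transducer produces a prediction, consider the following Python sketch (the dict-based representation keyed by suffix strings is our own choice, not the article's):

    def predict(tree, x_hist, y):
        """P(y | input history) for a suffix tree transducer.

        tree   : dict mapping each node's suffix string to its output
                 probability function (a dict: symbol -> probability);
                 the root is the empty string ""
        x_hist : input symbols x_1 .. x_n, most recent last
        """
        s = ""
        for x in reversed(x_hist):  # walk x_n, x_{n-1}, ... from the root
            if x + s not in tree:
                break               # s has no child labeled x, so s is a leaf
            s = x + s
        return tree[s][y]           # only leaves are used for prediction

For the full transducer of Figure 1 and input . . . 010, the walk stops at the leaf labeled 010, giving P(b | . . . 010) = tree["010"]["b"] = 0.1, as in the caption.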
Whenever it is clear from the context, we will omit the dependency on T and denote the above sets as Ls , Is , and As . Note that the above definitions imply that As = Ls ∪ Is . For a given suffix tree transducer T we are interested in the mixture of all possible submodels (subtrees) of T. We associate with each subtree (including T itself) a weight that can be interpreted as its prior probability. We later show how the learning algorithm for a mixture of suffix tree transducers adapts these weights in accordance with the performance (evidence, in Bayesian terms) of each subtree on past observations. Direct calculation of the mixture probability is infeasible since there might be an enormous
number of such subtrees. However, as shown subsequently, the technique introduced in Willems et al. (1995) can be generalized and applied to our setting. Let T′ be a subtree of T. Denote by n_1 the number of internal nodes of T′ and by n_2 the number of leaves of T′ that are not leaves of T. For example, n_1 = 2 and n_2 = 1 for the small subtree of the suffix tree transducer depicted in Figure 1. We define the prior weight of a tree T′, denoted by P_0(T′), to be (1 − α)^{n_1} α^{n_2}, where α ∈ (0, 1). It can be easily verified by induction on the number of nodes of T that this definition of the weights is a proper measure, that is, ∑_{T′∈Sub(T)} P_0(T′) = 1. This distribution over suffix trees can be extended to trees of unbounded depth assuming that T is an infinite |Σin|-ary suffix tree transducer. The prior weights can be interpreted as if they were created by the following recursive probabilistic process. We start with a suffix tree that includes only the root node. With probability α, we stop the process, and with probability 1 − α, we add all the possible |Σin| children of the node and continue the process recursively for each child. For a prior probability distribution over suffix trees of unbounded depth, the process ends when no new children were created. For bounded-depth trees, the recursive process stops if either no new children were created at a node or if the node is a leaf of the maximal suffix tree T. This notion of recursive equations for hierarchical mixtures of experts architecture has been used before (Jordan & Jacobs, 1994). In the context of suffix tree transducers, such a recursive formulation has several merits. First, as we show in the next section, it enables us to calculate the prediction of a mixture that includes a huge number of models, which cannot be computed directly by calculating the contribution of each individual model in the mixture. Second, the self-similar structure of the prior enables the pool of models to grow on-the-fly as more data arrive, such that the current mixture is a refinement of its predecessor. Last, the recursive definition of the prior probability distribution implies that larger (and more complex) suffix trees are assigned a smaller prior probability. Thus, the influence of large suffix trees, which might overfit the data, is contrasted with their initial small prior probability, and the danger that the entire mixture of suffix trees would overfit the data is relatively small (if not negligible). Each node in the pool of suffix trees has two roles: it can be either a leaf and used for prediction or an internal node that simply resides on a path to one of its possible children nodes. Thus, the weight α can be interpreted as the prior probability that the node is a leaf of any transducer T′ ∈ Sub(T), that is,

α = ∑_{T′∈L_s} P_0(T′) / ∑_{T′∈A_s} P_0(T′).   (2.1)

Therefore, for a bounded-depth transducer T, if s ∈ Leaves(T), then
A_s = L_s ⇒ α = 1, and the recursive process stops with probability 1. In fact, we can associate a different prior probability α_s with each node s ∈ T. For the sake of brevity, we omit the dependency on the node s and assume that all the priors for internal nodes are equal, that is, ∀s ∉ Leaves(T): α_s = α. Using the recursive prior over suffix tree transducers, we can calculate the prediction of the mixture of all T′ ∈ Sub(T) on the first output symbol y = y_1 as follows,

αγ_ε(y) + (1 − α)(αγ_{x_1}(y) + (1 − α)(αγ_{x_0 x_1}(y) + (1 − α)(αγ_{x_{−1} x_0 x_1}(y) + (1 − α) . . .))).   (2.2)

The calculation ends when a leaf of T is reached or when the beginning of the input sequence is encountered. Therefore, the prediction time of a single symbol is bounded by the maximal depth of T, or the length of the input sequence if T is infinite. Let us assume from now on that T is finite. (A similar derivation holds when T is infinite.) Let γ̃_s(y) be the prediction propagating from a node s labeled by x_l, . . . , x_1 (we give a precise definition of γ̃_s in the next section), that is,

γ̃_s(y) = γ̃_{x_l,...,x_1}(y) = αγ_{x_l,...,x_1}(y) + (1 − α)(αγ_{x_{l−1},...,x_1}(y) + (1 − α)(αγ_{x_{l−2},...,x_1}(y) + (1 − α)(. . . + γ_{x_j,...,x_1}(y)) . . .))),   (2.3)

where x_j, . . . , x_1 is the label of a leaf in T. Let σ = x_{1−|s|}; then the above sum can be evaluated recursively as follows:

γ̃_s(y) = { γ_s(y)                              if s ∈ Leaves(T),
          { α γ_s(y) + (1 − α) γ̃_{σs}(y)       otherwise.   (2.4)

The prediction of the whole mixture of suffix tree transducers on y is the prediction propagated from the root node ε, namely, γ̃_ε(y). For example, for the input sequence . . . 0110, output symbol y = b, and α = 1/2, the predictions propagated from the nodes of the full suffix tree transducer from Figure 1 are γ̃_110(b) = 0.4, γ̃_10(b) = 0.5 γ_10(b) + 0.5 · 0.4 = 0.3, γ̃_0(b) = 0.5 γ_0(b) + 0.5 · 0.3 = 0.25, γ̃_ε(b) = 0.5 γ_ε(b) + 0.5 · 0.25 = 0.25.

3 An Online Learning Algorithm

We now describe an efficient learning algorithm for the mixture of suffix tree transducers. The learning algorithm uses the recursive prior and the evidence to update efficiently the posterior probability of each possible
transducer in the mixture. In this section, we assume that the output probability functions at the nodes are known. Hence, at each time step, we need to evaluate the following,

P(y_n | x_1, . . . , x_n) = ∑_{T′∈Sub(T)} P(y_n | x_1, . . . , x_n, T′) P_{n−1}(T′),   (3.1)
where P_n(T′) is the posterior weight of the suffix tree transducer T′ after n input-output pairs, (x_1, y_1), . . . , (x_n, y_n). Omitting the dependency on the input sequence for the sake of brevity, P_n(T′) equals

P_n(T′) = P_0(T′) ∏_{i=1}^n P(y_i | T′) / [ ∑_{T″∈Sub(T)} P_0(T″) ∏_{i=1}^n P(y_i | T″) ]
        = P_0(T′) P(y_1, . . . , y_n | T′) / [ ∑_{T″∈Sub(T)} P_0(T″) P(y_1, . . . , y_n | T″) ].   (3.2)
Direct evaluation of the above is again infeasible due to the fact that there is an enormous number of submodels of T. However, using the technique of recursive evaluation as in equation 2.4, we can efficiently calculate the prediction of the mixture. Let s be a node in T that is reached at time step n (s = x_{n−|s|+1}, . . . , x_n). We denote by 1 ≤ i_1 < i_2 < · · · < i_l = n the time steps at which the node s was reached when observing a given sequence of n input-output pairs. Similar to the definition of the recursive prior α, we define q_n(s) to be the posterior probability (on the subsequences reaching s) that the node s is a leaf. Analogously to the interpretation of the prior as a ratio of probabilities of suffix trees, q_n(s) is the ratio between the weighted predictions of all subtrees for which the node s is a leaf and the weighted predictions of all subtrees that include s. That is,

q_n(s) = ∑_{T′∈L_s} P_0(T′) P(y_{i_1}, . . . , y_{i_l} | T′) / ∑_{T′∈A_s} P_0(T′) P(y_{i_1}, . . . , y_{i_l} | T′).   (3.3)

We can now compute the predictions of the nodes along the path defined by x_n x_{n−1} x_{n−2} . . . simply by replacing the prior weight α with the posterior weight q_{n−1}(s),

γ̃_s(y_n) = { γ_s(y_n)                                                  if s ∈ Leaves(T),
           { q_{n−1}(s) γ_s(y_n) + (1 − q_{n−1}(s)) γ̃_{σs}(y_n)        otherwise,   (3.4)

where σ = x_{n−|s|}. We show in theorem 1 that γ̃_s is the following weighted prediction of all the suffix tree transducers that include the node s,

γ̃_s(y_n) = ∑_{T′∈A_s} P_0(T′) P(y_{i_1} . . . y_{i_l} | T′) / ∑_{T′∈A_s} P_0(T′) P(y_{i_1} . . . y_{i_{l−1}} | T′).   (3.5)
In order to update q_n(s) we introduce one more variable, which we denote by r_n(s). Setting r_0(s) = log(α/(1 − α)) for all s, r_n(s) is updated as follows:

r_n(s) = r_{n−1}(s) + log(γ_s(y_n)) − log(γ̃_{σs}(y_n)).   (3.6)
Based on the definition of q_n(s), r_n(s) is the log-likelihood ratio between the weighted predictions of the subtrees for which s is a leaf and the weighted predictions of the subtrees for which s is an internal node. The new posterior weights q_n(s) are calculated from r_n(s) using a sigmoid function,

q_n(s) = 1/(1 + e^{−r_n(s)}).   (3.7)
For a node s′ that is not reached at time step n, we simply define r_n(s′) = r_{n−1}(s′) and q_n(s′) = q_{n−1}(s′). In summary, for each new observation pair, we traverse the tree by following the path that corresponds to the input sequence x_n x_{n−1} x_{n−2} . . . The predictions at the nodes along the path are calculated using equation 3.4. Given these predictions, the posterior probabilities of being a leaf are updated for each node s along the path using equations 3.6 and 3.7. Finally, the probability of y_n induced by the whole mixture is the probability propagated from the root node, as stated by the following theorem. (The proof of the theorem is given in the appendix.)

Theorem 1.
γ̃_ε(y_n) = ∑_{T′∈Sub(T)} P(y_n | T′) P_{n−1}(T′).
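The whole procedure fits in a few lines. Below is a hedged Python sketch of one time step (equations 3.4, 3.6, and 3.7); the dict-based tree representation and the lazy default for unvisited nodes are our own choices, not the article's:

    import math

    def mixture_step(gamma, r, x_hist, y, alpha=0.5):
        """Return the mixture probability of y_n (Theorem 1); update r in place.

        gamma  : dict suffix -> output function (dict symbol -> prob);
                 "" is the root epsilon
        r      : dict suffix -> log-odds r(s); missing entries mean r_0(s)
        x_hist : input symbols x_1 .. x_n, most recent last
        """
        r0 = math.log(alpha / (1.0 - alpha))
        # Nodes visited at time n: the root, x_n, x_{n-1}x_n, ..., to a leaf.
        path, s = [""], ""
        for x in reversed(x_hist):
            if x + s not in gamma:
                break
            s = x + s
            path.append(s)
        # Bottom-up predictions, equation 3.4 (the deepest node acts as a leaf).
        g = [0.0] * len(path)
        g[-1] = gamma[path[-1]][y]
        for i in range(len(path) - 2, -1, -1):
            q = 1.0 / (1.0 + math.exp(-r.get(path[i], r0)))       # equation 3.7
            g[i] = q * gamma[path[i]][y] + (1.0 - q) * g[i + 1]
        # Posterior update, equation 3.6, for internal nodes on the path.
        for i in range(len(path) - 1):
            r[path[i]] = (r.get(path[i], r0)
                          + math.log(gamma[path[i]][y]) - math.log(g[i + 1]))
        return g[0]

With the transducer of Figure 1, fresh log-odds, α = 1/2, and an input history ending in 0, 1, 1, 0, the call returns 0.25 for y = b, matching the worked example above.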
We now discuss the performance of a mixture of suffix tree transducers. Let Loss_n(T) be the negative log-likelihood of a suffix tree transducer T achieved on n input-output pairs,

Loss_n(T) = − ∑_{i=1}^n log2(P(y_i | T)).   (3.8)

Similarly, the loss of the mixture is defined to be

Loss_n^mix = − ∑_{i=1}^n log2( ∑_{T′∈Sub(T)} P(y_i | T′) P_{i−1}(T′) ) = − ∑_{i=1}^n log2(γ̃_ε(y_i)).   (3.9)
The advantage of using a mixture of suffix tree transducers over a single suffix tree is due to the robustness of the solution, in the sense that the predictions of the mixture are almost as good as the predictions of the best suffix tree in the mixture, as shown in the following theorem.
Theorem 2. Let T be a (possibly infinite) suffix tree transducer, and let (x_1, y_1), . . . , (x_n, y_n) be any possible sequence of input-output pairs. The loss of the mixture is at most

min_{T′∈Sub(T)} { Loss_n(T′) − log2(P_0(T′)) }.
and let K = |6out |. A commonly used estimator approximates the output probability function at time step n + 1 by adding a constant δ to each count cn (y), γ (y) ≈ γˆ n (y) = P
cn (y) + δ cn (y) + δ = . 0 n+δK y0 ∈6out cn (y ) + δ K
(4.1)
The estimator that achieves the best likelihood on a given sequence of length n, called the maximum likelihood estimator (MLE), is attained for δ = 0: γML (y) =
cn (y) . n
(4.2)
The MLE is evaluated in batch, after the entire sequence is observed, while the online estimator defined by equation 4.1 can be used on the fly as new symbols arrive. The special case of δ = 12 is called Laplace’s modified rule
1720
Yoram Singer
of succession or the add 12 estimator. Krichevsky and Trofimov (1981) proved that the likelihood attained by the add 12 estimator is almost as good as the likelihood achieved by the maximum likelihood estimator. Applying the bound of Krichevsky and Trofimov in our setting results in the following bound on the predictions of the add 12 estimator, −
n X
log2 (γˆ i−1 (yi )) = −
i=1
≤−
n X
µ log2
i=1 n X i=1
ci−1 (yi ) + 1/2 i − 1 + K/2
µ log (cn (yi )/n) +(K − 1) } | 2 {z
¶
¶ 1 log2 (n) + 1 . 2
(4.3)
=log2 (γML (yi ))
For a mixture of suffix tree transducers, we use the add 12 estimator at each node separately. To do so, we simply need to store counts of the number of appearances of each symbol at each node of the maximal suffix tree transducer T. Let T0 ∈ Sub(T) be any suffix tree transducer from the pool of possible submodels whose parameters are set so that it achieves the maximum likelihood on (x1 , y1 ), . . . , (xn , yn ), and let Tˆ 0 be the same suffix tree transducer but with parameters adaptively estimated using the add 12 estimator. Denote by ns the number of times a node s ∈ Leaves(T0 ) was reached on the input-output sequence, and let l(T0 ) = |Leaves(T0 )| be the total number of leaves of T0 . From theorem 1 we get, ˆ0 ˆ0 Lossmix n ≤ Lossn (T ) − log2 (P0 (T )).
(4.4)
From the bound of equation 4.3 on the add 12 estimator we get, X ¡ ¢ ˆ 0 ) ≤ Lossn (T0 ) + Lossn (T 1/2(K − 1) log2 (ns ) + K − 1 s∈Leaves(T0 )
X 1 l(T0 ) (K − 1) log2 (ns ) 0 2 s l(T ) ¶ µP l(T0 ) s ns 0 0 (K − 1) log2 ≤ Lossn (T ) + l(T )(K − 1) + 2 l(T0 ) ¶ ¶ µ µ 1 n log2 +1 , (4.5) = Lossn (T0 ) + l(T0 ) (K − 1) 2 l(T0 )
= Lossn (T0 ) + l(T0 )(K − 1) +
where we used the concavity of the logarithm function to obtain the last inequality. Finally, combining equations 4.4 and 4.5 we get, 0 0 Lossmix n ≤ Lossn (T ) − log2 (P0 (T )) ¶ ¶ µ µ n 1 0 log2 +1 . + l(T ) (K − 1) 2 l(T0 )
(4.6)
Adaptive Mixtures of Probabilistic Transducers
1721
The above bound holds for any T0 ∈ Sub(T); thus we obtain the following corollary: Corollary 3. Let T be a (possibly infinite) suffix tree transducer, and let (x1 , y1 ), . . . , (xn , yn ) be any possible sequence of input-output pairs. The loss of the mixture is at most ½ min
T0 ∈Sub(T)
Lossn (T0 ) − log2 (P0 (T0 )) µ
1 log2 + l(T )(K − 1) 2 0
µ
n l(T0 )
¶
¶¾ +1
,
where the parameters of each T0 ∈ Sub(T) are set so as to maximize the likelihood of T0 on the entire input-output sequence. The running time of the algorithm is K·D·n, where D is the maximal depth of T or K2 (n + 1)2 when T is of an unbounded depth.
Note that the above corollary holds for any n. Therefore, the mixture model tracks the best transducer, although that transducer and its parameters change in time. The additional cost we need to pay for not knowing ahead of time the transducer’s structure and parameters grows sublinearly with the length of the input-output sequence like O(log(n)). Normalizing this additional cost term by the length of the sequence, we get that the predictions of the adaptively estimated mixture lag behind the predictions of the best transducer by an additive factor, which behaves like C log(n)/n→0, where the constant C depends on the total number of parameters of the best transducer. When K is large, the smoothing of the output probabilities using the add 12 estimator is too crude. Furthermore, in many real problems, only a small subset of the full output alphabet is observed in a given context (a node in the tree). For example, when mapping phonemes to phones (Riley, 1991), for a given sequence of input phonemes, the possible phones that can be pronounced are limited to a few possibilities (typically, two to four). Therefore, we would like to devise an estimation scheme whose performance depends on the actual local (context-dependent) alphabet, not on the full alphabet. Such an estimation scheme can be devised by employing a mix0 of 6 . Although ture of models—one model for each possible subset 6out out K there are 2 possible subsets of 6out , we now show that if the estimation technique depends only on the size of each subset, then the whole mixture can be maintained in time linear in K. As before, we omit the dependency on the node and treat the subsequence 0 ) the that reaches a node as an individual sequence. Denote by γˆ n (y | 6out estimate of γ (y) after n observations assuming that the actual alphabet is
1722
Yoram Singer
0 . Using the add 1 estimator, we get 6out 2 0 0 γˆ n (y | 6out ) = γˆ n (y | |6out | = i) = (cn (y) + 1/2)/(n + i/2).
(4.7)
0 | = i) as γˆ n (y | i). For the sake of brevity we simply denote γˆ n (y | |6out n Let 6out be the set of different output symbols that were observed at a node, that is,
© ª n def = σ | σ ∈ 6out , σ = yi , 1 ≤ i ≤ n . 6out def
0 n to be the empty set and let Kn = |6out |. Since each possible We define 6out 0 of size i = |6 0 | should contain all symbols in 6 n , there are alphabet 6out out out ¡ ¢ n be the posterior probability n exactly K−K possible alphabets of size i. Let w i i−Kn (after n observations) of any alphabet of size i. The prediction of the mixture 0 such that 6 n ⊆ 6 0 ⊆ 6 of all possible subsets 6out out is out out
γˆ n (y) =
¶ K µ X K − Kn j=Kn
j − Kn
wjn γˆ n (y | j).
(4.8)
Evaluation of this sum requires O(K) operations (and not O(2K )). Furthermore, we can compute equation 4.8 in an online fashion as follows. Let Win be the (prior weighted) likelihood of all alphabets of size i: def
Win =
X 0 |6out |=i
w0i
n Y
0 γˆ k−1 (yk | 6out )
k=1
µ ¶ n Y K − Kn = γˆ k−1 (yk | i). w0i i − Kn k=1
(4.9)
Without loss of generality, let us assume that all alphabets are equally likely ¡ ¢ 0 ) = 1/2K . Therefore, W 0 = K /2K .1 The weights a priori, that is, P0 (6out i i Win can be updated incrementally depending on the following cases. If the n | > i), number of different symbols observed so far exceeds a given size (|6out then all alphabets of this size are too small and should be eliminated from the mixture by setting their posterior probability to zero. Otherwise, if the n ), the output probability is next symbol was observed before (yn+1 ∈ 6out 1 approximated by the add 2 estimator. Last, if the next symbol is entirely new 1 One can easily define and use nonuniform priors. For instance, the assumption that all alphabet sizes are equally likely a priori and all alphabets of the same size have the same prior probability results in a prior that favors small and large alphabets. Such a prior is used in the example in Figure 2.
Adaptive Mixtures of Probabilistic Transducers
1723
n+1 n (yn+1 6∈ 6out ), we need to rescale Win for i ≥ |6out | for the following reason. n+1 n n Since yn+1 6∈ 6out , then 6out = 6out ∪ {yn+1 } ⇒ Kn+1 = Kn + 1 and
µ
K − Kn+1 i − Kn+1
¶ µ µ ¶ K − Kn i − Kn K − Kn − 1 = . i − Kn K − Kn i − Kn − 1
¶ =
Hence, the number of possible alphabets of size i has decreased by a factor of (i − Kn )/(K − Kn ) and each weight Win+1 (i ≥ Kn+1 ) has to be multiplied by this factor in addition to the prediction of the add 12 estimator on yn+1 . The above three cases can be summarized as an update rule of Win+1 from Win as follows,
Win+1 = Win ×
0
if Kn+1 > i
cn (yn+1 )+1/2 n+i/2 i−Kn 1/2 K−Kn n+i/2
n if Kn+1 ≤ i and yn+1 ∈ 6out . (4.10) n if Kn+1 ≤ i and yn+1 6∈ 6out
Applying Bayes rule for the mixture of all possible subsets of the output alphabet, we get

$$\hat{\gamma}^n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^n_i = \sum_{i=K_n}^{K} W^{n+1}_i \Big/ \sum_{i=K_n}^{K} W^n_i. \tag{4.11}$$
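To make the bookkeeping concrete, here is a minimal Python sketch (our own illustration, not the paper's code; the class name and method layout are assumptions) of the $O(K)$-time mixture over alphabet sizes defined by equations 4.7 through 4.11:

```python
from math import comb

class AlphabetSizeMixture:
    """Online mixture of add-1/2 estimators, one per alphabet size,
    maintained in O(K) per symbol (sketch of equations 4.7-4.11)."""

    def __init__(self, K):
        self.K = K
        # Uniform prior over subsets: W_i^0 = C(K, i) / 2^K; the empty
        # alphabet (i = 0) can never explain an observation, so drop it.
        self.W = [0.0] + [comb(K, i) / 2 ** K for i in range(1, K + 1)]
        self.counts = {}  # c_n(y): occurrences of each observed symbol
        self.n = 0        # number of observations so far

    def _factor(self, i, y):
        # Per-size multiplicative factor of equation 4.10; it equals the
        # total prediction of all alphabets of size i on the next symbol y.
        Kn = len(self.counts)
        new = y not in self.counts
        if i == 0 or i < (Kn + 1 if new else Kn):
            return 0.0  # size-i alphabets cannot contain all seen symbols
        if new:
            return ((i - Kn) / (self.K - Kn)) * (0.5 / (self.n + i / 2))
        return (self.counts[y] + 0.5) / (self.n + i / 2)  # add-1/2, eq. 4.7

    def predict(self, y):
        # Equation 4.11: sum_i W_i^{n+1} / sum_i W_i^n.
        num = sum(w * self._factor(i, y) for i, w in enumerate(self.W))
        return num / sum(self.W)

    def update(self, y):
        # Equation 4.10, then record the observation.
        for i in range(self.K + 1):
            self.W[i] *= self._factor(i, y)
        self.counts[y] = self.counts.get(y, 0) + 1
        self.n += 1
```

Each symbol costs one pass over the $K+1$ size weights, which is where the linear-in-$K$ bound in the text comes from.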
Let $\widetilde{W}^n_i$ be the posterior probability of all possible alphabets of size $i$, that is, $\widetilde{W}^n_i \stackrel{\mathrm{def}}{=} W^n_i / \sum_j W^n_j$. Let $P^n_i(y_{n+1}) = W^{n+1}_i / W^n_i$ be the sum of the predictions of all possible alphabets of size $i$ on $y_{n+1}$. Thus, the prediction of the mixture $\hat{\gamma}^n(y_{n+1})$ can be rewritten as

$$\hat{\gamma}^n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^n_i = \sum_{i=1}^{K} \widetilde{W}^n_i\, P^n_i(y_{n+1}).$$
In Figure 2 we give an illustration of the values obtained for $\widetilde{W}^n_i$ in the following experiment. We simulated a multinomial source whose output symbols belong to an alphabet of size $|\Sigma_{out}| = 10$ and set the probabilities of observing any of the last five symbols to zero. Therefore, the actual alphabet is of size 5. The posterior probabilities $\widetilde{W}^n_i$ ($1 \le i \le 10$) were calculated after each iteration. These probabilities are plotted left to right and top to bottom. The first observations rule out alphabets of size smaller than 5, setting their posterior probability to zero. After a few observations, the posterior probability is concentrated around the actual size, yielding an accurate and adaptive estimate of the actual alphabet size of the source.
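The same behavior can be mimicked with the `AlphabetSizeMixture` sketch above; symbol identities and the random seed below are arbitrary choices of ours:

```python
# Reproduce the spirit of the Figure 2 experiment: 10 possible symbols,
# but only 5 ever occur, so the posterior should concentrate near size 5.
import random

random.seed(0)
mix = AlphabetSizeMixture(K=10)
for step in range(1, 101):
    y = random.randrange(5)  # symbols 5..9 never occur
    mix.update(y)
    if step in (5, 20, 100):
        total = sum(mix.W)
        print(step, [round(w / total, 2) for w in mix.W])
```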
Figure 2: An illustration of adaptive estimation of the alphabet size for a multinomial source with a large number of possible outcomes when only a subset of the full alphabet is actually observed. In this illustration, examples were randomly drawn from an alphabet of size 10, but only five symbols have a nonzero probability. The posterior probability concentrates on the right size after fewer than 20 iterations. The initial prior used in this example assigns an equal probability to the different alphabet sizes.
Let $\gamma_{ML}$ denote again the MLE of $\gamma$ evaluated after $n$ input-output observation pairs:

$$\gamma_{ML}(y) = \begin{cases} c_n(y)/n & y \in \Sigma^n_{out} \\ 0 & y \notin \Sigma^n_{out} \end{cases}. \tag{4.12}$$
We can now apply the same technique used to bound the loss of a mixture of transducers to bound the loss of a mixture of all possible subsets of the output alphabet. First, we bound the likelihood of the mixture of add-1/2 estimators based on all possible subsets of $\Sigma_{out}$. We do so by comparing the mixture's likelihood to the likelihood achieved by a single add-1/2 estimator that uses the "right" size (i.e., the estimator that uses the final alphabet, $\Sigma^n_{out}$) for prediction:

$$-\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i)) \le -\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) - \log_2(P_0(\Sigma^n_{out})) = -\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) + \log_2\!\left(2^K\right). \tag{4.13}$$
Using the bound on the likelihood of the add-1/2 estimator from equation 4.3, we get

$$-\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) \le -\sum_{i=1}^{n} \log_2(\gamma_{ML}(y_i)) + (K_n - 1)\left(\tfrac{1}{2}\log_2(n) + 1\right). \tag{4.14}$$
Combining the bounds from equations 4.13 and 4.14, we obtain a bound on the predictions of the mixture with online estimate parameters relative to the predictions of maximum likelihood model,
−
n X
log2 (γˆ i−1 (yi ))
i=1
≤−
n X
log2 (γML (yi )) + (Kn − 1)(1/2 log2 (n) + 1) + K.
(4.15)
i=1 2K
It is easy to verify that if n ≥ 2 K−Kn , which is the case when the observed alphabet is small (Kn ¿ K), then a mixture of add 12 predictors, one for each possible subset of the alphabet, achieves a better performance than a single add 12 predictor based on the full alphabet. Applying the online mixture estimation technique twice, first for the mixture of all possible submodels of a suffix tree transducer T and then for the parameters of each model in the mixture, yields an accurate and efficient online learning algorithm for both the structure and the parameters of the source. As before, let T0 ∈ Sub(T) be a suffix tree transducer whose parameters are set so as to maximize the likelihood of T0 on the entire input-output sequence. l(T0 ) is again the number of leaves of T0 , and ns the number of times a node s ∈ Leaves(T0 ) was reached on a sequence of n input-output pairs. Denote by ms the number of nonzero parameters of the output probability ∀n0 ≤ n, the value function of T0 at node s. Note that for a given node s andP 0 0 0 Kn for that node satisfies Kn ≤ ms . Denote by m(T ) = s∈Leaves(T0 ) ms the total number of nonzero parameters of the maximum likelihood transducer T0 . Combining the bound on the mixture of transducers from theorem 1 with the bound on the mixture of parameters from equation 4.15 while following the same steps used to derive equation 4.6, we get that
$$\mathrm{Loss}^{mix}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \sum_{s \in \mathrm{Leaves}(T')} \left[ (m_s - 1)\left(\tfrac{1}{2}\log_2(n_s) + 1\right) + K \right] \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + l(T')(K - 1) + m(T') + \frac{1}{2} \sum_{s \in \mathrm{Leaves}(T')} (m_s - 1)\log_2(n_s). \tag{4.16}$$

Now,

$$\sum_{s} (m_s - 1)\log_2(n_s) = \sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) + \sum_{s} (m_s - 1)\log_2(m_s - 1) \le \sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) + m(T')\log_2(K).$$
Using the log-sum inequality (Cover & Thomas, 1991), we get

$$\sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) \le \left(\sum_{s} m_s\right) \log_2\!\left(\frac{\sum_s n_s}{\sum_s (m_s - 1)}\right) = m(T')\log_2\!\left(\frac{n}{m(T') - l(T')}\right).$$
Therefore,

$$\mathrm{Loss}^{mix}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \frac{1}{2}\, m(T')\log_2\!\left(\frac{n}{m(T') - l(T')}\right) + l(T')(K - 1) + m(T')\log_2(2K). \tag{4.17}$$
As before, the above bound holds for any $T' \in \mathrm{Sub}(T)$. Therefore, the mixture model again tracks the best transducer in the pool, but the additional cost for specifying the parameters is smaller. Put another way, the predictions of a mixture of suffix tree transducers, each with parameters estimated using a mixture of all possible alphabets, lag behind the predictions of the maximum likelihood model by an additive factor that again scales like $C\log(n)/n$; but the constant $C$ now depends on the total number of nonzero parameters of the maximum likelihood model. Using efficient data structures to maintain the pointers to all the children of a node, both the time and the space complexity of the algorithm are $O(Dn|\Sigma_{out}|)$ ($O(n^2|\Sigma_{out}|)$ for unbounded-depth transducers). Furthermore, if the model is kept fixed after the training phase, we can evaluate the final estimate of the output functions once for each node, and the time complexity during a test (prediction only) phase reduces to $O(D)$ per observation.
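As a concrete reading of the $O(D)$ test-time claim, here is a small hedged sketch of the per-observation lookup; the attribute names `children` and `output_dist` are our assumptions, not the paper's API:

```python
def predict_next_output_dist(root, inputs, n):
    """Walk the suffix tree along x_n, x_{n-1}, ... (at most D steps for a
    depth-D tree, or until the start of the sequence) and return the deepest
    matching node's frozen output distribution."""
    node, depth = root, 0
    while depth < n:
        child = node.children.get(inputs[n - 1 - depth])  # next-older symbol
        if child is None:
            break  # deepest matching context found
        node, depth = child, depth + 1
    return node.output_dist
```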
Figure 3: (Left) Performance comparison of the predictions of a single model and two mixture models. The y-axis is the normalized negative log-likelihood, where the base of the log is $|\Sigma_{out}|$. (Right) The relative weights of three transducers in the mixture.
5 Evaluation and Applications

In this section we present evaluation results of the model and its learning algorithm on artificial data. We also discuss and present results obtained from learning the syntactic structure of noun phrases. In all the experiments described in this section, $\alpha$ was set to 1/2. The low time and space complexity of the learning algorithm enables us to evaluate the algorithm on millions of input-output pairs in a few minutes. For example, the average update time for a mixture of suffix tree transducers of maximal depth 10 when $|\Sigma_{out}| = 4$ is about 0.2 millisecond on an SGI Indy 200 MHz workstation. Results of the online predictions of the various estimation schemes are shown on the left-hand side of Figure 3. In this example, $\Sigma_{out} = \Sigma_{in} = \{1, 2, 3, 4\}$. The description of the source is as follows. If $x_n \ge 3$, then $y_n$ is uniformly distributed over $\Sigma_{out}$; otherwise ($x_n \le 2$), $y_n = x_{n-5}$ with probability 0.9 and $y_n = 5 - x_{n-5}$ with probability 0.1. The input sequence $x_1 x_2 \ldots$ was created entirely at random. This source can be implemented by a sparse suffix tree transducer of maximal depth 5. Note that the actual size of the alphabet is only 2 at half of the leaves of that tree. We used a mixture of suffix tree transducers of maximal depth 20 to learn the source. The normalized negative log-likelihood values are given for (1) a mixture model of suffix tree transducers with parameters estimated using a mixture of all subsets of the output alphabet, (2) a mixture model of the possible suffix tree transducers with parameters estimated using the add-1/2 scheme, and (3) a single model of depth 8 whose parameters are estimated using the add-1/2 scheme (note that the source creating the examples can be embedded in this model).
After a relatively small number of examples, the prediction of the mixture model converges to the prediction of the source, much faster than the single model. Furthermore, applying the mixture estimation to both the structure and the parameters significantly accelerates the convergence. In fewer than 10,000 examples, the cross-entropy between the mixture model and the source is less than 0.01 bit; even after 50,000 examples, the cross-entropy between the source and a single model is about 0.1 bit. On the right-hand side of Figure 3, we show how the mixture weights evolve as a function of the number of examples. At the beginning, the weights are distributed among all models, where the weight of a model is inversely proportional to its size. When there are only a few examples, the estimated parameters are inaccurate, and the predictions of all models are rather poor. Hence, the posterior weights remain close to the initial prior weights. As more data arrive, the predictions of the transducers of a large enough structure (i.e., the source itself and larger trees in which the source can be embedded) become more and more accurate. Thus, the sum of their weights grows larger at the expense of the small models. To illustrate the above, we plot in Figure 3 the relative weight as a function of the number of examples for three transducers in the mixture. The first, $T_S$, is based on a suffix tree that is strictly contained in the suffix tree transducer that generated the examples. Thus, this transducer is too small to approximate the source accurately. The second, $T_C$, is the transducer generating the examples, and the third, $T_L$, is a transducer based on a large suffix tree in which the true source can be embedded. The predictions of the small model are consistently wrong, and its relative weight decreases to almost zero after fewer than 10,000 input-output pairs. Eventually the parameters of $T_C$ and $T_L$ become accurate, and their predictions are essentially the same. Thus, their relative weights converge to the relative prior weights: $P_0(T_C)/(P_0(T_C) + P_0(T_L))$ and $P_0(T_L)/(P_0(T_C) + P_0(T_L))$. As mentioned in the introduction, there is a broad range of applications of probabilistic transducers. Here we briefly demonstrate and discuss how to induce an English noun phrase recognizer based on a suffix tree transducer. Recognizing noun phrases is an important task in automatic natural text processing for applications such as information retrieval, translation tools, and data extraction from texts. A common practice is to recognize noun phrases by first analyzing the text with a part-of-speech tagger, which assigns the appropriate part of speech (verb, noun, adjective, etc.) to each word in context. Then noun phrases are identified by manually defined regular expression patterns that are matched against the part-of-speech sequences. An alternative approach is to learn a mapping from the part-of-speech sequence to an output sequence that identifies the words that belong to a noun phrase (Church, 1988). We followed the second approach by building a suffix tree transducer based on a labeled data set from the Penn Tree Bank corpus. We defined $\Sigma_{in}$ to be the set of possible part-of-speech tags and set $\Sigma_{out} = \{0, 1\}$, where an output symbol given its corresponding input
Table 1: Extraction of Noun Phrases Using a Suffix Tree Transducer.

Sentence:            Tom   Smith  ,     group  chief  executive  of    U.K.  metals
Part-of-speech tag:  PNP   PNP    ,     NN     NN     NN         IN    PNP   NNS
Class:               1     1      0     1      1      1          0     1     1
Prediction:          0.99  0.99   0.01  0.98   0.98   0.98       0.02  0.99  0.99

Sentence:            and   industrial  materials  ,     will  become  chairman  .
Part-of-speech tag:  CC    JJ          NNS        ,     MD    VB      NN        .
Class:               1     1           1          0     0     0       1         0
Prediction:          0.67  0.96        0.99       0.03  0.03  0.01    0.87      0.01
Notes: In this typical example, two long noun phrases were identified correctly with high confidence. PNP = possessive nominal pronoun; NN = singular or mass noun; IN = preposition; NNS = plural noun; CC = coordination conjunction; JJ = adjective; MD = modal verb; VB = verb, baseform.
sequence (the part-of-speech tags of the words constituting the sentence) is 1 iff the word is part of a noun phrase. We used 250,000 marked tags and tested the performance on 37,000 tags. The performance of the suffix-tree-transducer-based tagger is tested by freezing the model structure, the mixture weights, and the estimated parameters. The suffix tree transducer was of maximal depth 15; hence very long phrases could potentially be identified. By comparing the predictions of the transducer to a threshold (1/2), we labeled each tag in the test data as either a member of a noun phrase or a filler. An example of the prediction of the transducer and the resulting classification is given in Table 1. We found that 2.4 percent of the words were labeled incorrectly. For comparison, when a single (zero-depth) suffix tree transducer is used to classify words, the error rate is about 15 percent. It is impractical to compare the performance of our transducer-based tagger to previous works due to the many differences in the train and test data sets used to construct the different systems. Nevertheless, the low error rate achieved by the mixture of suffix tree transducers and the efficiency of its learning algorithm suggest that this approach might be a powerful tool for language processing. One of our future research goals is to investigate methods for incorporating linguistic knowledge through the set of priors.

Appendix

Proof of Theorem 1. Let $s$ be any node in the full suffix tree $T$. To remind the reader, we denote by $1 \le i_1 < i_2 < \cdots < i_l = n$ the time steps at which the node $s$ was reached when observing a given sequence of $n$ input-output
pairs. Define

$$L_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T'), \qquad I_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{I}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T'), \qquad A_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{A}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T').$$
Note that since $\mathcal{A}_s = \mathcal{L}_s \cup \mathcal{I}_s$, then $A_s(n) = L_s(n) + I_s(n)$. Also, if $s$ was reached at time step $n$ ($i_l = n$), then

$$L_s(n) = \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T') = \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_{l-1}} \mid T')\, \gamma_s(y_n), \tag{A.1}$$

and therefore

$$\gamma_s(y_n) = L_s(n)/L_s(n-1). \tag{A.2}$$
For boundary conditions we define, $\forall s \in \mathrm{Leaves}(T)$: $r_n(s) \stackrel{\mathrm{def}}{=} \infty$; $\forall s$: $A_s(0) \stackrel{\mathrm{def}}{=} A_s(-1) \stackrel{\mathrm{def}}{=} 1$; and $\gamma_s(y_0) \stackrel{\mathrm{def}}{=} 1$. Let the depth of a node be the node's maximal distance from any of the leaves, and denote it by $d$. We now prove the following two properties of the algorithm by induction on $n$ and $d$.

Property A: For all nodes $s \in T$: $r_n(s) = \log(L_s(n)/I_s(n))$.

Property B: For any node $s \in T$ reached at time step $n$: $\tilde{\gamma}_s(y_n) = A_s(n)/A_s(n-1)$.

First note that if a node was not reached at time $n$, then $L_s(n) = L_s(n-1)$ and $I_s(n) = I_s(n-1)$, and property A holds based on the induction hypothesis for $n-1$. Second, if a node $s$ was reached at time step $n$, then the following also holds based on the induction hypothesis:

$$q_n(s) = \frac{1}{1 + e^{-r_n(s)}} = \frac{1}{1 + I_s(n)/L_s(n)} = \frac{L_s(n)}{L_s(n) + I_s(n)} = \frac{L_s(n)}{A_s(n)}. \tag{A.3}$$
We now show that the base of the induction is true. Property A holds by definition for $d = 0$ ($s \in \mathrm{Leaves}(T)$) and for all $n$. Since $\forall s \in \mathrm{Leaves}(T)$, $A_s(n) = L_s(n)$, then from equation A.2 we get that $\tilde{\gamma}_s(y_n) = \gamma_s(y_n) = L_s(n)/L_s(n-1) = A_s(n)/A_s(n-1)$, and property B holds as well. For $n = 0$ and for all $d$, property A holds from the definition of the prior
probabilities:

$$r_0(s) \stackrel{\mathrm{def}}{=} \log\!\left(\frac{\alpha}{1-\alpha}\right) = \log\!\left( \frac{\sum_{T' \in \mathcal{L}_s} P_0(T') \big/ \sum_{T' \in \mathcal{A}_s} P_0(T')}{1 - \sum_{T' \in \mathcal{L}_s} P_0(T') \big/ \sum_{T' \in \mathcal{A}_s} P_0(T')} \right) = \log\!\left(\frac{L_s(0)/A_s(0)}{1 - L_s(0)/A_s(0)}\right) = \log\!\left(\frac{L_s(0)}{A_s(0) - L_s(0)}\right) = \log\!\left(\frac{L_s(0)}{I_s(0)}\right). \tag{A.4}$$
Property B simply follows from the defined boundary conditions. Assume that the induction assumptions hold for $d' < d$, $n' \le n$ and for $d' \le d$, $n' < n$. Since property A holds for any node not reached at time $n$, we need to show the two properties only for nodes that are reached at time step $n$, namely, nodes of the form $s = x_j, \ldots, x_n$ where $j \le n$. Let $\sigma = x_{n-|s|}$; then, using the induction hypotheses for $\tilde{\gamma}$ at the node $\sigma s$ (of depth $d - 1 < d$) and for $r_{n-1}(s)$ (at time $n - 1 < n$), we get

$$r_n(s) = r_{n-1}(s) + \log\!\left(\frac{\gamma_s(y_n)}{\tilde{\gamma}_{\sigma s}(y_n)}\right) = \log\!\left( \frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{A_{\sigma s}(n-1)}{A_{\sigma s}(n)} \right). \tag{A.5}$$
The sets $\mathcal{A}_s$ and $\mathcal{I}_s$ consist of complete subtrees. Therefore, if $T' \in \mathcal{A}_{\sigma s}$, then $T' \in \mathcal{I}_s$, and $A_{\sigma s}(n) = I_s(n)$, which implies that

$$r_n(s) = \log\!\left( \frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{I_s(n-1)}{I_s(n)} \right) = \log(L_s(n)/I_s(n)). \tag{A.6}$$

Similarly, using the induction hypotheses for $q_{n-1}(s)$ and the node $\sigma s$,

$$\tilde{\gamma}_s(y_n) = q_{n-1}(s)\,\gamma_s(y_n) + (1 - q_{n-1}(s))\,\tilde{\gamma}_{\sigma s}(y_n) = \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{A_{\sigma s}(n)}{A_{\sigma s}(n-1)} = \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{I_s(n)}{I_s(n-1)} = \frac{L_s(n) + I_s(n)}{A_s(n-1)} = \frac{A_s(n)}{A_s(n-1)}. \tag{A.7}$$
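The recursions A.3, A.5, and A.7 translate directly into a per-node online update. The following is a hedged Python sketch of ours (class and function names are assumptions, not the paper's code) for a single node reached at time $n$:

```python
import math

class Node:
    """Per-node state: r_n(s) = log(L_s(n) / I_s(n))."""
    def __init__(self, alpha, is_leaf):
        # Boundary conditions: r = infinity at leaves; at internal nodes,
        # r_0(s) = log(alpha / (1 - alpha)), equation A.4.
        self.r = math.inf if is_leaf else math.log(alpha / (1.0 - alpha))
        self.is_leaf = is_leaf

def q(node):
    # Equation A.3: q_n(s) = 1 / (1 + exp(-r_n(s))).
    return 1.0 if math.isinf(node.r) else 1.0 / (1.0 + math.exp(-node.r))

def step(node, gamma_self, gamma_child_mix):
    """Return the node's mixed prediction on y_n (property B) and update r
    via equation A.5. gamma_self is the node's own estimate gamma_s(y_n);
    gamma_child_mix is the child sigma-s's mixed prediction (unused at
    leaves, where q = 1)."""
    qs = q(node)  # uses r_{n-1}(s), i.e., q_{n-1}(s), as in equation A.7
    if node.is_leaf:
        return gamma_self
    pred = qs * gamma_self + (1.0 - qs) * gamma_child_mix
    node.r += math.log(gamma_self / gamma_child_mix)  # equation A.5
    return pred
```

Evaluating `step` from the deepest reached node up to the root yields the root prediction $\tilde{\gamma}_\epsilon(y_n)$ of equation A.8.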
Hence, the induction hypotheses hold for all $n$ and $d$, and in particular,

$$\tilde{\gamma}_\epsilon(y_n) = \frac{A_\epsilon(n)}{A_\epsilon(n-1)} = \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_{i_1}, \ldots, y_{i_l} \mid T') \Big/ \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_{i_1}, \ldots, y_{i_{l-1}} \mid T'). \tag{A.8}$$

Since the root node is reached at all time steps, then $i_j = j$ and $y_{i_1}, \ldots, y_{i_l} = y_1, \ldots, y_n$. Thus $\tilde{\gamma}_\epsilon(y_n)$ is equal to

$$\tilde{\gamma}_\epsilon(y_n) = \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_1, \ldots, y_n \mid T') \Big/ \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_1, \ldots, y_{n-1} \mid T'). \tag{A.9}$$
Therefore, the theorem holds, since by definition $\mathcal{A}_\epsilon = \mathrm{Sub}(T)$.

Proof of Theorem 2. The proof of the first part of the theorem is based on a technique introduced by DeSantis et al. (1988). Based on the definition of $A_\epsilon(n)$ and $P_{n-1}(T')$ from theorem 1, we can rewrite $\mathrm{Loss}^{mix}_n$ as

$$\mathrm{Loss}^{mix}_n = -\sum_{i=1}^{n} \log_2\!\left( \sum_{T' \in \mathrm{Sub}(T)} P(y_i \mid T')\, P_{i-1}(T') \right) = -\sum_{i=1}^{n} \log_2\!\left( \frac{A_\epsilon(i)}{A_\epsilon(i-1)} \right) = -\log_2\!\left( \prod_{i=1}^{n} \frac{A_\epsilon(i)}{A_\epsilon(i-1)} \right). \tag{A.10}$$

Canceling the numerator of one term with the denominator of the next, we get

$$\mathrm{Loss}^{mix}_n = -\log_2\!\left( \frac{A_\epsilon(n)}{A_\epsilon(0)} \right) = -\log_2\!\left( \frac{A_\epsilon(n)}{\sum_{T' \in \mathrm{Sub}(T)} P_0(T')} \right) = -\log_2\!\left( \sum_{T' \in \mathrm{Sub}(T)} P_0(T')\, P(y_1, \ldots, y_n \mid T') \right) \le -\log_2\!\left( \max_{T' \in \mathrm{Sub}(T)} P_0(T')\, P(y_1, \ldots, y_n \mid T') \right) = \min_{T' \in \mathrm{Sub}(T)} \left\{ \mathrm{Loss}_n(T') - \log_2(P_0(T')) \right\}. \tag{A.11}$$

This completes the proof of the first part of the theorem. At time step $i$, we traverse the suffix tree $T$ until a leaf is reached (at most $D$ nodes are visited) or the beginning of the input sequence is encountered (at most $i$ nodes are visited). Hence, the running time for a sequence of length $n$ is $Dn$ for a bounded-depth suffix tree, or $\sum_{i=1}^{n} i \le \frac{1}{2}(n+1)^2$ for an unbounded-depth suffix tree.

Acknowledgments

Thanks to Y. Bengio, L. Lee, F. Pereira, and M. Shaw for helpful discussions. I also thank the anonymous reviewers for their helpful comments.
References

Bengio, Y., & Frasconi, P. (1994). An input/output HMM architecture. In Advances in neural information processing systems 7 (pp. 427–434). Cambridge, MA: MIT Press.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1993). How to use expert advice. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing (pp. 382–391).

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing (pp. 136–143).

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

DeSantis, A., Markowski, C., & Wegman, M. N. (1988). Learning probabilistic prediction functions. In Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 312–328).

Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.

Haussler, D., & Barron, A. (1993). How well do Bayes methods work for on-line prediction of {+1, −1} values? In The 3rd NEC Symposium on Computation and Cognition.

Helmbold, D. P., & Schapire, R. E. (1995). Predicting nearly as well as the best pruning of a decision tree. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 61–68).

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3, 79–87.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.

Krichevsky, R. E., & Trofimov, V. K. (1981). The performance of universal coding. IEEE Transactions on Information Theory, 27, 199–207.

Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Riley, M. D. (1991). A statistical model for generating pronunciation networks. In Proc. of IEEE Conf. on Acoustics, Speech, and Signal Processing (pp. 737–740).

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1995). The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3), 653–664.

Received September 12, 1996; accepted March 14, 1997.
Communicated by Ronald Williams
Long Short-Term Memory

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany

Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

1 Introduction

In principle, recurrent networks can use their feedback connections to store representations of recent input events in the form of activations (short-term memory, as opposed to long-term memory embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (Mozer, 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backpropagation in feedforward nets with limited time windows. This article reviews an analysis of the problem and suggests a remedy.

Neural Computation 9, 1735–1780 (1997)
© 1997 Massachusetts Institute of Technology
The problem. With conventional backpropagation through time (BPTT; Williams & Zipser, 1992; Werbos, 1988) or real-time recurrent learning (RTRL; Robinson & Fallside, 1987), error signals flowing backward in time tend to (1) blow up or (2) vanish; the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter, 1991). Case 1 may lead to oscillating weights; in case 2, learning to bridge long time lags takes a prohibitive amount of time or does not work at all (see section 3).

This article presents long short-term memory (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error backflow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short-time-lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus, neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Section 2 briefly reviews previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It then introduces a naive approach to constant error backpropagation for didactic purposes and highlights its problems concerning information storage and retrieval. These problems lead to the LSTM architecture described in section 4. Section 5 presents numerous experiments and comparisons with competing methods. LSTM outperforms them and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 discusses LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1) and explicit error flow formulas (A.2).

2 Previous Work

This section focuses on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixed-point-based gradient calculations; e.g., Almeida, 1987; Pineda, 1987).

2.1 Gradient-Descent Variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see sections 1 and 3).

2.2 Time Delays. Other methods that seem practical for short time lags only are time-delay neural networks (Lang, Waibel, & Hinton, 1990) and Plate's method (Plate, 1993), which updates unit activations based on a
weighted sum of old activations (see also de Vries & Principe, 1991). Lin et al. (1996) propose variants of time-delay networks called NARX networks.

2.3 Time Constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's 1991 approach may in fact be viewed as a mixture of time-delay neural networks and time constants). For long time lags, however, the time constants need external fine tuning (Mozer, 1992). Sun, Chen, and Lee's alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

2.4 Ring's Approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

2.5 Bengio et al.'s Approach. Bengio, Simard, and Frasconi (1994) investigate methods such as simulated annealing, multigrid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "two-sequence" problems are very similar to problem 3a in this article with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an expectation-maximization approach for propagating targets. With n so-called state networks, at a given time, their system can be in one of only n different states. (See also the beginning of section 5.) But to solve continuous problems such as the adding problem (section 5.4), their system would require an unacceptable number of states (i.e., state networks).

2.6 Kalman Filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman filter-trained recurrent networks will be useful for very long minimal time lags.

2.7 Second Order Nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though. For instance, Watrous and Kuhn (1992) use MUs in second-order nets. There are some differences from LSTM: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long-time-lag problems; (2) it has fully connected second-order sigma-pi units, while the LSTM architecture's MUs
are used only to gate access to constant error flow; and (3) Watrous and Kuhn's algorithm costs $O(W^2)$ operations per time step, ours only $O(W)$, where $W$ is the number of weights. See also Miller and Giles (1993) for additional work on MUs.

2.8 Simple Weight Guessing. To avoid the long-time-lag problems of gradient-based approaches, we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber & Hochreiter, 1996; Hochreiter & Schmidhuber, 1996, 1997) that simple weight guessing solves many of the problems in Bengio et al. (1994), Bengio and Frasconi (1994), Miller and Giles (1993), and Lin et al. (1996) faster than the algorithms these authors proposed. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.

2.9 Adaptive Sequence Chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer, 1992). For instance, in his postdoctoral thesis, Schmidhuber (1993) uses hierarchical recurrent nets to solve rapidly certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 Constant Error Backpropagation

3.1 Exponentially Decaying Error

3.1.1 Conventional BPTT (e.g., Williams & Zipser, 1992). Output unit $k$'s target at time $t$ is denoted by $d_k(t)$. Using mean squared error, $k$'s error signal is

$$\vartheta_k(t) = f_k'(\mathrm{net}_k(t))\,(d_k(t) - y^k(t)),$$

where $y^i(t) = f_i(\mathrm{net}_i(t))$ is the activation of a noninput unit $i$ with differentiable activation function $f_i$,

$$\mathrm{net}_i(t) = \sum_j w_{ij}\, y^j(t-1)$$
is unit i’s current net input, and wij is the weight on the connection from unit j to i. Some nonoutput unit j’s backpropagated error signal is X wij ϑi (t + 1). ϑj (t) = fj0 (netj (t)) i
The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\,\vartheta_j(t)\,y^l(t-1)$, where $\alpha$ is the learning rate and $l$ stands for an arbitrary unit connected to unit $j$.

3.1.2 Outline of Hochreiter's Analysis (1991, pp. 19–21). Suppose we have a fully connected net whose noninput unit indices range from 1 to $n$. Let us focus on local error flow from unit $u$ to unit $v$ (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit $u$ at time step $t$ is propagated back into time for $q$ time steps, to an arbitrary unit $v$. This will scale the error by the following factor:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \begin{cases} f_v'(\mathrm{net}_v(t-1))\, w_{uv} & q = 1 \\[4pt] f_v'(\mathrm{net}_v(t-q)) \sum_{l=1}^{n} \dfrac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1 \end{cases}. \tag{3.1}$$
With $l_q = v$ and $l_0 = u$, we obtain

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \prod_{m=1}^{q} f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}} \tag{3.2}$$

(proof by induction). The sum of the $n^{q-1}$ terms $\prod_{m=1}^{q} f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}$ determines the total error backflow (note that since the summation terms may have different signs, increasing the number of units $n$ does not necessarily increase error flow).

3.1.3 Intuitive Explanation of Equation 3.2. If

$$|f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| > 1.0$$

for all $m$ (as can happen, e.g., with linear $f_{l_m}$), then the largest product increases exponentially with $q$. That is, the error blows up, and conflicting error signals arriving at unit $v$ can lead to oscillating weights and unstable learning (for error blowups or bifurcations, see also Pineda, 1988; Baldi & Pineda, 1991; Doya, 1992). On the other hand, if

$$|f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| < 1.0$$

for all $m$, then the largest product decreases exponentially with $q$. That is, the error vanishes, and nothing can be learned in acceptable time.
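The compounding factor in equation 3.2 is easy to see numerically. The following is a small illustration of ours (not code from the paper), assuming logistic units and identical weights and net inputs along one path:

```python
import math

def logistic_prime(net):
    y = 1.0 / (1.0 + math.exp(-net))
    return y * (1.0 - y)  # maximal value 0.25 at net = 0

def error_scaling(w, net, q):
    """Magnitude of a single length-q path's contribution in equation 3.2,
    with the same weight w and net input at every step."""
    return abs(logistic_prime(net) * w) ** q

for w in (0.5, 3.9, 4.5):
    print(w, [round(error_scaling(w, 0.0, q), 6) for q in (1, 10, 50)])
# With |w| < 4.0 the per-step factor |f' * w| stays below 1 (since f' <= 0.25),
# so the backflow decays exponentially in q; for |w| > 4.0 it can blow up.
```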
If $f_{l_m}$ is the logistic sigmoid function, then the maximal value of $f_{l_m}'$ is 0.25. If $y^{l_{m-1}}$ is constant and not equal to zero, then $|f_{l_m}'(\mathrm{net}_{l_m})\, w_{l_m l_{m-1}}|$ takes on maximal values where

$$w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\!\left(\frac{1}{2}\, \mathrm{net}_{l_m}\right),$$

goes to zero for $|w_{l_m l_{m-1}}| \to \infty$, and is less than 1.0 for $|w_{l_m l_{m-1}}| < 4.0$ (e.g., if the absolute maximal weight value $w_{max}$ is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general, the use of larger initial weights will not help, though; as seen above, for $|w_{l_m l_{m-1}}| \to \infty$ the relevant derivative goes to zero "faster" than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either; it will not change the ratio of long-range error flow to short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al., 1994.)

3.1.4 Global Error Flow. The local error flow analysis above immediately shows that global error flow vanishes too. To see this, compute

$$\sum_{u:\ u\ \text{output unit}} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)}.$$

3.1.5 Weak Upper Bound for Scaling Factor. The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, equation 3.2 can be rewritten as

$$(W_u^T)^T F'(t-1) \prod_{m=2}^{q-1} \left( W F'(t-m) \right) W_v\, f_v'(\mathrm{net}_v(t-q)),$$

where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_u^T$ is defined by $[W_u^T]_i := [W]_{ui} = w_{ui}$, and, for $m = 1, \ldots, q$, $F'(t-m)$ is the diagonal matrix of first-order derivatives defined as $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f_i'(\mathrm{net}_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$th column and $j$th row of matrix $A$, and $[x]_i$ is the $i$th component of vector $x$. Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define $f'_{max} := \max_{m=1,\ldots,q} \{\|F'(t-m)\|_A\}$. For $\max_{i=1,\ldots,n}\{|x_i|\} \le \|x\|_x$ we get $|x^T y| \le n\, \|x\|_x \|y\|_x$. Since

$$|f_v'(\mathrm{net}_v(t-q))| \le \|F'(t-q)\|_A \le f'_{max},$$
For maxi=1,...,n {|xi |} ≤ kxkx we get |xT y| ≤ n kxkx kykx . Since 0 | fv0 (netv (t − q))| ≤ kF0 (t − q)kA ≤ fmax ,
Long Short-Term Memory
1741
we obtain the following inequality: ¯ ¯ ¯ ∂ϑv (t − q) ¯ ¡ 0 ¢q q−2 0 q ¯ ¯ ≤ n fmax kWkA . ¯ ∂ϑ (t) ¯ ≤ n ( fmax ) kWv kx kWuT kx kWkA u This inequality results from kWv kx = kWev kx ≤ kWkA kev kx ≤ kWkA and kWuT kx = kW T eu kx ≤ kWkA keu kx ≤ kWkA , where ek is the unit vector whose components are 0 except for the kth component, which is 1. Note that this is a weak, extreme case upper bound; it will be reached only if all kF0 (t − m)kA take on maximal values, and if the contributions of all paths across which error flows back from unit u to unit v have the same sign. Large kWkA , however, typically result in small values of kF0 (t − m)kA , as confirmed by experiments (see, e.g., Hochreiter, 1991). For example, with norms X |wrs | kWkA := max r
s
and kxkx := max |xr |, r
0 we have fmax = 0.25 for the logistic sigmoid. We observe that if
|wij | ≤ wmax <
4.0 ∀i, j, n
then ¡kWkA¢ ≤ nwmax < 4.0 will result in exponential decay. By setting max < 1.0, we obtain τ := nw4.0 ¯ ¯ ¯ ∂ϑv (t − q) ¯ q ¯ ¯ ¯ ∂ϑ (t) ¯ ≤ n (τ ) . u We refer to Hochreiter (1991) for additional results. 3.2 Constant Error Flow: Naive Approach. 3.2.1 A Single Unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j’s local error backflow is ϑj (t) = fj0 (netj (t))ϑj (t + 1)wjj . To enforce constant error flow through j, we require fj0 (netj (t))wjj = 1.0.
1742
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Note the similarity to Mozer’s fixed time constant system (1992)—a time constant of 1.0 is appropriate for potentially infinite time lags.1 3.2.2 The Constant Error Carousel. Integrating the differential equation above, we obtain fj (netj (t)) =
netj (t) wjj
for arbitrary netj (t). This means fj has to be linear, and unit j’s activation has to remain constant: yj (t + 1) = fj (netj (t + 1)) = fj (wjj y j (t)) = y j (t). In the experiments, this will be ensured by using the identity function fj : fj (x) = x, ∀x, and by setting wjj = 1.0. We refer to this as the constant error carousel (CEC). CEC will be LSTM’s central feature (see section 4). Of course, unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches): 1. Input weight conflict: For simplicity, let us focus on a single additional input weight wji . Assume that the total error can be reduced by switching on unit j in response to a certain input and keeping it active for a long time (until it helps to compute a desired output). Provided i is nonzero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, wji will often receive conflicting weight update signals during this time (recall that j is linear). These signals will attempt to make wji participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling write operations through input weights. 2. Output weight conflict: Assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight wkj . The same wkj has to be used for both retrieving j’s content at certain times and preventing j from disturbing k at other times. As long as unit j is nonzero, wkj will attract conflicting weight update signals generated during sequence processing. These signals will attempt to make wkj participate in accessing the information stored in j and—at different times—protecting unit k from being perturbed by j. For instance, with many tasks there are certain shorttime-lag errors that can be reduced in early training stages. However, 1 We do not use the expression “time constant” in the differential sense, as Pearlmutter (1995) does.
at later training stages, $j$ may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult long-time-lag errors. Again, this conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling read operations through output weights.

Of course, input and output weight conflicts are not specific to long time lags; they occur for short time lags as well. Their effects, however, become particularly pronounced in the long-time-lag case. As the time lag increases, stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, more and more already correct outputs also require protection against perturbation.

Due to the problems set out, the naive approach does not work well except in the case of certain simple problems involving local input-output representations and nonrepeating input patterns (see Hochreiter, 1991; Silva, Amarel, Langlois, & Almeida, 1996). The next section shows how to do it right.

4 The Concept of Long Short-Term Memory

4.1 Memory Cells and Gate Units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the CEC embodied by the self-connected, linear unit $j$ from section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs, and a multiplicative output gate unit is introduced to protect other units from perturbation by currently irrelevant memory contents stored in $j$. The resulting, more complex unit is called a memory cell (see Figure 1). The $j$th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $\mathrm{net}_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the output gate) and from another multiplicative unit $in_j$ (the input gate). $in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have

$$y^{out_j}(t) = f_{out_j}(\mathrm{net}_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(\mathrm{net}_{in_j}(t));$$

where

$$\mathrm{net}_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \quad \text{and} \quad \mathrm{net}_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1).$$
Figure 1: Architecture of memory cell cj (the box) and its gate units inj , outj . The self-recurrent connection (with weight 1.0) indicates feedback with a delay of one time step. It builds the basis of the CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
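For concreteness, here is a minimal sketch (ours, not the authors' implementation) of one memory cell's forward step, combining the gate equations above with the internal-state update given just below in the text; the squashing ranges follow those used later in experiment 1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def memory_cell_step(s_prev, y_prev, w_in, w_out, w_c, g, h):
    """One time step of a single memory cell: gate activations, CEC state
    update with fixed self-weight 1.0, and gated output. y_prev is the list
    of source-unit activations y^u(t-1); w_in, w_out, w_c are the matching
    weight rows of the input gate, output gate, and cell input."""
    net_in = sum(w * y for w, y in zip(w_in, y_prev))
    net_out = sum(w * y for w, y in zip(w_out, y_prev))
    net_c = sum(w * y for w, y in zip(w_c, y_prev))
    y_in, y_out = sigmoid(net_in), sigmoid(net_out)
    s = s_prev + y_in * g(net_c)   # CEC: s_c(t) = s_c(t-1) + y^in(t) g(net_c(t))
    return s, y_out * h(s)         # cell output: y_c(t) = y^out(t) h(s_c(t))

# Squashing functions with the ranges used in experiment 1:
g = lambda x: 4.0 * sigmoid(x) - 2.0   # range [-2, 2]
h = lambda x: 2.0 * sigmoid(x) - 1.0   # range [-1, 1]
```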
We also have

$$\mathrm{net}_{c_j}(t) = \sum_u w_{c_j u}\, y^u(t-1).$$

The summation indices $u$ may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see section 4.3). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like $w_{c_j c_j}$. It is up to the user to define the network topology. See Figure 2 for an example.

At time $t$, $c_j$'s output $y^{c_j}(t)$ is computed as

$$y^{c_j}(t) = y^{out_j}(t)\, h(s_{c_j}(t)),$$

where the internal state $s_{c_j}(t)$ is

$$s_{c_j}(0) = 0, \qquad s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g\!\left(\mathrm{net}_{c_j}(t)\right) \quad \text{for } t > 0.$$

The differentiable function $g$ squashes $\mathrm{net}_{c_j}$; the differentiable function $h$ scales memory cell outputs computed from the internal state $s_{c_j}$.

Figure 2: Example of a net with eight input units, four output units, and two memory cell blocks of size 2. in1 marks the input gate, out1 marks the output gate, and cell1/block1 marks the first memory cell of block 1. cell1/block1's architecture is identical to the one in Figure 1, with gate units in1 and out1 (note that by rotating Figure 1 by 90 degrees anticlockwise, it will match the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell sees all nonoutput units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units and through fixed self-connections within cell blocks (not shown here; see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That is why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See the text and the appendix for details. Figure 2 shows the architecture used for experiment 6a; only the bias of the noninput units is omitted.

4.2 Why Gate Units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit $j$'s output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1). Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which
errors to trap in its CEC by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Both gate types are not always necessary, though; one may be sufficient. For instance, in experiments 2a and 2b in section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long-time-lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short-time-lag memories. (This will prove quite useful in experiment 1, for instance.)

4.3 Network Topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain conventional hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; see experiments 2a and 2b).

4.4 Memory Cell Blocks. $S$ memory cells sharing the same input gate and the same output gate form a structure called a memory cell block of size $S$. Memory cell blocks facilitate information storage. As with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely, two), the block architecture can be even slightly more efficient. A memory cell block of size 1 is just a simple memory cell. In the experiments in section 5, we will use memory cell blocks of various sizes.

4.5 Learning. We use a variant of RTRL (e.g., Robinson & Fallside, 1987) that takes into account the altered, multiplicative dynamics caused by input and output gates. To ensure nondecaying error backpropagation through internal states of memory cells, as with truncated BPTT (e.g., Williams & Peng, 1990), errors arriving at memory cell net inputs (for cell $c_j$, this includes $\mathrm{net}_{c_j}$, $\mathrm{net}_{in_j}$, $\mathrm{net}_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$.² To visualize
Long Short-Term Memory
1747
this, once an error signal arrives at a memory cell output, it gets scaled by output gate activation and h0 . Then it is within the memory cell’s CEC, where it can flow back indefinitely without ever being scaled. When it leaves the memory cell through the input gate and g, it is scaled once more by input gate activation and g0 . It then serves to change the incoming weights before it is truncated (see the appendix for formulas). 4.6 Computational Complexity. As with Mozer’s focused recurrent backpropagation algorithm (Mozer, 1989), only the derivatives ∂scj /∂wil need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W the number of weights (see details in the appendix). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL’s is much worse). Unlike full BPTT, however, LSTM is local in space and time:3 there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size. 4.7 Abuse Problem and Solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, for example, as bias cells (it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is that it may take a long time to release abused memory cells and make them available for further learning. A similar “abuse problem” appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) sequential network construction (e.g., Fahlman, 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see experiment 2 in section 5), and (2) output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations toward zero. Memory cells with more negative bias automatically get “allocated” later (see experiments 1, 3, 4, 5, and 6 in section 5). 4.8 Internal State Drift and Remedies. If memory cell cj ’s inputs are mostly positive or mostly negative, then its internal state sj will tend to drift away over time. This is potentially dangerous, for the h0 (sj ) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple 3 Following Schmidhuber (1989), we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space. BPTT is local in space but not in time.
1748
Sepp Hochreiter and Jurgen ¨ Schmidhuber
but effective way of solving drift problems at the beginning of learning is initially to bias the input gate inj toward zero. Although there is a trade-off between the magnitudes of h0 (sj ) on the one hand and of yinj and fin0 j on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by experiments 4 and 5 in section 5.4. 5 Experiments Which tasks are appropriate to demonstrate the quality of a novel longtime-lag algorithm? First, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences (see, e.g., Pollack, 1991). But a real long-time-lag problem does not have any short-time-lag exemplars in the training set. For instance, Elman’s training procedure, BPTT, offline RTRL, online RTRL, and others fail miserably on real long-time-lag problems. (See, e.g., Hochreiter, 1991; Mozer, 1992.) A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing. Recently we discovered (Schmidhuber & Hochreiter, 1996; Hochreiter & Schmidhuber, 1996, 1997) that many long-time-lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi’s parity problem (1994) much faster4 than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). The same is true for some of Miller and Giles’s problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms. All our experiments (except experiment 1) involve long minimal time lags; there are no short-time-lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters and inputs or high weight precision, such that random weight guessing becomes infeasible. We always use online learning (as opposed to batch learning) and logistic sigmoids as activation functions. For experiments 1 and 2, initial weights are chosen in the range [−0.2, 0.2], for the other experiments in [−0.1, 0.1]. Training sequences are generated randomly according to the various task 4 Different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).
Long Short-Term Memory
1749
descriptions. In slight deviation from the notation in appendix A.1, each discrete time step of each input sequence involves three processing steps: (1) use current input to set the input units, (2) compute activations of hidden units (including input gates, output gates, memory cells), and (3) compute output unit activations. Except for experiments 1, 2a, and 2b, sequence elements are randomly generated online, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence. For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams & Peng, 1990) computes exactly the same gradient as offline RTRL. With long-time-lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter, 1991; see also Mozer, 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay. Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be to: start with a very small net consisting of one memory cell. If this does not work, try two cells, and so on. Alternatively, use sequential network construction (e.g., Fahlman, 1991). Following is an outline of the experiments: • Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long-time-lag problem. We include it because it provides a nice example where LSTM’s output gates are truly beneficial, and it is a popular benchmark for recurrent nets that has been used by many authors. We want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar’s minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our article, however, are those that RTRL, BPTT, and others cannot solve at all. • Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (task 2c) involves hundreds of distractor symbols at random positions and minimal time lags of 1000 steps. LSTM solves it; BPTT and RTRL already fail in case of 10-step minimal time lags (see also Hochreiter, 1991; Mozer, 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.
• Experiment 3 addresses long-time-lag problems with noise and signal on the same input line. Experiments 3a and 3b focus on Bengio et al.'s 1994 two-sequence problem. Because this problem can be solved quickly by random weight guessing, we also include a far more difficult two-sequence problem (experiment 3c), which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

• Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

• Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Section 5.7 provides a detailed summary of experimental conditions in two tables for reference.

5.1 Experiment 1: Embedded Reber Grammar.

5.1.1 Task. Our first task is to learn the embedded Reber grammar (Smith & Zipser, 1989; Cleeremans, Servan-Schreiber, & McClelland, 1989; Fahlman, 1991). Since it allows for training sequences with short time lags (of as few as nine steps), it is not a long-time-lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors, and we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

Starting at the left-most node of the directed graph in Figure 3, symbol strings are generated sequentially (beginning with the empty string) by following edges—and appending the associated symbols to the current string—until the right-most node is reached (strings of the embedded Reber grammar are analogously generated from Figure 4, where each box is a copy of the Reber grammar). Edges are chosen randomly if there is a choice (probability: 0.5). The net's task is to read strings, one symbol at a time, and to predict the next symbol (error signals occur at every time step). To predict the symbol before last, the net has to remember the second symbol.

5.1.2 Comparison. We compare LSTM to Elman nets trained by Elman's training procedure (ELM) (results taken from Cleeremans et al., 1989), Fahlman's recurrent cascade-correlation (RCC) (results taken from Fahlman, 1991), and RTRL (results taken from Smith & Zipser, 1989, where only the few successful trials are listed). Smith and Zipser actually make the task easier by increasing the probability of short-time-lag exemplars. We did not do this for LSTM.
Figure 3: Transition diagram for the Reber grammar.
Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).
5.1.3 Training/Testing. We use a local input-output representation (seven input units, seven output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly—that is, if the output unit(s) corresponding to the possible next symbol(s) is (are) always the most active one(s).

5.1.4 Architectures. Architectures for RTRL, ELM, and RCC are reported in the references listed above. For LSTM, we use three (four) memory cell blocks. Each block has two (one) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least three input signals (architectures 3-2 and 4-1 are employed to obtain comparable numbers of weights for both architectures: 264 for 4-1 and 276 for 3-2). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2]. All weights are initialized in [−0.2, 0.2], except for the output gate biases, which are initialized to −1, −2, and −3, respectively (see abuse problem, solution 2 of section 4). We tried learning rates of 0.1, 0.2, and 0.5.

5.1.5 Results. We use three different, randomly generated pairs of training and test sets. With each such pair we run 10 trials with different initial weights. See Table 1 for results (mean of 30 trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

5.1.6 Importance of Output Gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.
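For readers who wish to reproduce the data, a minimal Python sketch of the string generator is given below. The transition table encodes our reading of the standard Reber grammar diagrams (Figures 3 and 4); all names are our own, and edges are chosen with probability 0.5 wherever there is a choice, as stated above.

```python
import random

# Our reading of the standard Reber grammar (Figure 3):
# node -> list of (emitted symbol, next node); node 6 is terminal.
REBER_EDGES = {
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
}

def reber_string():
    """Random walk from the left-most to the right-most node."""
    node, symbols = 1, ["B"]
    while node != 6:
        symbol, node = random.choice(REBER_EDGES[node])  # probability 0.5 per edge
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)

def embedded_reber_string():
    """Embedded grammar (Figure 4): B, then T or P, then a full Reber
    string, then the same T or P again, then E."""
    first = random.choice("TP")
    return "B" + first + reber_string() + first + "E"
```

Note that the second symbol (T or P) must be remembered across the entire embedded substring; this is exactly the dependency that the output gates help to protect during learning.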
Table 1: Experiment 1: Embedded Reber Grammar.
Method | Hidden Units | Number of Weights | Learning Rate | % of Success | After
RTRL | 3 | ≈ 170 | 0.05 | some fraction | 173,000
RTRL | 12 | ≈ 494 | 0.1 | some fraction | 25,000
ELM | 15 | ≈ 435 | — | 0 | > 200,000
RCC | 7–9 | ≈ 119–198 | — | 50 | 182,000
LSTM | 4 blocks, size 1 | 264 | 0.1 | 100 | 39,740
LSTM | 3 blocks, size 2 | 276 | 0.1 | 100 | 21,730
LSTM | 3 blocks, size 2 | 276 | 0.2 | 97 | 14,060
LSTM | 4 blocks, size 1 | 264 | 0.5 | 97 | 9,500
LSTM | 3 blocks, size 2 | 276 | 0.5 | 100 | 8,440
Notes: Percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith & Zipser, 1989), Elman net trained by Elman's procedure (results taken from Cleeremans et al., 1989), recurrent cascade-correlation (results taken from Fahlman, 1991), and our new approach (LSTM). Weight numbers in the first four rows are estimates; the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only 2 failures out of 150 trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster (the number of required training examples in the bottom row varies between 3800 and 24,100).
5.2 Experiment 2: Noise-Free and Noisy Sequences.

5.2.1 Task 2a: Noise-Free Sequences with Long Time Lags. There are p + 1 possible input symbols denoted $a_1, \dots, a_{p-1}, a_p = x, a_{p+1} = y$. $a_i$ is locally represented by the $(p+1)$-dimensional vector whose $i$th component is 1 (all other components are 0). A net with p + 1 input units and p + 1 output units sequentially observes input symbol sequences, one at a time, continually trying to predict the next symbol; error signals occur at every time step. To emphasize the long-time-lag problem, we use a training set consisting of only two very similar sequences: $(y, a_1, a_2, \dots, a_{p-1}, y)$ and $(x, a_1, a_2, \dots, a_{p-1}, x)$. Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for p time steps.

We compare real-time recurrent learning for fully recurrent nets (RTRL), backpropagation through time (BPTT), the sometimes very successful two-net neural sequence chunker (CH; Schmidhuber, 1992b), and our new method (LSTM). In all cases, weights are initialized in [−0.2, 0.2]. Due to limited computation time, training is stopped after 5 million sequence presentations. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.
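For illustration, the following Python sketch generates one training sequence for task 2a. The index convention is our own assumption: symbol $a_i$ maps to one-hot index i − 1, so x gets index p − 1 and y gets index p.

```python
import random

def one_hot(i, dim):
    """Local representation: dim-dimensional vector with a single 1."""
    v = [0.0] * dim
    v[i] = 1.0
    return v

def task_2a_sequence(p):
    """Returns (y, a_1, ..., a_{p-1}, y) or (x, a_1, ..., a_{p-1}, x),
    each with probability 0.5, as (p + 1)-dimensional one-hot vectors."""
    lead = random.choice([p - 1, p])                     # x or y
    middle = [one_hot(i, p + 1) for i in range(p - 1)]   # a_1, ..., a_{p-1}
    return [one_hot(lead, p + 1)] + middle + [one_hot(lead, p + 1)]
```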
Table 2: Task 2a: Percentage of Successful Trials and Number of Training Sequences until Success.
Method | Delay p | Learning Rate | Number of Weights | % Successful Trials | Success After
RTRL | 4 | 1.0 | 36 | 78 | 1,043,000
RTRL | 4 | 4.0 | 36 | 56 | 892,000
RTRL | 4 | 10.0 | 36 | 22 | 254,000
RTRL | 10 | 1.0–10.0 | 144 | 0 | > 5,000,000
RTRL | 100 | 1.0–10.0 | 10,404 | 0 | > 5,000,000
BPTT | 100 | 1.0–10.0 | 10,404 | 0 | > 5,000,000
CH | 100 | 1.0 | 10,506 | 33 | 32,400
LSTM | 100 | 1.0 | 10,504 | 100 | 5,040
Notes: Table entries refer to means of 18 trials. With 100 time-step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
Architectures. RTRL: one self-recurrent hidden unit, p + 1 nonrecurrent output units. Each layer has connections from all layers below. All units use the logistic sigmoid activation function with range [0, 1]. BPTT: same architecture as the one trained by RTRL. CH: both net architectures are like RTRL's, but one net has an additional output for predicting the hidden unit of the other one (see Schmidhuber, 1992b, for details). LSTM: as with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). $g$ is the logistic sigmoid, and $h$ is the identity function ($h(x) = x$ for all $x$). Memory cell and input gate are added once the error has stopped decreasing (see abuse problem: solution 1 in section 4).

Results. Using RTRL and a short four-time-step delay (p = 4), 7/9 of all trials were successful. No trial was successful with p = 10. With long time lags, only the neural sequence chunker and LSTM achieved successful trials; BPTT and RTRL failed. With p = 100, the two-net sequence chunker solved the task in only one-third of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber, 1992c, 1993).

5.2.2 Task 2b: No Local Regularities. With task 2a, the chunker sometimes learns to predict the final element correctly, but only because of predictable local regularities in the input stream that allow for compressing the sequence.
In a more difficult task, involving many more different possible sequences, we remove compressibility by replacing the deterministic subsequence $(a_1, a_2, \dots, a_{p-1})$ by a random subsequence (of length p − 1) over the alphabet $a_1, a_2, \dots, a_{p-1}$. We obtain two classes (two sets of sequences) $\{(y, a_{i_1}, a_{i_2}, \dots, a_{i_{p-1}}, y) \mid 1 \le i_1, i_2, \dots, i_{p-1} \le p-1\}$ and $\{(x, a_{i_1}, a_{i_2}, \dots, a_{i_{p-1}}, x) \mid 1 \le i_1, i_2, \dots, i_{p-1} \le p-1\}$. Again, every next sequence element has to be predicted. The only totally predictable targets, however, are x and y, which occur at sequence ends. Training exemplars are chosen randomly from the two classes. Architectures and parameters are the same as in experiment 2a. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.

Results. As expected, the chunker failed to solve this task (so did BPTT and RTRL, of course). LSTM, however, was always successful. On average (mean of 18 trials), success for p = 100 was achieved after 5680 sequence presentations. This demonstrates that LSTM does not require sequence regularities to work well.

5.2.3 Task 2c: Very Long Time Lags—No Local Regularities. This is the most difficult task in this subsection. To our knowledge, no other recurrent net algorithm can solve it. Now there are p + 4 possible input symbols denoted $a_1, \dots, a_{p-1}, a_p, a_{p+1} = e, a_{p+2} = b, a_{p+3} = x, a_{p+4} = y$. $a_1, \dots, a_p$ are also called distractor symbols. Again, $a_i$ is locally represented by the $(p+4)$-dimensional vector whose $i$th component is 1 (all other components are 0). A net with p + 4 input units and 2 output units sequentially observes input symbol sequences, one at a time. Training sequences are randomly chosen from the union of two very similar subsets of sequences: $\{(b, y, a_{i_1}, a_{i_2}, \dots, a_{i_{q+k}}, e, y) \mid 1 \le i_1, i_2, \dots, i_{q+k} \le p\}$ and $\{(b, x, a_{i_1}, a_{i_2}, \dots, a_{i_{q+k}}, e, x) \mid 1 \le i_1, i_2, \dots, i_{q+k} \le p\}$. To produce a training sequence, we randomly generate a sequence prefix of length q + 2, then repeatedly generate a further random element (≠ b, e, x, y) with probability 9/10 or, alternatively, an e with probability 1/10. In the latter case, we conclude the sequence with x or y, depending on the second element. For a given k, this leads to a uniform distribution on the possible sequences with length q + k + 4. The minimal sequence length is q + 4; since the number k of additional elements has expectation $\sum_{k=0}^{\infty} \left(\frac{9}{10}\right)^k \frac{1}{10}\, k = 9$, the expected length is

$$4 + \sum_{k=0}^{\infty} \left(\frac{9}{10}\right)^k \frac{1}{10}\, (q + k) = q + 13.$$

The expected number of occurrences of element $a_i$, $1 \le i \le p$, in a sequence is $(q + 9)/p \approx q/p$. The goal is to predict the last symbol, which always occurs after the "trigger symbol" e. Error signals are generated only at sequence ends.
Table 3: Task 2c: LSTM with Very Long Minimal Time Lags q + 1 and a Lot of Noise.

q (Time Lag − 1) | p (Number of Random Inputs) | q/p | Number of Weights | Success After
50 | 50 | 1 | 364 | 30,000
100 | 100 | 1 | 664 | 31,000
200 | 200 | 1 | 1264 | 33,000
500 | 500 | 1 | 3064 | 38,000
1000 | 1000 | 1 | 6064 | 49,000
1000 | 500 | 2 | 3064 | 49,000
1000 | 200 | 5 | 1264 | 75,000
1000 | 100 | 10 | 664 | 135,000
1000 | 50 | 20 | 364 | 203,000
Notes: p is the number of available distractor symbols (p + 4 is the number of input units). q/p is the expected number of occurrences of a given distractor symbol in a sequence. The right-most column lists the number of training sequences required by LSTM (BPTT, RTRL, and the other competitors have no chance of solving this task). If we let the number of distractor symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. The lower block illustrates the expected slowdown due to increased frequency of distractor symbols.
To predict the final element, the net has to learn to store a representation of the second element for at least q + 1 time steps (until it sees the trigger symbol e). Success is defined as prediction error (for the final sequence element) of both output units always below 0.2, for 10,000 successive, randomly chosen input sequences.

Architecture/Learning. The net has p + 4 input units and 2 output units. Weights are initialized in [−0.2, 0.2]. To avoid too much learning time variance due to different weight initializations, the hidden layer gets two memory cells (two cell blocks of size 1, although one would be sufficient). There are no other hidden units. The output layer receives connections only from memory cells. Memory cells and gate units receive connections from input units, memory cells, and gate units (the hidden layer is fully connected). No bias weights are used. h and g are logistic sigmoids scaled to output ranges [−1, 1] and [−2, 2], respectively. The learning rate is 0.01. Note that the minimal time lag is q + 1; the net never sees short training sequences facilitating the classification of long test sequences.

Results. Twenty trials were made for all tested pairs (p, q). Table 3 lists the mean number of training sequences required by LSTM to achieve success (BPTT and RTRL have no chance of solving nontrivial tasks with minimal time lags of 1000 steps).
Scaling. Table 3 shows that if we let the number of input symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. This is another remarkable property of LSTM not shared by any other method we are aware of. Indeed, RTRL and BPTT are far from scaling reasonably; they appear to scale exponentially and become quite useless once the time lags exceed as few as 10 steps.

Distractor Influence. In Table 3, the column headed by q/p gives the expected frequency of distractor symbols. Increasing this frequency decreases learning speed, an effect due to weight oscillations caused by frequently observed input symbols.

5.3 Experiment 3: Noise and Signal on Same Channel. This experiment serves to illustrate that LSTM does not encounter fundamental problems if noise and signal are mixed on the same input line. We initially focus on Bengio et al.'s simple 1994 two-sequence problem. In experiment 3c we pose a more challenging two-sequence problem.

5.3.1 Task 3a (Two-Sequence Problem). The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a gaussian with mean zero and variance 0.2. Case N = 1: the first sequence element is 1.0 for class 1 and −1.0 for class 2. Case N = 3: the first three elements are 1.0 for class 1 and −1.0 for class 2. The target at the sequence end is 1.0 for class 1 and 0.0 for class 2. Correct classification is defined as absolute output error at sequence end below 0.2. Given a constant T, the sequence length is randomly selected between T and T + T/10 (a difference from Bengio et al.'s problem is that they also permit shorter sequences of length T/2).

Guessing. Bengio et al. (1994) and Bengio and Frasconi (1994) tested seven different methods on the two-sequence problem. We discovered, however, that random weight guessing easily outperforms them all because the problem is so simple.[5] See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein.

[5] However, different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).

LSTM Architecture. We use a three-layer net with one input unit, one output unit, and three cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells, and gate units and have bias weights.
Table 4: Task 3a: Bengio et al.’s Two-Sequence Problem.
T | N | Stop: ST1 | Stop: ST2 | Number of Weights | ST2: Fraction Misclassified
100 | 3 | 27,380 | 39,850 | 102 | 0.000195
100 | 1 | 58,370 | 64,330 | 102 | 0.000117
1000 | 3 | 446,850 | 452,460 | 102 | 0.000078
Notes: T is the minimal sequence length. N is the number of information-conveying elements at the beginning of the sequence. The column headed by ST1 (ST2) gives the number of sequence presentations required to achieve stopping criterion ST1 (ST2). The right-most column lists the fraction of misclassified posttraining sequences (with absolute error > 0.2) from a test set consisting of 2560 sequences (tested after ST2 was achieved). All values are means of 10 trials. We discovered, however, that this problem is so simple that random weight guessing solves it faster than LSTM and any other method for which there are published results.
Gate units and the output unit are logistic sigmoids in [0, 1], h is in [−1, 1], and g is in [−2, 2].

Training/Testing. All weights (except the bias weights to gate units) are randomly initialized in the range [−0.1, 0.1]. The first input gate bias is initialized with −1.0, the second with −3.0, and the third with −5.0. The first output gate bias is initialized with −2.0, the second with −4.0, and the third with −6.0. The precise initialization values hardly matter, though, as confirmed by additional experiments. The learning rate is 1.0. All activations are reset to zero at the beginning of a new sequence.

We stop training (and judge the task as being solved) according to the following criteria: ST1: none of 256 sequences from a randomly chosen test set is misclassified; ST2: ST1 is satisfied, and the mean absolute test set error is below 0.01. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 4. The results are means of 10 trials with different weight initializations in the range [−0.1, 0.1]. LSTM is able to solve this problem, though by far not as fast as random weight guessing (see "Guessing" above). Clearly this trivial problem does not provide a very good testbed for comparing the performance of various nontrivial algorithms. Still, it demonstrates that LSTM does not encounter fundamental problems when faced with signal and noise on the same channel.

5.3.2 Task 3b. The architecture, parameters, and other elements are as in task 3a, but now with gaussian noise (mean 0 and variance 0.2) added to the information-conveying elements ($t \le N$).
Table 5: Task 3b: Modified Two-Sequence Problem.
T | N | Stop: ST1 | Stop: ST2 | Number of Weights | ST2: Fraction Misclassified
100 | 3 | 41,740 | 43,250 | 102 | 0.00828
100 | 1 | 74,950 | 78,430 | 102 | 0.01500
1000 | 1 | 481,060 | 485,080 | 102 | 0.01207
Note: Same as in Table 4, but now the information-conveying elements are also perturbed by noise.
We stop training (and judge the task as being solved) according to the following, slightly redefined criteria: ST1: fewer than 6 out of 256 sequences from a randomly chosen test set are misclassified; ST2: ST1 is satisfied, and the mean absolute test set error is below 0.04. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 5. The results represent means of 10 trials with different weight initializations. LSTM easily solves the problem.

5.3.3 Task 3c. The architecture, parameters, and other elements are as in task 3a, but with a few essential changes that make the task nontrivial: the targets are 0.2 and 0.8 for class 1 and class 2, respectively, and there is gaussian noise on the targets (mean 0 and variance 0.1; S.D. 0.32). To minimize mean squared error, the system has to learn the conditional expectations of the targets given the inputs. Misclassification is defined as an absolute difference between output and noise-free target (0.2 for class 1 and 0.8 for class 2) exceeding 0.1. The network output is considered acceptable if the mean absolute difference between noise-free target and output is below 0.015. Since this requires high weight precision, task 3c (unlike tasks 3a and 3b) cannot be solved quickly by random guessing.

Training/Testing. The learning rate is 0.1. We stop training according to the following criterion: none of 256 sequences from a randomly chosen test set is misclassified, and the mean absolute difference between the noise-free target and the output is below 0.015. An additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 6. The results represent means of 10 trials with different weight initializations. Despite the noisy targets, LSTM still can solve the problem by learning the expected target values.
Table 6: Task 3c: Modified, More Challenging Two-Sequence Problem.
T | N | Stop | Number of Weights | Fraction Misclassified | Average Difference to Mean
100 | 3 | 269,650 | 102 | 0.00558 | 0.014
100 | 1 | 565,640 | 102 | 0.00441 | 0.012
Notes: Same as in Table 4, but with noisy real-valued targets. The system has to learn the conditional expectations of the targets given the inputs. The right-most column provides the average difference between network output and expected target. Unlike tasks 3a and 3b, this one cannot be solved quickly by random weight guessing.
5.4 Experiment 4: Adding Problem. The difficult task in this section is of a type that has never been solved by other recurrent net algorithms. It shows that LSTM can solve long-time-lag problems involving distributed, continuous-valued representations.

5.4.1 Task. Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval [−1, 1]; the second is 1.0, 0.0, or −1.0 and is used as a marker. At the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. Sequences have random lengths between the minimal sequence length T and T + T/10. In a given sequence, exactly two pairs are marked, as follows: we first randomly select and mark one of the first 10 pairs (whose first component we call X1). Then we randomly select and mark one of the first T/2 − 1 still unmarked pairs (whose first component we call X2). The second components of all remaining pairs are zero except for the first and final pair, whose second components are −1. (In the rare case where the first pair of the sequence gets marked, we set X1 to zero.) An error signal is generated only at the sequence end: the target is 0.5 + (X1 + X2)/4.0 (the sum X1 + X2 scaled to the interval [0, 1]). A sequence is processed correctly if the absolute error at the sequence end is below 0.04. (A generator sketch follows after the architecture description below.)

5.4.2 Architecture. We use a three-layer net with two input units, one output unit, and two cell blocks of size 2. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. All noninput units have bias weights. These architecture parameters make it easy to store at least two input signals (a cell block size of 1 works well, too). All activation functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2].
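A minimal Python sketch of our reading of the adding-problem data generator (all names are our own; the special cases follow the task description above):

```python
import random

def adding_problem(T):
    """One (inputs, target) pair for the adding problem.
    inputs: list of (value, marker) pairs; target lies in [0, 1]."""
    length = T + random.randint(0, T // 10)        # random length in [T, T + T/10]
    values = [random.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length
    markers[0] = markers[-1] = -1.0                # first and final pairs marked -1
    i1 = random.randrange(10)                      # one of the first 10 pairs
    i2 = random.choice([i for i in range(T // 2 - 1) if i != i1])
    markers[i1] = markers[i2] = 1.0
    x1 = 0.0 if i1 == 0 else values[i1]            # rare case: first pair marked
    x2 = values[i2]
    target = 0.5 + (x1 + x2) / 4.0                 # sum scaled to [0, 1]
    return list(zip(values, markers)), target
```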
Table 7: Experiment 4: Results for the Adding Problem.
T | Minimal Lag | Number of Weights | Number of Wrong Predictions | Success After
100 | 50 | 93 | 1 out of 2560 | 74,000
500 | 250 | 93 | 0 out of 2560 | 209,000
1000 | 500 | 93 | 1 out of 2560 | 853,000
Notes: T is the minimal sequence length, T/2 the minimal time lag. “Number of Wrong Predictions” is the number of incorrectly processed sequences (error > 0.04) from a test set containing 2560 sequences. The right-most column gives the number of training sequences required to achieve the stopping criterion. All values are means of 10 trials. For T = 1000 the number of required training examples varies between 370,000 and 2,020,000, exceeding 700,000 in only three cases.
5.4.3 State Drift Versus Initial Bias. Note that the task requires storing the precise values of real numbers for long durations; the system must learn to protect memory cell contents against even minor internal state drift (see section 4). To study the significance of the drift problem, we make the task even more difficult by biasing all noninput units, thus artificially inducing internal state drift. All weights (including the bias weights) are randomly initialized in the range [−0.1, 0.1]. Following section 4's remedy for state drifts, the first input gate bias is initialized with −3.0 and the second with −6.0 (though the precise values hardly matter, as confirmed by additional experiments).

5.4.4 Training/Testing. The learning rate is 0.5. Training is stopped once the average training error is below 0.01 and the 2000 most recent sequences were processed correctly.

5.4.5 Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than three incorrectly processed sequences. Table 7 shows details. The experiment demonstrates that LSTM is able to work well with distributed representations, that LSTM is able to learn to perform calculations involving continuous values, and, since the system manages to store continuous values without deterioration for minimal delays of T/2 time steps, that there is no significant, harmful internal state drift.

5.5 Experiment 5: Multiplication Problem. One may argue that LSTM is a bit biased toward tasks such as the adding problem from the previous subsection. Solutions to the adding problem may exploit the CEC's built-in integration capabilities. Although this CEC property may be viewed as a feature rather than a disadvantage (integration seems to be a natural subtask of many tasks occurring in the real world), the question arises whether LSTM can also solve tasks with inherently nonintegrative solutions. To test this, we change the problem by requiring the final target to equal the product (instead of the sum) of earlier marked inputs.
Table 8: Experiment 5: Results for the Multiplication Problem.
T | Minimal Lag | Number of Weights | nseq | Number of Wrong Predictions | MSE | Success After
100 | 50 | 93 | 140 | 139 out of 2560 | 0.0223 | 482,000
100 | 50 | 93 | 13 | 14 out of 2560 | 0.0139 | 1,273,000
Notes: T is the minimal sequence length and T/2 the minimal time lag. We test on a test set containing 2560 sequences as soon as fewer than nseq of the 2000 most recent training sequences lead to error > 0.04. "Number of Wrong Predictions" is the number of test sequences with error > 0.04. MSE is the mean squared error on the test set. The right-most column lists numbers of training sequences required to achieve the stopping criterion. All values are means of 10 trials.
5.5.1 Task. This is like the task in section 5.4, except that the first component of each pair is a real value randomly chosen from the interval [0, 1]. In the rare case where the first pair of the input sequence gets marked, we set X1 to 1.0. The target at the sequence end is the product X1 × X2.

5.5.2 Architecture. This is as in section 5.4. All weights (including the bias weights) are randomly initialized in the range [−0.1, 0.1].

5.5.3 Training/Testing. The learning rate is 0.1. We test performance twice: as soon as fewer than nseq of the 2000 most recent training sequences lead to absolute errors exceeding 0.04, for nseq = 140 and nseq = 13. Why these values? nseq = 140 is sufficient to learn storage of the relevant inputs, but it is not enough to fine-tune the precise final outputs; nseq = 13, however, leads to quite satisfactory results.

5.5.4 Results. For nseq = 140 (nseq = 13), with a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.026 (0.013), and there were never more than 170 (15) incorrectly processed sequences. Table 8 shows details. (A net with additional standard hidden units or with a hidden layer above the memory cells may learn the fine-tuning part more quickly.)

The experiment demonstrates that LSTM can solve tasks involving both continuous-valued representations and nonintegrative information processing.
5.6 Experiment 6: Temporal Order. In this subsection, LSTM solves other difficult (but artificial) tasks that have never been solved by previous recurrent net algorithms. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs.

5.6.1 Task 6a: Two Relevant, Widely Separated Symbols. The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one nonzero bit). The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for two elements at positions t1 and t2 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are four sequence classes, Q, R, S, U, which depend on the temporal order of X and Y. The rules are: X, X → Q; X, Y → R; Y, X → S; Y, Y → U. (A generator sketch follows after the architecture description below.)

5.6.2 Task 6b: Three Relevant, Widely Separated Symbols. Again, the goal is to classify sequences. Elements and targets are represented locally. The sequence starts with an E, ends with a B (the trigger symbol), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t1, t2, and t3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are eight sequence classes—Q, R, S, U, V, A, B, C—which depend on the temporal order of the Xs and Ys. The rules are: X, X, X → Q; X, X, Y → R; X, Y, X → S; X, Y, Y → U; Y, X, X → V; Y, X, Y → A; Y, Y, X → B; Y, Y, Y → C.

There are as many output units as there are classes. Each class is locally represented by a binary target vector with one nonzero component. With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.

Architecture. We use a three-layer net with eight input units, two (three) cell blocks of size 2, and four (eight) output units for task 6a (6b). Again, all noninput units have bias weights, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (the hidden layer is fully connected; less connectivity may work as well). The architecture parameters for task 6a (6b) make it easy to store at least two (three) input signals. All activation functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2].
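A Python sketch of the task 6a sequence generator (our own reading of the specification; in the experiments the symbols would be one-hot encoded over the eight-symbol alphabet, as stated above):

```python
import random

# Class label as a function of the temporal order of the two relevant symbols.
CLASS_OF = {("X", "X"): "Q", ("X", "Y"): "R", ("Y", "X"): "S", ("Y", "Y"): "U"}

def task_6a_sequence():
    """Returns (symbol sequence, class label) for task 6a."""
    length = random.randint(100, 110)
    t1 = random.randint(10, 20)                    # first relevant position
    t2 = random.randint(50, 60)                    # second relevant position
    seq = [random.choice("abcd") for _ in range(length)]
    seq[0], seq[-1] = "E", "B"                     # start symbol and trigger symbol
    relevant = (random.choice("XY"), random.choice("XY"))
    seq[t1], seq[t2] = relevant
    return seq, CLASS_OF[relevant]
```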
Table 9: Experiment 6: Results for the Temporal Order Problem.
Task | Number of Weights | Number of Wrong Predictions | Success After
Task 6a | 156 | 1 out of 2560 | 31,390
Task 6b | 308 | 2 out of 2560 | 571,100
Notes: “Number of Wrong Predictions” is the number of incorrectly classified sequences (error > 0.3 for at least one output unit) from a test set containing 2560 sequences. The right-most column gives the number of training sequences required to achieve the stopping criterion. The results for task 6a are means of 20 trials; those for task 6b of 10 trials.
Training/Testing. The learning rate is 0.5 (0.1) for experiment 6a (6b). Training is stopped once the average training error falls below 0.1 and the 2000 most recent sequences were classified correctly. All weights are initialized in the range [−0.1, 0.1]. The first input gate bias is initialized with −2.0, the second with −4.0, and (for experiment 6b) the third with −6.0 (again, we confirmed by additional experiments that the precise values hardly matter).

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than three incorrectly classified sequences. Table 9 shows details. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs. In task 6a, for instance, the delays between the first and second relevant input and between the second relevant input and the sequence end are at least 30 time steps.

Typical Solutions. In experiment 6a, how does LSTM distinguish between the temporal orders (X, Y) and (Y, X)? One of many possible solutions is to store the first X or Y in cell block 1 and the second X/Y in cell block 2. Before the first X/Y occurs, block 1 can see that it is still empty by means of its recurrent connections. After the first X/Y, block 1 can close its input gate. Once block 1 is filled and closed, this fact becomes visible to block 2 (recall that all gate units and all memory cells receive connections from all nonoutput units).

Typical solutions, however, require only one memory cell block. The block stores the first X or Y; once the second X/Y occurs, it changes its state depending on the first stored symbol. Solution type 1 exploits the connection between memory cell output and input gate unit. The following events cause different input gate activations: X occurs in conjunction with a filled block; X occurs in conjunction with an empty block. Solution type 2 is based on a strong, positive connection between memory cell output and memory cell input. The previous occurrence of X (Y) is represented by a positive (negative) internal state. Once the input gate opens for the second time, so does the output gate, and the memory cell output is fed back to its own input. This causes (X, Y) to be represented by a positive internal state, because X contributes to the new internal state twice (via the current internal state and the cell output feedback). Similarly, (Y, X) gets represented by a negative internal state.
Table 10: Summary of Experimental Conditions for LSTM, Part I.

(1) Task | (2) p | (3) lag | (4) b | (5) s | (6) in | (7) out | (8) w | (9) c | (10) ogb | (11) igb | (12) bias | (13) h | (14) g | (15) α
1-1 | 9 | 9 | 4 | 1 | 7 | 7 | 264 | F | −1, −2, −3, −4 | r | ga | h1 | g2 | 0.1
1-2 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.1
1-3 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.2
1-4 | 9 | 9 | 4 | 1 | 7 | 7 | 264 | F | −1, −2, −3, −4 | r | ga | h1 | g2 | 0.5
1-5 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.5
2a | 100 | 100 | 1 | 1 | 101 | 101 | 10,504 | B | no og | none | none | id | g1 | 1.0
2b | 100 | 100 | 1 | 1 | 101 | 101 | 10,504 | B | no og | none | none | id | g1 | 1.0
2c-1 | 50 | 50 | 2 | 1 | 54 | 2 | 364 | F | none | none | none | h1 | g2 | 0.01
2c-2 | 100 | 100 | 2 | 1 | 104 | 2 | 664 | F | none | none | none | h1 | g2 | 0.01
2c-3 | 200 | 200 | 2 | 1 | 204 | 2 | 1264 | F | none | none | none | h1 | g2 | 0.01
2c-4 | 500 | 500 | 2 | 1 | 504 | 2 | 3064 | F | none | none | none | h1 | g2 | 0.01
2c-5 | 1000 | 1000 | 2 | 1 | 1004 | 2 | 6064 | F | none | none | none | h1 | g2 | 0.01
2c-6 | 1000 | 1000 | 2 | 1 | 504 | 2 | 3064 | F | none | none | none | h1 | g2 | 0.01
2c-7 | 1000 | 1000 | 2 | 1 | 204 | 2 | 1264 | F | none | none | none | h1 | g2 | 0.01
2c-8 | 1000 | 1000 | 2 | 1 | 104 | 2 | 664 | F | none | none | none | h1 | g2 | 0.01
2c-9 | 1000 | 1000 | 2 | 1 | 54 | 2 | 364 | F | none | none | none | h1 | g2 | 0.01
3a | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 1.0
3b | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 1.0
3c | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 0.1
4-1 | 100 | 50 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
4-2 | 500 | 250 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
4-3 | 1000 | 500 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
5 | 100 | 50 | 2 | 2 | 2 | 1 | 93 | F | r | r | all | h1 | g2 | 0.1
6a | 100 | 40 | 2 | 2 | 8 | 4 | 156 | F | r | −2, −4 | all | h1 | g2 | 0.5
6b | 100 | 24 | 3 | 2 | 8 | 8 | 308 | F | r | −2, −4, −6 | all | h1 | g2 | 0.1
Notes: Col. 1: task number. Col. 2: minimal sequence length p. Col. 3: minimal number of steps between the most recent relevant input information and the teacher signal. Col. 4: number of cell blocks b. Col. 5: block size s. Col. 6: number of input units in. Col. 7: number of output units out. Col. 8: number of weights w. Col. 9: c describes connectivity: F means "output layer receives connections from memory cells; memory cells and gate units receive connections from input units, memory cells, and gate units"; B means "each layer receives connections from all layers below." Col. 10: initial output gate bias ogb, where r stands for "randomly chosen from the interval [−0.1, 0.1]" and "no og" means "no output gate used." Col. 11: initial input gate bias igb (see Col. 10). Col. 12: which units have bias weights? b1 stands for "all hidden units," ga for "only gate units," and all for "all noninput units." Col. 13: the function h, where id is the identity function and h1 is a sigmoid in [−1, 1]. Col. 14: the function g, where g1 is a logistic sigmoid in [0, 1] and g2 a sigmoid in [−2, 2]. Col. 15: learning rate α.
5.7 Summary of Experimental Conditions. Tables 10 and 11 provide an overview of the most important LSTM parameters and architectural details for experiments 1 through 6. The conditions of the simple experiments 2a
Table 11: Summary of Experimental Conditions for LSTM, Part II.

(1) Task | (2) Select | (3) Interval | (4) Test Set Size | (5) Stopping Criterion | (6) Success
1 | t1 | [−0.2, 0.2] | 256 | training and test sets correctly predicted | see text
2a | t1 | [−0.2, 0.2] | no test set | after 5 million exemplars | ABS(0.25)
2b | t2 | [−0.2, 0.2] | 10,000 | after 5 million exemplars | ABS(0.25)
2c | t2 | [−0.2, 0.2] | 10,000 | after 5 million exemplars | ABS(0.2)
3a | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | ABS(0.2)
3b | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | ABS(0.2)
3c | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | see text
4 | t3 | [−0.1, 0.1] | 2560 | ST3(0.01) | ABS(0.04)
5 | t3 | [−0.1, 0.1] | 2560 | see text | ABS(0.04)
6a | t3 | [−0.1, 0.1] | 2560 | ST3(0.1) | ABS(0.3)
6b | t3 | [−0.1, 0.1] | 2560 | ST3(0.1) | ABS(0.3)
Notes: Col. 1: task number. Col. 2: training exemplar selection, where t1 stands for "randomly chosen from the training set," t2 for "randomly chosen from two classes," and t3 for "randomly generated online." Col. 3: weight initialization interval. Col. 4: test set size. Col. 5: stopping criterion for training, where ST3(β) stands for "average training error below β and the 2000 most recent sequences were processed correctly." Col. 6: success (correct classification) criterion, where ABS(β) stands for "absolute error of all output units at sequence end is below β."
and 2b differ slightly from those of the other, more systematic experiments, for historical reasons.

6 Discussion

6.1 Limitations of LSTM.

• The particularly efficient truncated backpropagation version of the LSTM algorithm will not easily solve problems similar to strongly delayed XOR problems, where the goal is to compute the XOR of two widely separated inputs that previously occurred somewhere in a noisy sequence. The reason is that storing only one of the inputs will not help to reduce the expected error; the task is nondecomposable in the sense that it is impossible to reduce the error incrementally by first solving an easier subgoal. In theory, this limitation can be circumvented by using the full gradient (perhaps with additional conventional hidden units receiving input from the memory cells). But we do not recommend computing the full gradient for the following reasons: (1) it increases computational complexity, (2) constant error flow through CECs can be shown only for truncated LSTM, and (3) we actually did conduct a few experiments with nontruncated LSTM. There was no significant difference to truncated LSTM, exactly because outside the CECs, error flow tends to vanish quickly.
For the same reason, full BPTT does not outperform truncated BPTT.

• Each memory cell block needs two additional units (input and output gate). In comparison to standard recurrent nets, however, this does not increase the number of weights by more than a factor of 9: each conventional hidden unit is replaced by at most three units in the LSTM architecture, increasing the number of weights by a factor of $3^2$ in the fully connected case. Note, however, that our experiments use quite comparable weight numbers for the architectures of LSTM and competing approaches.

• Due to its constant error flow through CECs within memory cells, LSTM generally runs into problems similar to those of feedforward nets that see the entire input string at once. For instance, there are tasks that can be quickly solved by random weight guessing but not by the truncated LSTM algorithm with small weight initializations, such as the 500-step parity problem (see the introduction to section 5). Here, LSTM's problems are similar to the ones of a feedforward net with 500 inputs, trying to solve 500-bit parity. Indeed LSTM typically behaves much like a feedforward net trained by backpropagation that sees the entire input. But that is also precisely why it so clearly outperforms previous approaches on many nontrivial tasks with significant search spaces.

• LSTM does not have any problems with the notion of recency that go beyond those of other approaches. All gradient-based approaches, however, suffer from a practical inability to count discrete time steps precisely. If it makes a difference whether a certain signal occurred 99 or 100 steps ago, then an additional counting mechanism seems necessary. Easier tasks, however, such as one that requires distinguishing only between, say, 3 and 11 steps, do not pose any problems to LSTM. For instance, by generating an appropriate negative connection between memory cell output and input, LSTM can give more weight to recent inputs and learn decays where necessary.

6.2 Advantages of LSTM.

• The constant error backpropagation within memory cells results in LSTM's ability to bridge very long time lags in case of problems similar to those discussed above.

• For long-time-lag problems such as those discussed in this article, LSTM can handle noise, distributed representations, and continuous values. In contrast to finite state automata or hidden Markov models, LSTM does not require an a priori choice of a finite number of states. In principle, it can deal with unlimited state numbers.
• For problems discussed in this article, LSTM generalizes well, even if the positions of widely separated, relevant inputs in the input sequence do not matter. Unlike previous approaches, ours quickly learns to distinguish between two or more widely separated occurrences of a particular element in an input sequence, without depending on appropriate short-time-lag training exemplars.

• There appears to be no need for parameter fine-tuning. LSTM works well over a broad range of parameters such as learning rate, input gate bias, and output gate bias. For instance, to some readers the learning rates used in our experiments may seem large. However, a large learning rate pushes the output gates toward zero, thus automatically countermanding its own negative effects.

• The LSTM algorithm's update complexity per weight and time step is essentially that of BPTT, namely, O(1). This is excellent in comparison to other approaches such as RTRL. Unlike full BPTT, however, LSTM is local in both space and time.

7 Conclusion

Each memory cell's internal architecture guarantees constant error flow within its CEC, provided that truncated backpropagation cuts off error flow trying to leak out of memory cells. This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell's CEC. The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Similarly, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.

To find out about LSTM's practical limitations, we intend to apply it to real-world data. Application areas will include time-series prediction, music composition, and speech processing. It will also be interesting to augment sequence chunkers (Schmidhuber, 1992b, 1993) with LSTM to combine the advantages of both.

Appendix

A.1 Algorithm Details. In what follows, the index k ranges over output units, i ranges over hidden units, $c_j$ stands for the jth memory cell block, $c_j^v$ denotes the vth unit of memory cell block $c_j$, u, l, m stand for arbitrary units, and t ranges over all time steps of a given input sequence.

The gate unit logistic sigmoid (with range [0, 1]) used in the experiments is

$$f(x) = \frac{1}{1 + \exp(-x)}. \tag{A.1}$$
The function h (with range [−1, 1]) used in the experiments is

$$h(x) = \frac{2}{1 + \exp(-x)} - 1. \tag{A.2}$$

The function g (with range [−2, 2]) used in the experiments is

$$g(x) = \frac{4}{1 + \exp(-x)} - 2. \tag{A.3}$$
A.1.1 Forward Pass. The net input and the activation of hidden unit i are

$$net_i(t) = \sum_u w_{iu}\, y^u(t-1), \qquad y^i(t) = f_i(net_i(t)). \tag{A.4}$$

The net input and the activation of $in_j$ are

$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1), \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)). \tag{A.5}$$

The net input and the activation of $out_j$ are

$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)). \tag{A.6}$$

The net input $net_{c_j^v}$, the internal state $s_{c_j^v}$, and the output activation $y^{c_j^v}$ of the vth memory cell of memory cell block $c_j$ are

$$net_{c_j^v}(t) = \sum_u w_{c_j^v u}\, y^u(t-1),$$
$$s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right), \tag{A.7}$$
$$y^{c_j^v}(t) = y^{out_j}(t)\, h\!\left(s_{c_j^v}(t)\right).$$

The net input and the activation of output unit k are

$$net_k(t) = \sum_{u:\ u \text{ not a gate}} w_{ku}\, y^u(t-1), \qquad y^k(t) = f_k(net_k(t)).$$

The backward pass to be described later is based on the following truncated backpropagation formulas.
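Before turning to those formulas, here is a minimal Python sketch of the forward pass of equations A.1 through A.7 for a single memory cell block of size 1 (all variable and key names are our own; a full implementation would loop over blocks and cells):

```python
import math

def f(x):   # gate/unit logistic sigmoid, range [0, 1] (A.1)
    return 1.0 / (1.0 + math.exp(-x))

def h(x):   # output squashing function, range [-1, 1] (A.2)
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def g(x):   # input squashing function, range [-2, 2] (A.3)
    return 4.0 / (1.0 + math.exp(-x)) - 2.0

def cell_forward(y_prev, s_prev, W):
    """One time step for one memory cell (equations A.4-A.7).
    y_prev: source activations y^u(t-1); s_prev: internal state s_c(t-1);
    W: dict of weight vectors, keyed by 'in', 'out', 'c'."""
    net_in = sum(w * y for w, y in zip(W["in"], y_prev))
    net_out = sum(w * y for w, y in zip(W["out"], y_prev))
    net_c = sum(w * y for w, y in zip(W["c"], y_prev))
    y_in, y_out = f(net_in), f(net_out)   # gate activations (A.5, A.6)
    s = s_prev + y_in * g(net_c)          # CEC accumulates; no decay (A.7)
    y_c = y_out * h(s)                    # gated cell output (A.7)
    return y_c, s, (net_in, net_out, net_c, y_in, y_out)
```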
A.1.2 Approximate Derivatives for Truncated Backpropagation. The truncated version (see section 4) only approximates the partial derivatives, which is reflected by the $\approx_{tr}$ signs in the notation below. It truncates error flow once it leaves memory cells or gate units. Truncation ensures that there are no loops across which an error that left some memory cell through its input or input gate can reenter the cell through its output or output gate. This in turn ensures constant error flow through the memory cell's CEC.

In the truncated backpropagation version, the following derivatives are replaced by zero:

$$\frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u, \qquad \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

and

$$\frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u.$$

Therefore we get

$$\frac{\partial y^{in_j}(t)}{\partial y^u(t-1)} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

$$\frac{\partial y^{out_j}(t)}{\partial y^u(t-1)} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

and

$$\frac{\partial y^{c_j}(t)}{\partial y^u(t-1)} = \frac{\partial y^{c_j}(t)}{\partial net_{out_j}(t)} \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{in_j}(t)} \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{c_j}(t)} \frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u.$$

This implies for all $w_{lm}$ not on connections to $c_j^v$, $in_j$, $out_j$ (that is, $l \notin \{c_j^v, in_j, out_j\}$):

$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \sum_u \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)} \frac{\partial y^u(t-1)}{\partial w_{lm}} \approx_{tr} 0.$$

The truncated derivatives of output unit k are

$$\frac{\partial y^k(t)}{\partial w_{lm}} = f'_k(net_k(t)) \left( \sum_{u:\ u \text{ not a gate}} w_{ku} \frac{\partial y^u(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$\approx_{tr} f'_k(net_k(t)) \left( \sum_j \sum_{v=1}^{S_j} \delta_{c_j^v l}\, w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_j \left( \delta_{in_j l} + \delta_{out_j l} \right) \sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$= f'_k(net_k(t)) \cdot \begin{cases} y^m(t-1) & l = k \\[2pt] w_{k c_j^v} \dfrac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = c_j^v \\[2pt] \displaystyle\sum_{v=1}^{S_j} w_{k c_j^v} \dfrac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = in_j \text{ or } l = out_j \\[2pt] \displaystyle\sum_{i:\ i \text{ hidden unit}} w_{ki} \dfrac{\partial y^i(t-1)}{\partial w_{lm}} & \text{otherwise} \end{cases} \tag{A.8}$$

where $\delta$ is the Kronecker delta ($\delta_{ab} = 1$ if $a = b$ and 0 otherwise), and $S_j$ is the size of memory cell block $c_j$. The truncated derivatives of a hidden unit i that is not part of a memory cell are

$$\frac{\partial y^i(t)}{\partial w_{lm}} = f'_i(net_i(t)) \frac{\partial net_i(t)}{\partial w_{lm}} \approx_{tr} \delta_{li}\, f'_i(net_i(t))\, y^m(t-1). \tag{A.9}$$

(Here it would be possible to use the full gradient without affecting constant error flow through internal states of memory cells.) Cell block $c_j$'s truncated derivatives are

$$\frac{\partial y^{in_j}(t)}{\partial w_{lm}} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1), \tag{A.10}$$

$$\frac{\partial y^{out_j}(t)}{\partial w_{lm}} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{out_j l}\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1), \tag{A.11}$$

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$\approx_{tr} \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l}\, \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$= \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, g\!\left(net_{c_j^v}(t)\right) y^m(t-1) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) y^m(t-1), \tag{A.12}$$

$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h\!\left(s_{c_j^v}(t)\right) + y^{out_j}(t)\, h'\!\left(s_{c_j^v}(t)\right) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}$$
$$\approx_{tr} \delta_{out_j l}\, \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h\!\left(s_{c_j^v}(t)\right) + \left( \delta_{in_j l} + \delta_{c_j^v l} \right) y^{out_j}(t)\, h'\!\left(s_{c_j^v}(t)\right) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{A.13}$$
To update the system efficiently at time t, the only (truncated) derivatives that need to be stored at time t − 1 are $\partial s_{c_j^v}(t-1) / \partial w_{lm}$, where $l = c_j^v$ or $l = in_j$.

A.1.3 Backward Pass. We will describe the backward pass only for the particularly efficient truncated gradient version of the LSTM algorithm. For simplicity we will use equal signs even where approximations are made according to the truncated backpropagation equations above. The squared error at time t is given by

$$E(t) = \sum_{k:\ k \text{ output unit}} \left( t^k(t) - y^k(t) \right)^2, \tag{A.14}$$

where $t^k(t)$ is output unit k's target at time t. Time t's contribution to $w_{lm}$'s gradient-based update with learning rate $\alpha$ is

$$\Delta w_{lm}(t) = -\alpha \frac{\partial E(t)}{\partial w_{lm}}. \tag{A.15}$$

We define some unit l's error at time step t by

$$e_l(t) := -\frac{\partial E(t)}{\partial net_l(t)}. \tag{A.16}$$

Using (almost) standard backpropagation, we first compute updates for weights to output units (l = k), weights to hidden units (l = i), and weights to output gates ($l = out_j$).
We obtain (compare formulas A.8, A.9, and A.11)

$$l = k \text{ (output units)}: \quad e_k(t) = f'_k(net_k(t)) \left( t^k(t) - y^k(t) \right), \tag{A.17}$$

$$l = i \text{ (hidden units)}: \quad e_i(t) = f'_i(net_i(t)) \sum_{k:\ k \text{ output unit}} w_{ki}\, e_k(t), \tag{A.18}$$

$$l = out_j \text{ (output gates)}: \quad e_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \sum_{v=1}^{S_j} h\!\left(s_{c_j^v}(t)\right) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \tag{A.19}$$
For all possible l, time t's contribution to $w_{lm}$'s update is

$$\Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1). \tag{A.20}$$
The remaining updates for weights to input gates ($l = in_j$) and to cell units ($l = c_j^v$) are less conventional. We define the internal state $s_{c_j^v}$'s error:

$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} = f_{out_j}(net_{out_j}(t))\, h'\!\left(s_{c_j^v}(t)\right) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \tag{A.21}$$
We obtain for $l = in_j$ or $l = c_j^v$, $v = 1, \dots, S_j$,

$$-\frac{\partial E(t)}{\partial w_{lm}} = \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{A.22}$$
The derivatives of the internal states with respect to weights and the corresponding weight updates are as follows (compare expression A.12). For $l = in_j$ (input gates):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}} + g\!\left(net_{c_j^v}(t)\right) f'_{in_j}(net_{in_j}(t))\, y^m(t-1); \tag{A.23}$$

therefore, time t's contribution to $w_{in_j m}$'s update is (compare expression A.8)

$$\Delta w_{in_j m}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}}. \tag{A.24}$$
Similarly, we get (compare expression A.12), for $l = c_j^v$ (memory cells):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}} + g'\!\left(net_{c_j^v}(t)\right) f_{in_j}(net_{in_j}(t))\, y^m(t-1); \tag{A.25}$$

therefore, time t's contribution to $w_{c_j^v m}$'s update is (compare expression A.8)

$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}. \tag{A.26}$$
All we need to implement for the backward pass are equations A.17 through A.21 and A.23 through A.26. Each weight's total update is the sum of the contributions of all time steps.

A.1.4 Computational Complexity. LSTM's update complexity per time step is

$$O(KH + KCS + HI + CSI) = O(W), \tag{A.27}$$
where K is the number of output units, C is the number of memory cell blocks, S > 0 is the size of the memory cell blocks, H is the number of hidden units, I is the (maximal) number of units forward connected to memory cells, gate units, and hidden units, and W = KH + KCS + CSI + 2CI + HI = O(KH + KCS + CSI + HI) is the number of weights. Expression A.27 is obtained by considering all computations of the backward pass: equation A.17 needs K steps; A.18 needs KH steps; A.19 needs KSC steps; A.20 needs K(H + C) steps for output units, HI steps for hidden units, and CI steps for output gates; A.21 needs KCS steps; A.23 needs CSI steps; A.24 needs CSI steps; A.25 needs CSI steps; A.26 needs CSI steps. The total is K + 2KH + KC + 2KSC + HI + CI + 4CSI steps, or O(KH + KSC + HI + CSI) steps. We conclude that the LSTM algorithm's update complexity per time step is just like BPTT's for a fully recurrent net. At a given time step, only the 2CSI most recent $\partial s_{c_j^v} / \partial w_{lm}$ values from equations A.23 and A.25 need to be stored. Hence LSTM's storage complexity also is O(W); it does not depend on the input sequence length.

A.2 Error Flow. We compute how much an error signal is scaled while flowing back through a memory cell for q time steps. As a by-product, this analysis reconfirms that the error flow within a memory cell's CEC is indeed constant, provided that truncated backpropagation cuts off error flow trying to leave memory cells (see also section 3.2). The analysis also highlights a potential for undesirable long-term drifts of $s_{c_j}$, as well as the beneficial, countermanding influence of negatively biased input gates.
Using the truncated backpropagation learning rule, we obtain

$$\frac{\partial s_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} = 1 + \frac{\partial y^{in_j}(t-k)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \frac{\partial net_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)}$$
$$= 1 + \left[ \sum_u \frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \right] g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \left[ \sum_u \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \right]$$
$$\approx_{tr} 1. \tag{A.28}$$
The $\approx_{tr}$ sign indicates equality due to the fact that truncated backpropagation replaces the following derivatives by zero:

$$\frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \quad \forall u \qquad \text{and} \qquad \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \quad \forall u.$$

In what follows, an error $\vartheta_j(t)$ starts flowing back at $c_j$'s output. We redefine

$$\vartheta_j(t) := \sum_i w_{i c_j}\, \vartheta_i(t+1). \tag{A.29}$$
Following the definitions and conventions of section 3.1, we compute error flow for the truncated backpropagation learning rule. The error occurring at the output gate is

$$\vartheta_{out_j}(t) \approx_{tr} \frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)} \frac{\partial y^{c_j}(t)}{\partial y^{out_j}(t)}\, \vartheta_j(t). \tag{A.30}$$
The error occurring at the internal state is

$$\vartheta_{s_{c_j}}(t) = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)}\, \vartheta_{s_{c_j}}(t+1) + \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)}\, \vartheta_j(t)\,. \qquad (A.31)$$

Since we use truncated backpropagation, we have

$$\vartheta_j(t) = \sum_{i:\, i \text{ no gate and no memory cell}} w_{i c_j}\, \vartheta_i(t+1)\,;$$
therefore we get

$$\frac{\partial \vartheta_j(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \sum_i w_{i c_j}\, \frac{\partial \vartheta_i(t+1)}{\partial \vartheta_{s_{c_j}}(t+1)} \approx_{tr} 0\,. \qquad (A.32)$$
Equations A.31 and A.32 imply constant error flow through internal states of memory cells:

$$\frac{\partial \vartheta_{s_{c_j}}(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)} \approx_{tr} 1\,. \qquad (A.33)$$
The error occurring at the memory cell input is

$$\vartheta_{c_j}(t) = \frac{\partial g(net_{c_j}(t))}{\partial net_{c_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial g(net_{c_j}(t))}\, \vartheta_{s_{c_j}}(t)\,. \qquad (A.34)$$
The error occurring at the input gate is

$$\vartheta_{in_j}(t) \approx_{tr} \frac{\partial y^{in_j}(t)}{\partial net_{in_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial y^{in_j}(t)}\, \vartheta_{s_{c_j}}(t)\,. \qquad (A.35)$$
A.2.1 No External Error Flow. Errors are propagated back from units l to unit v along outgoing connections with weights $w_{lv}$. This "external error" (note that for conventional units there is nothing but external error) at time t is

$$\vartheta_v^e(t) = \frac{\partial y^v(t)}{\partial net_v(t)} \sum_l \frac{\partial net_l(t+1)}{\partial y^v(t)}\, \vartheta_l(t+1)\,. \qquad (A.36)$$
We obtain

$$\frac{\partial \vartheta_v^e(t-1)}{\partial \vartheta_j(t)} = \frac{\partial y^v(t-1)}{\partial net_v(t-1)} \left( \frac{\partial \vartheta_{out_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{out_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{c_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{c_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{in_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{in_j}(t)}{\partial y^v(t-1)} \right) \approx_{tr} 0\,. \qquad (A.37)$$
We observe that the error $\vartheta_j$ arriving at the memory cell output is not backpropagated to units v by external connections to $in_j$, $out_j$, $c_j$.

A.2.2 Error Flow Within Memory Cells. We now focus on the error backflow within a memory cell's CEC. This is actually the only type of error flow that can bridge several time steps. Suppose error $\vartheta_j(t)$ arrives at $c_j$'s output
at time t and is propagated back for q steps until it reaches $in_j$ or the memory cell input $g(net_{c_j})$. It is scaled by a factor of $\partial \vartheta_v(t-q) / \partial \vartheta_j(t)$, where $v = in_j, c_j$. We first compute

$$\frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr}
\begin{cases}
\dfrac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} & q = 0 \\[2ex]
\dfrac{\partial s_{c_j}(t-q+1)}{\partial s_{c_j}(t-q)}\, \dfrac{\partial \vartheta_{s_{c_j}}(t-q+1)}{\partial \vartheta_j(t)} & q > 0
\end{cases}\,. \qquad (A.38)$$

Expanding equation A.38, we obtain

$$\begin{aligned}
\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_j(t)}
&\approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)}\, \frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)}
\approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)} \left( \prod_{m=q}^{1} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)} \right) \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} \\
&\approx_{tr} y^{out_j}(t)\, h'(s_{c_j}(t))
\begin{cases}
g'\big(net_{c_j}(t-q)\big)\, y^{in_j}(t-q) & v = c_j \\
g\big(net_{c_j}(t-q)\big)\, f'_{in_j}\big(net_{in_j}(t-q)\big) & v = in_j
\end{cases}\,. \qquad (A.39)
\end{aligned}$$

Consider the factors in the previous equation's last expression. Obviously, error flow is scaled only at times t (when it enters the cell) and t − q (when it leaves the cell), but not in between (constant error flow through the CEC); a toy numerical illustration of this q-independence follows the list below. We observe:

1. The output gate's effect: $y^{out_j}(t)$ scales down those errors that can be reduced early during training without using the memory cell. It also scales down those errors resulting from using (activating/deactivating) the memory cell at later training stages. Without the output gate, the memory cell might, for instance, suddenly start causing avoidable errors in situations that already seemed under control (because it was easy to reduce the corresponding errors without memory cells). See "Output Weight Conflict" in section 3 and "Abuse Problem and Solution" (section 4.7).

2. If there are large positive or negative $s_{c_j}(t)$ values (because $s_{c_j}$ has drifted since time step t − q), then $h'(s_{c_j}(t))$ may be small (assuming that h is a logistic sigmoid). See section 4. Drifts of the memory cell's internal state $s_{c_j}$ can be countermanded by negatively biasing the input gate $in_j$ (see section 4 and the next point). Recall from section 4 that the precise bias value does not matter much.

3. $y^{in_j}(t-q)$ and $f'_{in_j}(net_{in_j}(t-q))$ are small if the input gate is negatively biased (assume $f_{in_j}$ is a logistic sigmoid). However, the potential significance of this is negligible compared to the potential significance of drifts of the internal state $s_{c_j}$.
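The following toy sketch (ours, not from the paper) evaluates the $v = c_j$ scaling factor of equation A.39 for several lag lengths q, taking g, h, and $f_{in_j}$ to be plain logistic sigmoids as the observations above assume; since the activations at times t and t − q are held fixed, the printed factor is identical for every q:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cec_scaling(q, y_out_t, s_t, net_c_tq, net_in_tq):
    """Error scaling factor of equation A.39 for v = c_j. Note that q
    never enters the computation: the CEC passes the error unchanged."""
    h_prime = sigmoid(s_t) * (1.0 - sigmoid(s_t))            # h'(s(t))
    g_prime = sigmoid(net_c_tq) * (1.0 - sigmoid(net_c_tq))  # g'(net(t-q))
    y_in_tq = sigmoid(net_in_tq)                             # gate at t-q
    return y_out_t * h_prime * g_prime * y_in_tq

for q in (1, 10, 1000):
    print(q, cec_scaling(q, y_out_t=0.5, s_t=0.3,
                         net_c_tq=0.2, net_in_tq=-1.0))
```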
Some of the factors above may scale down LSTM's overall error flow, but not in a manner that depends on the length of the time lag. The flow will still be much more effective than an exponentially (of order q) decaying flow without memory cells.

Acknowledgments

Thanks to Mike Mozer, Wilfried Brauer, Nic Schraudolph, and several anonymous referees for valuable comments and suggestions that helped to improve a previous version of this article (Hochreiter and Schmidhuber, 1995). This work was supported by DFG grant SCHM 942/3-1 from Deutsche Forschungsgemeinschaft.

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE 1st International Conference on Neural Networks, San Diego (Vol. 2, pp. 609–618).

Baldi, P., & Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation, 3, 526–545.

Bengio, Y., & Frasconi, P. (1994). Credit assignment through time: Alternatives to backpropagation. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6 (pp. 75–82). San Mateo, CA: Morgan Kaufmann.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372–381.

de Vries, B., & Principe, J. C. (1991). A theory for neural networks with time delays. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 162–168). San Mateo, CA: Morgan Kaufmann.

Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proceedings of 1992 IEEE International Symposium on Circuits and Systems (pp. 2777–2780).

Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time backpropagation learning. Neural Networks, 2, 375–385.

Elman, J. L. (1988). Finding structure in time (Tech. Rep. No. CRL 8801). San Diego: Center for Research in Language, University of California, San Diego.

Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 190–196). San Mateo, CA: Morgan Kaufmann.
Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. See http://www7.informatik.tu-muenchen.de/˜hochreit.

Hochreiter, S., & Schmidhuber, J. (1995). Long short-term memory (Tech. Rep. No. FKI-207-95). Fakultät für Informatik, Technische Universität München.

Hochreiter, S., & Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "long short-term memory." In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal models in biological and artificial systems (pp. 65–72). Amsterdam: IOS Press.

Hochreiter, S., & Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in neural information processing systems 9. Cambridge, MA: MIT Press.

Lang, K., Waibel, A., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3, 23–43.

Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7, 1329–1338.

Miller, C. B., & Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 849–872.

Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3, 349–381.

Mozer, M. C. (1992). Induction of multiscale temporal structure. In J. E. Moody, S. J. Hanson, & R. P. Lippman (Eds.), Advances in neural information processing systems 4 (pp. 275–282). San Mateo, CA: Morgan Kaufmann.

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2), 263–269.

Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.

Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19), 2229–2232.

Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity, 4, 216–245.

Plate, T. A. (1993). Holographic recurrent networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 34–41). San Mateo, CA: Morgan Kaufmann.

Pollack, J. B. (1991). Language induction by phase transition in dynamical recognizers. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 619–626). San Mateo, CA: Morgan Kaufmann.

Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.

Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 115–122). San Mateo, CA: Morgan Kaufmann.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department.

Schmidhuber, J. (1989). A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4), 403–412.

Schmidhuber, J. (1992a). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248.

Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 234–242.

Schmidhuber, J. (1992c). Learning unambiguous reduced sequence descriptions. In J. E. Moody, S. J. Hanson, & R. P. Lippman (Eds.), Advances in neural information processing systems 4 (pp. 291–298). San Mateo, CA: Morgan Kaufmann.

Schmidhuber, J. (1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München.

Schmidhuber, J., & Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms (Tech. Rep. No. IDSIA-19-96). Lugano, Switzerland: Istituto Dalle Molle di Studi sull'Intelligenza Artificiale.

Silva, G. X., Amaral, J. D., Langlois, T., & Almeida, L. B. (1996). Faster training of recurrent networks. In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal models in biological and artificial systems (pp. 168–175). Amsterdam: IOS Press.

Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2), 125–131.

Sun, G., Chen, H., & Lee, Y. (1993). Time warping invariant neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 180–187). San Mateo, CA: Morgan Kaufmann.

Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4, 406–414.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–356.

Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks (Tech. Rep. No. NU-CCS-89-27). Boston: Northeastern University, College of Computer Science.

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4, 491–501.

Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.
Received August 28, 1995; accepted February 24, 1997.
Communicated by Robert Jacobs
Factor Analysis Using Delta-Rule Wake-Sleep Learning

Radford M. Neal
Department of Statistics and Department of Computer Science, University of Toronto, Toronto M5S 1A1, Canada
Peter Dayan Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, 02139 U.S.A.
We describe a linear network that models correlations between real-valued visible variables using one or more real-valued hidden variables—a factor analysis model. This model can be seen as a linear version of the Helmholtz machine, and its parameters can be learned using the wake-sleep method, in which learning of the primary generative model is assisted by a recognition model, whose role is to fill in the values of hidden variables based on the values of visible variables. The generative and recognition models are jointly learned in wake and sleep phases, using just the delta rule. This learning procedure is comparable in simplicity to Hebbian learning, which produces a somewhat different representation of correlations in terms of principal components. We argue that the simplicity of wake-sleep learning makes factor analysis a plausible alternative to Hebbian learning as a model of activity-dependent cortical plasticity.
1 Introduction

Statistical structure in a collection of inputs can be found using purely local Hebbian learning rules (Hebb, 1949), which capture second-order aspects of the data in terms of principal components (Linsker, 1988; Oja, 1989). This form of statistical analysis has therefore been used to provide a computational account of activity-dependent plasticity in the vertebrate brain (e.g., von der Malsburg, 1973; Linsker, 1986; Miller, Keller, & Stryker, 1989). There are reasons, however, that principal component analysis may not be an adequate model for the generation of cortical receptive fields (e.g., Olshausen & Field, 1996). Furthermore, a Hebbian mechanism for performing principal component analysis would accord no role to the top-down (feedback) connections that always accompany bottom-up connections (Felleman & Van Essen, 1991). Hebbian learning must also be augmented by extra mechanisms in order to extract more than just the first principal
component and in order to prevent the synaptic weights from growing without bound.

In this article, we present the statistical technique of factor analysis as an alternative to principal component analysis and show how factor analysis models can be learned using an algorithm whose demands on synaptic plasticity are as local as those of the Hebb rule. Our approach follows the suggestion of Hinton and Zemel (1994) (see also Grenander, 1976–1981; Mumford, 1994; Dayan, Hinton, Neal, & Zemel, 1995; Olshausen & Field, 1996) that the top-down connections in the cortex might be constructing a hierarchical, probabilistic "generative" model, in which dependencies between the activities of neurons reporting sensory data are seen as being due to the presence in the world of hidden or latent "factors." The role of the bottom-up connections is to implement an inverse "recognition" model, which takes low-level sensory input and instantiates in the activities of neurons in higher layers a set of likely values for the hidden factors, which are capable of explaining this input. These higher-level activities are meant to correspond to significant features of the external world, and thus to form an appropriate basis for behavior. We refer to such a combination of generative and recognition models as a Helmholtz machine.

The simplest form of Helmholtz machine, with just two layers, linear units, and gaussian noise, is equivalent to the factor analysis model, a statistical method that is used widely in psychology and the social sciences as a way of exploring whether observed patterns in data might be explainable in terms of a small number of unobserved factors. Everitt (1984) gives a good introduction to factor analysis and to other latent variable models. Even though factor analysis involves only linear operations and gaussian noise, learning a factor analysis model is not computationally straightforward; practical algorithms for maximum likelihood factor analysis (Jöreskog, 1967, 1969, 1977) took many years to develop. Existing algorithms change the values of parameters based on complex and nonlocal calculations.

A general approach to learning in Helmholtz machines that is attractive for its simplicity is the wake-sleep algorithm of Hinton, Dayan, Frey, and Neal (1995). We show empirically in this article that maximum likelihood factor analysis models can be learned by the wake-sleep method, which uses just the purely local delta rule in both of its two separate learning phases. The wake-sleep algorithm has previously been applied to the more difficult task of learning nonlinear models with binary latent variables, with mixed results. We have found that good results are obtained much more consistently when the wake-sleep algorithm is used to learn factor analysis models, perhaps because settings of the recognition model parameters that invert the generative model always exist. In the factor analysis models we look at in this article, the factors are a priori independent, and this simplicity prevents them from reproducing interesting aspects of cortical structure. However, these results contribute to
the understanding of wake-sleep learning. They also point to possibilities for modeling cortical plasticity, since the wake-sleep approach avoids many of the criticisms that can be leveled at Hebbian learning.

2 Factor Analysis

Factor analysis with a single hidden factor is based on a generative model for the distribution of a vector of real-valued visible inputs, x, given by

$$x = \mu + g y + \epsilon. \qquad (2.1)$$
Here, y is the single real-valued factor and is assumed to have a gaussian distribution with mean zero and variance one. The vector of "factor loadings," g, which we refer to as the generative weights, expresses the way the visible variables are related to the hidden factor. The vector of overall means, µ, will, for simplicity of presentation, be taken to be zero in this article, unless otherwise stated. Finally, the noise, ε, is assumed to be gaussian, with a diagonal covariance matrix, Ψ, which we will write as

$$\Psi = \begin{bmatrix} \tau_1^2 & 0 & \cdots & 0 \\ 0 & \tau_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tau_n^2 \end{bmatrix}. \qquad (2.2)$$
The $\tau_j^2$ are sometimes called the "uniquenesses," as they represent the portion of the variance in each visible variable that is unique to it rather than being explained by the common factor. We will refer to them as the generative variance parameters (not to be confused with the variance of the hidden factor in the generative model, which is fixed at one). The model parameters, µ, g, and Ψ, define a joint gaussian distribution for both the hidden factor, y, and the visible inputs, x. In a Helmholtz machine, this generative model is accompanied by a recognition model, which represents the conditional distribution for the hidden factor, y, given a particular input vector, x. This recognition model also has a simple form, which, for µ = 0, is

$$y = r^T x + \nu, \qquad (2.3)$$

where r is the vector of recognition weights and ν has a gaussian distribution with mean zero and variance $\sigma^2$. It is straightforward to show that the correct recognition model has $r = [g g^T + \Psi]^{-1} g$ and $\sigma^2 = 1 - g^T [g g^T + \Psi]^{-1} g$, but we presume that directly obtaining the recognition model's parameters in this way is not plausible in a neurobiological model.
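The closed-form recognition model above is easy to verify numerically. The following sketch (ours, not from the paper) draws data from a random single-factor generative model and checks that the residual y − rᵀx has variance σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
g = rng.standard_normal(p)               # generative weights
Psi = np.diag(rng.exponential(1.0, p))   # diagonal noise covariance

C = np.outer(g, g) + Psi                 # Cov(x) = g g^T + Psi
r = np.linalg.solve(C, g)                # r = [g g^T + Psi]^{-1} g
sigma2 = 1.0 - g @ r                     # recognition variance

# Monte Carlo check: Var(y - r^T x) should equal sigma2.
n = 200_000
y = rng.standard_normal(n)
x = np.outer(y, g) + rng.standard_normal((n, p)) * np.sqrt(np.diag(Psi))
print(np.var(y - x @ r), sigma2)         # the two values agree closely
```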
The factor analysis model can be extended to include several hidden factors, which are usually assumed to have independent gaussian distributions with mean zero and variance one (though we discuss other possibilities in section 5.2). These factors jointly produce a distribution for the visible variables, as follows:

$$x = \mu + G y + \epsilon. \qquad (2.4)$$
Here, y is the vector of hidden factor values, G is a matrix of generative weights (factor loadings), and ε is again a noise vector with diagonal covariance. A linear recognition model can again be found to represent the conditional distribution of the hidden factors for given values of the visible variables. Note that there is redundancy in the definition of the multiple-factor model, caused by the symmetry of the distribution of y in the generative model. A model with generative weights $G' = G U^T$, where U is any unitary matrix (for which $U^T U = I$), will produce the same distribution for x, with the hidden factors $y' = U y$ again being independent, with mean zero and variance one. The presence of multiple solutions is sometimes seen as a problem with factor analysis in statistical applications, since it makes interpretation more difficult, but this is hardly an issue in neurobiological modeling, since easy interpretability of neural activities is not something that we are at liberty to require.

Although there are some special circumstances in which factor analysis is equivalent to principal component analysis, the techniques are in general quite different (Jolliffe, 1986). Loosely speaking, principal component analysis pays attention to both variance and covariance, whereas factor analysis looks only at covariance. In particular, if one of the components of x is corrupted by a large amount of noise, the principal eigenvector of the covariance matrix of the inputs will have a substantial component in the direction of that input. Hebbian learning will therefore result in the output, y, being dominated by this noise. In contrast, factor analysis uses $\epsilon_j$ to model any noise that affects only component j. A large amount of noise simply results in the corresponding $\tau_j^2$ being large, with no effect on the output y. Principal component analysis is also unaffected by a rotation of the coordinate system of the input vectors, whereas factor analysis privileges the particular coordinate system in which the data are presented, because it is assumed in equation 2.2 that it is in this coordinate system that the components of the noise are independent.

3 Learning Factor Analysis Models with the Wake-Sleep Algorithm

In this section, we describe wake-sleep algorithms for learning factor analysis models, starting with the simplest such model, having a single hidden factor. We also provide an intuitive justification for believing that these algorithms might work, based on their resemblance to the Expectation-Maximization (EM) algorithm. We have not found a complete theoretical
proof that wake-sleep learning for factor analysis works, but we present empirical evidence that it usually does in section 4.

3.1 Maximum Likelihood and the EM Algorithm. In maximum likelihood learning, the parameters of the model are chosen to maximize the probability density assigned by the model to the data that were observed (the "likelihood"). For a factor analysis model with a single factor, the likelihood, L, based on the observed data in n independent cases, $x^{(1)}, \ldots, x^{(n)}$, is obtained by integrating over the possible values that the hidden factors, $y^{(c)}$, might take on for each case:

$$L(g, \Psi) = \prod_{c=1}^{n} p(x^{(c)} \mid g, \Psi) = \prod_{c=1}^{n} \int p(x^{(c)} \mid y^{(c)}, g, \Psi)\, p(y^{(c)})\, dy^{(c)} \qquad (3.1)$$
where g and Ψ are the model parameters, as described previously, and p(·) is used to write probability densities and conditional probability densities. The prior distribution for the hidden factor, $p(y^{(c)})$, is gaussian with mean zero and variance one. The conditional distribution for $x^{(c)}$ given $y^{(c)}$ is gaussian with mean $\mu + g y^{(c)}$ and covariance Ψ, as implied by equation 2.1.

Wake-sleep learning can be viewed as an approximate implementation of the EM algorithm, which Dempster, Laird, and Rubin (1977) present as a general approach to maximum likelihood estimation when some variables (in our case, the values of the hidden factors) are unobserved. Applications of EM to factor analysis are discussed by Dempster et al. (1977) and by Rubin and Thayer (1982), who find that EM produces results almost identical to those of Jöreskog's (1969) method, including getting stuck in essentially the same local maxima given the same starting values for the parameters.¹

EM is an iterative method, in which each iteration consists of two steps. In the E-step, one finds the conditional distribution for the unobserved variables given the observed data, based on the current estimates for the model parameters. When the cases are independent, this conditional distribution factors into distributions for the unobserved variables in each case. For the single-factor model, the distribution for the hidden factor, $y^{(c)}$, in a case with observed data $x^{(c)}$, is

$$p(y^{(c)} \mid x^{(c)}, g, \Psi) = \frac{p(y^{(c)}, x^{(c)} \mid g, \Psi)}{\int p(x^{(c)} \mid y, g, \Psi)\, p(y)\, dy}. \qquad (3.2)$$
1 It may seem surprising that maximum likelihood factor analysis can be troubled by local maxima, in view of the simple linear nature of the model, but this is in fact quite possible. For example, a local maximum can arise if a single-factor model is applied to data consisting of two pairs of correlated variables. The single factor can capture only one of these correlations. If the initial weights are appropriate for modeling the weaker of the two correlations, learning may never find the global maximum in which the single factor is used instead to model the stronger correlation.
In the M-step, one finds new parameter values, g′ and Ψ′, that maximize (or at least increase) the expected log likelihood of the complete data, with the unobserved variables filled in according to the conditional distribution found in the E-step:

$$\sum_{c=1}^{n} \int p(y^{(c)} \mid x^{(c)}, g, \Psi) \log\left[ p(x^{(c)} \mid y^{(c)}, g', \Psi')\, p(y^{(c)}) \right] dy^{(c)}. \qquad (3.3)$$
For factor analysis, the conditional distribution for $y^{(c)}$ in equation 3.2 is gaussian, and the quantity whose expectation with respect to this distribution is required above is quadratic in $y^{(c)}$. This expectation is therefore easily found on a computer. However, the matrix calculations involved require nonlocal operations.

3.2 The Wake-Sleep Approach. To obtain a local learning procedure, we first eliminate the explicit maximization in the M-step of the quantity defined in equation 3.3 in terms of an expectation. We replace this by a gradient-following learning procedure in which the expectation is implicitly found by combining many updates based on values for the $y^{(c)}$ that are stochastically generated from the appropriate conditional distributions. We furthermore avoid the direct computation of these conditional distributions in the E-step by learning to produce them using a recognition model, trained in tandem with the generative model.

This approach results in the wake-sleep learning procedure, with the wake phase playing the role of the M-step in EM and the sleep phase playing the role of the E-step. The names for these phases are metaphorical; we are not proposing neurobiological correlates for the wake and sleep phases. Learning consists of interleaved iterations of these two phases, which in the context of factor analysis operate as follows:

Wake phase: From observed values for the visible variables, x, randomly fill in values for the hidden factors, y, using the conditional distribution defined by the current recognition model. Update the parameters of the generative model to make this filled-in case more likely.

Sleep phase: Without reference to any observation, randomly choose values for the hidden factors, y, from their fixed generative distribution, and then randomly choose "fantasy" values for the visible variables, x, from their conditional distribution given y, as defined by the current parameters of the generative model. Update the parameters of the recognition model to make this fantasy case more likely.

The wake phase of learning will correctly implement the M-step of EM (in a stochastic sense), provided that the recognition model produces the correct conditional distribution for the hidden factors. The aim of sleep phase learning is to improve the recognition model's ability to produce this
conditional distribution, which would be computed in the E-step of EM. If values for the recognition model parameters exist that reproduce this conditional distribution exactly and if learning of the recognition model proceeds at a much faster rate than changes to the generative model, so that this correct distribution can continually be tracked, then the sleep phase will effectively implement the E-step of EM, and the wake-sleep algorithm as a whole will be guaranteed to find a (local) maximum of the likelihood (in the limit of small learning rates, so that stochastic variation is averaged out).

These conditions for wake-sleep learning to mimic EM are too stringent for actual applications, however. At a minimum, we would like wake-sleep learning to work for a wide range of relative learning rates in the two phases, not just in the limit as the ratio of the generative learning rate to the recognition learning rate goes to zero. We would also like wake-sleep learning to do something sensible even when it is not possible for the recognition model to invert the generative model perfectly (as is typical in applications other than factor analysis), though one would not usually expect the method to produce exact maximum likelihood estimates in such a case.

We could guarantee that wake-sleep learning behaves well if we could find a cost function that is decreased (on average) by both the wake phase and the sleep phase updates. The cost would then be a Lyapunov function for learning, providing a guarantee of stability and allowing something to be said about the stable states. Unfortunately, no such cost function has been discovered, for either wake-sleep learning in general or wake-sleep learning applied to factor analysis. The wake and sleep phases each separately reduce a sensible cost function (for each phase, a Kullback-Leibler divergence), but these two cost functions are not compatible with a single global cost function. An algorithm that correctly performs stochastic gradient descent in the recognition parameters of a Helmholtz machine using an appropriate global cost function does exist (Dayan & Hinton, 1996), but it involves reinforcement learning methods for which convergence is usually extremely slow. We have obtained some partial theoretical results concerning wake-sleep learning for factor analysis, which show that the maximum likelihood solutions are second-order Lyapunov stable and that updates of the weights (but not variances) for a single-factor model decrease an appropriate cost function. However, the primary reason for thinking that the wake-sleep algorithm generally works well for factor analysis is not these weak theoretical results but rather the empirical results in section 4. Before presenting these, we discuss in more detail how the general wake-sleep scheme is applied to factor analysis, first for a single-factor model and then for models with multiple factors.

3.3 Wake-Sleep Learning for a Single-Factor Model. A Helmholtz machine that implements factor analysis with a single hidden factor is shown
Figure 1: The one-factor linear Helmholtz machine. Connections of the generative model are shown using solid lines, those of the recognition model using dashed lines. The weights for these connections are given by the gj and the rj .
in Figure 1. The connections for the linear network that implements the generative model of equation 2.1 are shown in the figure using solid lines. Generation of data using this network starts with the selection of a random value for the hidden factor, y, from a gaussian distribution with mean zero and variance one. The value for the jth visible variable, xj , is then found by multiplying y by a connection weight, gj , and adding random noise with variance τj2 . (A bias value, µj , could be added as well, to produce nonzero means for the visible variables.) The recognition model’s connections are shown in Figure 1 using dashed lines. These connections implement equation 2.3. When presented with values for the visible input variables, xj , the recognition network produces a value for the hidden factor, y, by forming a weighted sum of the inputs, with the weight for the jth input being rj , and then adding gaussian noise with mean zero and variance σ 2 . (If the inputs have nonzero means, the recognition model would include a bias for the hidden factor as well.) The parameters of the generative model are learned in the wake phase, as follows. Values for the visible variables, x(c) , are obtained from the external world; they are set to a training case drawn from the distribution that we wish to model. The current version of the recognition network is then used to stochastically fill in a corresponding value for the hidden factor, y(c) , using equation 2.3. The generative weights are then updated using the delta rule,
as follows:

$$g_j' = g_j + \eta\, (x_j^{(c)} - g_j y^{(c)})\, y^{(c)}, \qquad (3.4)$$

where η is a small positive learning rate parameter. The generative variances, $\tau_j^2$, which make up Ψ, are also updated, using an exponentially weighted moving average:

$$(\tau_j^2)' = \alpha\, \tau_j^2 + (1-\alpha)\, (x_j^{(c)} - g_j y^{(c)})^2, \qquad (3.5)$$
where α is a learning rate parameter that is slightly less than one. If the recognition model correctly inverted the generative model, these wake-phase updates would correctly implement the M-step of the EM algorithm (in the limit as η → 0 and α → 1). The update for $g_j$ in equation 3.4 improves the expected log likelihood because the increment to $g_j$ is proportional to the derivative of the log likelihood for the training case with y filled in, which is as follows:

$$\frac{\partial}{\partial g_j} \log p(x^{(c)}, y^{(c)} \mid g, \Psi) = \frac{\partial}{\partial g_j} \left( -\frac{1}{2\tau_j^2}\, (x_j^{(c)} - g_j y^{(c)})^2 \right) \qquad (3.6)$$

$$= \frac{1}{\tau_j^2}\, (x_j^{(c)} - g_j y^{(c)})\, y^{(c)}. \qquad (3.7)$$
The averaging operation by which the generative variances, $\tau_j^2$, are learned will (for α close to one) also lead toward the maximum likelihood values based on the filled-in values for y.

The parameters of the recognition model are learned in the sleep phase, based not on real data but on "fantasy" cases, produced using the current version of the generative model. Values for the hidden factor, $y^{(f)}$, and the visible variables, $x^{(f)}$, in a fantasy case are stochastically generated according to equation 2.1, as described at the beginning of this section. The connection weights for the recognition model are then updated using the delta rule, as follows:

$$r_j' = r_j + \eta\, (y^{(f)} - r^T x^{(f)})\, x_j^{(f)}, \qquad (3.8)$$

where η is again a small positive learning rate parameter, which might or might not be the same as that used in the wake phase. The recognition variance, $\sigma^2$, is updated as follows:

$$(\sigma^2)' = \alpha\, \sigma^2 + (1-\alpha)\, (y^{(f)} - r^T x^{(f)})^2, \qquad (3.9)$$

where α is again a learning rate parameter slightly less than one.
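For concreteness, the complete single-factor procedure (equations 3.4, 3.5, 3.8, and 3.9) amounts to the following minimal sketch (ours, not the authors' code; it assumes zero-mean data, so bias parameters are omitted):

```python
import numpy as np

def wake_sleep_fa(X, n_pres=2_000_000, eta=0.0002, alpha=0.999, rng=None):
    """Minimal sketch of wake-sleep factor analysis with one hidden factor,
    following equations 3.4, 3.5, 3.8, and 3.9 (zero-mean data assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    g, tau2 = np.zeros(p), np.ones(p)    # generative weights and variances
    r, sigma2 = np.zeros(p), 1.0         # recognition weights and variance

    for it in range(n_pres):
        # Wake phase: recognition model fills in y for a real case.
        x = X[it % n]
        y = r @ x + np.sqrt(sigma2) * rng.standard_normal()
        resid = x - g * y
        g = g + eta * resid * y                         # equation 3.4
        tau2 = alpha * tau2 + (1 - alpha) * resid**2    # equation 3.5

        # Sleep phase: generative model produces a fantasy case.
        y_f = rng.standard_normal()
        x_f = g * y_f + np.sqrt(tau2) * rng.standard_normal(p)
        err = y_f - r @ x_f
        r = r + eta * err * x_f                         # equation 3.8
        sigma2 = alpha * sigma2 + (1 - alpha) * err**2  # equation 3.9

    return g, tau2, r, sigma2
```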
These updates are analogous to those made in the wake phase and have the effect of improving the recognition model's ability to produce the correct distribution for the hidden factor. However, as noted in section 3.2, the criterion by which the sleep phase updates "improve" the recognition model does not correspond to a global Lyapunov function for wake-sleep learning as a whole, and we therefore lack a theoretical guarantee that the wake-sleep procedure will be stable and will converge to a maximum of the likelihood, though the experiments in section 4 indicate that it usually does.

3.4 Wake-Sleep Learning for Multiple-Factor Models. A Helmholtz machine with more than one hidden factor can be trained using the wake-sleep algorithm in much the same way as described above for a single-factor model. The generative model is simply extended by including more than one hidden factor. In the most straightforward case, the values of these factors are chosen independently at random when a fantasy case is generated. However, a new issue arises with the recognition model. When there are k hidden factors and p visible variables, the recognition model has the following form (assuming zero means):

$$y = R x + \nu, \qquad (3.10)$$
where the k × p matrix R contains the weights on the recognition connections, and ν is a k-dimensional gaussian random vector with mean 0 and covariance matrix Σ, which in general (i.e., for some arbitrary generative model) will not be diagonal. Generation of a random ν with such covariance can easily be done on a computer using the Cholesky decomposition of Σ, but in a method intended for consideration as a neurobiological model, we would prefer a local implementation that produces the same effect.

One way of producing arbitrary covariances between the hidden factors is to include a set of connections in the recognition model that link each hidden factor to hidden factors that come earlier (in some arbitrary ordering), as sketched in the code below. A Helmholtz machine with this architecture is shown in Figure 2. During the wake phase, values for the hidden factors are filled in sequentially by random generation from gaussian distributions. The mean of the distribution used in picking a value for factor $y_i$ is the weighted sum of inputs along connections from the visible variables and from the hidden factors whose values were chosen earlier. The variance of the distribution for factor $y_i$ is an additional recognition model parameter, $\sigma_i^2$. The k(k−1)/2 connections between the hidden factors and the k variances associated with these factors together make up the k(k+1)/2 independent degrees of freedom in Σ. Accordingly, a recognition model of this form exists that can perfectly invert any generative model.²

² Another way of seeing this is to note that the joint distribution for the visible variables and hidden factors is multivariate gaussian. From general properties of multivariate gaussians, the conditional distribution for a hidden factor given values for the visible variables and for the earlier hidden factors must also be gaussian, with some constant variance and with a mean given by some linear function of the other values.
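A minimal sketch of this sequential filling-in (ours; the names R, W, and sigma2 are hypothetical, with W holding the strictly lower-triangular correlation-inducing weights) might look as follows:

```python
import numpy as np

def sample_hidden_factors(x, R, W, sigma2, rng):
    """Fill in k hidden factors sequentially. R is the k x p matrix of
    recognition weights from the visible variables; W is a strictly
    lower-triangular k x k matrix of correlation-inducing weights from
    earlier factors; sigma2 holds the k recognition variances."""
    k = R.shape[0]
    y = np.zeros(k)
    base = R @ x
    for i in range(k):  # factors are ordered; y_i sees y_0 .. y_{i-1}
        mean = base[i] + W[i, :i] @ y[:i]
        y[i] = mean + np.sqrt(sigma2[i]) * rng.standard_normal()
    return y
```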
Figure 2: A Helmholtz machine implementing a model with three hidden factors, using a recognition model with a set of correlation-inducing connections. The figure omits some of the generative connections from hidden factors to visible variables and some of the recognition connections from visible variables to hidden factors (indicated by the ellipses). Note, however, that the only connections between hidden factors are those shown, which are part of the recognition model. There are no generative connections between hidden factors.
This method is reminiscent of some of the proposals to allow Hebbian learning to extract more than one principal component (e.g., Sanger, 1989; Plumbley, 1993); various of these also order the “hidden” units. In these proposals, the connections are used to remove correlations between the units so that they can represent different facets of the input. However, for the Helmholtz machine, the factors come to represent different facets of the input because they are jointly rather than separately engaged in capturing the statistical regularities in the input. The connections between the hidden units capture the correlations in the distribution for the hidden factors con-
ditional on given values for the visible variables that may be induced by this joint responsibility for modeling the visible variables.

Another approach is possible if the covariance matrix for the hidden factors, y, in the generative model is rotationally symmetric, as is the case when it is the identity matrix, as we have assumed so far. We may then freely rotate the space of factors ($y' = U y$, where U is a unitary matrix), making corresponding changes to the generative weights ($G' = G U^T$), without changing the distribution of the visible variables. When the generative model is rotated, the corresponding recognition model will also rotate. There always exist rotations in which the recognition covariance matrix, Σ, is diagonal and can therefore be represented by just k variances, $\sigma_i^2$, one for each factor. This amounts to forcing the factors to be independent, or factorial, which has itself long been suggested as a goal for early cortical processing (Barlow, 1989) and has generally been assumed in nonlinear versions of the Helmholtz machine. We can therefore hope to learn multiple-factor models using wake-sleep learning in exactly the same way as we learn single-factor models, with equations 2.4 and 3.10 specifying how the values of all the factors y combine to predict the input x, and vice versa. As seen in the next section, such a Helmholtz machine with only the capacity to represent k recognition variances, with no correlation-inducing connections, is usually able to find a rotation in which such a recognition model is sufficient to invert the generative model.

Note that there is no counterpart in Hebbian learning of this second approach, in which there are no connections between hidden factors. If Hebbian units are not connected in some manner, they will all extract the same single principal component of the inputs.

4 Empirical Results

We have run a number of experiments on synthetic data in order to test whether the wake-sleep algorithm applied to factor analysis finds parameter values that at least locally maximize the likelihood, in the limit of small values for the learning rates. These experiments also provide data on how small the learning rates must be in practice and reveal situations in which learning (particularly of the uniquenesses) can be relatively slow. We also report in section 4.4 the results of applying the wake-sleep algorithm to a real data set used by Everitt (1984). Away from conditions in which the maximum likelihood model is not well specified, the wake-sleep algorithm performs quite competently.

4.1 Experimental Procedure. All the systematic experiments described were done with randomly generated synthetic data. Models for various numbers (p) of visible variables, using various numbers (k) of hidden factors, were tested. For each such model, 10 sets of model parameters were generated, each of which was used to generate 2 sets of training data, the
first with 10 cases and the second with 500 cases. Models with the same p and k were then learned from these training sets using the wake-sleep algorithm.

The models used to generate the data were randomly constructed as follows. First, initial values for generative variances (the "uniquenesses") were drawn independently from exponential distributions with mean one, and initial values for the generative weights (the "factor loadings") were drawn independently from gaussian distributions with mean zero and variance one. These parameters were then rescaled so as to produce a variance of one for each visible variable; that is, for a single-factor model, the new generative variances were set to $\tau_j'^2 = f_j^2 \tau_j^2$ and the new generative weights to $g_j' = f_j g_j$, with the $f_j$ chosen such that the new variances for the visible variables, $\tau_j'^2 + g_j'^2$, were equal to one. A sketch of this construction appears below.
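The following code is ours, not the authors'; it generalizes the single-factor rescaling just described to k factors by normalizing each row of the loading matrix together with its uniqueness:

```python
import numpy as np

def random_fa_model(p, k, rng):
    """Random generative model: uniquenesses ~ Exp(mean 1), loadings ~
    N(0, 1), then rescaled so every visible variable has variance one."""
    tau2 = rng.exponential(1.0, p)
    G = rng.standard_normal((p, k))
    f = 1.0 / np.sqrt(tau2 + (G**2).sum(axis=1))  # per-variable scale
    return G * f[:, None], tau2 * f**2

G, tau2 = random_fa_model(p=6, k=1, rng=np.random.default_rng(1))
print(tau2 + (G**2).sum(axis=1))  # each entry is 1 (up to rounding)
```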
In all experiments, the wake-sleep learning procedure was started with the generative and recognition variances set to one and the generative and recognition weights and biases set to zero. Symmetry is broken by the stochastic nature of the learning procedure. Learning was done online, with cases from the training set being presented one at a time, in a fixed sequence. The learning rates for both generative and recognition models were usually set to η = 0.0002 and α = 0.999, but other values of η and α were investigated as well, as described below. In the models used to generate the data, the variables all had mean zero and variance one, but this knowledge was not built into the learning procedure. Bias parameters were therefore present, allowing the network to learn the means of the visible variables in the training data (which were close to zero, but not exactly zero, due to the use of a finite training set).

We compared the estimates produced by the wake-sleep algorithm with the maximum likelihood estimates based on the same training sets produced using the "factanal" function in S-Plus (Version 3.3, Release 1), which uses a modification of Jöreskog's (1977) method. We also implemented the EM factor analysis algorithm to examine in more detail cases in which wake-sleep disagreed with S-Plus. As with most stochastic learning algorithms, using fixed learning rates implies that convergence can at best be to a distribution of parameter values that is highly concentrated near some stable point. One would in general have to reduce the learning rates over time to ensure stronger forms of convergence. The software used in these experiments may be obtained over the Internet.³

³ Follow the links from the first author's home page: http://www.cs.utoronto.ca/~radford/

4.2 Experiments with Single-Factor Models. We first tried using the wake-sleep algorithm to learn single-factor models for three (p = 3) and six (p = 6) visible variables. Figure 3 shows the progress of learning for a
Figure 3: Wake-sleep learning of a single-factor model for six variables. The graphs show the progress of the generative variances and weights over the course of 2 million presentations of input vectors, drawn sequentially from a training set of size 500, with learning parameters of η = 0.0002 and α = 0.999 being used for both the wake and the sleep phases. The training set was randomly generated from a single-factor model whose parameters were picked at random, as described in section 4.1. The maximum likelihood parameter estimates found by S-Plus are shown as horizontal lines.
typical run with p = 6, applied to a training set of size 500. The figure shows progress over 2 million presentations of input vectors (4000 presentations of each of the 500 training cases). Both the generative variances and the generative weights are seen to converge to the maximum likelihood estimates found by S-Plus, with a small amount of random variation, as is expected with a stochastic learning procedure. Of the 20 runs with p = 6 (based on data generated from 10 random models, with training sets of size 10 and 500), all but one showed similar convergence to the S-Plus estimates within 3 million presentations (and usually much earlier). The remaining run (on a training set of size 10) converged to a different local maximum, which S-Plus found when its maximization routine was started from the values found by wake-sleep. Convergence to maximum likelihood estimates was sometimes slower when there were only three variables (p = 3). This is the minimum number of visible variables for which the single-factor model is identifiable (that is, for which the true values of the parameters can be found given enough data, apart from an ambiguity in the overall signs of the weights). Three of the 10 runs with training sets of size 10 and one of the 10 runs with training sets of size 500 failed to converge clearly within 3 million presentations,
but convergence was seen when these runs were extended to 10 million presentations (in one case to a different local maximum than initially found by S-Plus). The slowest convergence was for a training set of size 10 for which the maximum likelihood estimate for one of the generative weights was close to zero (0.038), making the parameters nearly unidentifiable.

All the runs were done with learning rates of η = 0.0002 and α = 0.999, for both the generative and recognition models. Tests on the training sets with p = 3 were also done with learning for the recognition model slowed to η = 0.00002 and α = 0.9999, a situation that one might speculate would cause problems, due to the recognition model's not being able to keep up with the generative model. No problems were seen (though the learning was, of course, slower).

4.3 Experiments with Multiple-Factor Models. We have also tried using the wake-sleep algorithm to learn models with two and three hidden factors (k = 2 and k = 3) from synthetic data generated as described in section 4.1. Systematic experiments were done for k = 2, p = 5, using models with and without correlation-inducing recognition weights, and for k = 2, p = 8 and k = 3, p = 9, in both cases using models with no correlation-inducing recognition weights. For most of the experiments, learning was done with η = 0.0002 and α = 0.999.

The behavior of wake-sleep learning with k = 2 and p = 5 was very similar for the model with correlation-inducing recognition connections and the model without such connections. The runs on most of the 20 data sets converged within the 6 million presentations that were initially performed; a few required more iterations before convergence was apparent. For two data sets (both of size 10), wake-sleep learning converged to different local maxima than S-Plus. For one training set of size 500, the maximum likelihood estimate for one of the generative variances is very close to one, making the model almost unidentifiable (p = 5 being the minimum number of visible variables for identifiability with k = 2); this produced an understandable difficulty with convergence, though the wake-sleep estimates still agreed fairly closely with the maximum likelihood values found by S-Plus.

In contrast with these good results, a worrying discrepancy arose with one of the training sets of size 10, for which the two smallest generative variances found using wake-sleep learning differed somewhat from the maximum likelihood estimates found by S-Plus (one of which was very close to zero). When S-Plus was started at the values found using wake-sleep, it did not find a similar local maximum; it simply found the same estimates as it had found with its default initial values. However, when the full EM algorithm was started at the estimates found by wake-sleep, it barely changed the parameters for many iterations. One explanation could be that there is a local maximum in this vicinity, but it is for some reason not found by S-Plus. Another possible explanation could be that the likelihood is extremely flat in this region. In neither case would the discrepancy be cause
for much worry regarding the general ability of the wake-sleep method to learn these multiple-factor models. However, it is also possible that this is an instance of the problem that arose more clearly in connection with the Everitt crime data, as reported in section 4.4.

Runs using models without correlation-inducing recognition connections were also performed for data sets with k = 2 and p = 8. All runs converged, most within 6 million presentations, in one case to a different local maximum than S-Plus. We also tried runs with the same model and data sets using higher learning rates (η = 0.002 and α = 0.99). The higher learning rates produced both higher variability and some bias in the parameter estimates, but for most data sets the results were still generally correct.

Finally, runs using models without correlation-inducing recognition connections were performed for data sets with k = 3 and p = 9. Most of these converged fine. However, for two data sets (both of size 500), a small but apparently real difference was seen between the estimates for the two smallest generative variances found using wake-sleep and the maximum likelihood estimates found using S-Plus. As was the case with the similar situation with k = 2, p = 5, S-Plus did not converge to a local maximum in this vicinity when started at the wake-sleep estimates. As before, however, the EM algorithm moved very slowly when started at the wake-sleep estimates, which is one possible explanation for wake-sleep having apparently converged to this point. However, it is also possible that something more fundamental prevented convergence to a local maximum of the likelihood, as discussed in connection with similar results in the next section.

4.4 Experiments with the Everitt Crime Data. We also tried learning a two-factor model for a data set used as an example by Everitt (1984), in which the visible variables are the rates for seven types of crime, with the cases being 16 American cities. The same learning procedure (with η = 0.0002 and α = 0.999) was used as in the experiments above, except that the visible variables were normalized to have mean zero and variance one, and bias parameters were accordingly omitted from both the generative and recognition models.

Fifteen runs with different random number seeds were done, for both models with a correlation-inducing recognition connection between the two factors and without such a connection. All of these runs produced results fairly close to the maximum likelihood estimates found by S-Plus (which match the results of Everitt). In some runs, however, there were small, but clearly real, discrepancies, most notably in the smallest two generative variances, for which the wake-sleep estimates were sometimes nearly zero, whereas the maximum likelihood estimate is zero for only one of them. This behavior is similar to that seen in the three runs where discrepancies were found in the systematic experiments of section 4.3. These discrepancies arose much more frequently when a correlation-inducing recognition connection was not present (14 out of 15 runs) than
when such a connection was included (6 out of 15 runs). Figure 4a shows one of the runs with discrepancies for a model with a correlation-inducing connection present. Figure 4b shows a run differing from that of 4a only in its random seed, but which did find the maximum likelihood estimates. The sole run in which the model without a correlation-inducing recognition connection converged to the maximum likelihood estimates is shown in Figure 4c. Close comparison of Figures 4b and 4c shows that even in this run, convergence was much more labored without the correlation-inducing connection. (In particular, the generative variance for variable 6 approaches zero rather more slowly.)

According to S-Plus, the solution found in the discrepant runs is not an alternative local maximum. Furthermore, extending the runs to many more iterations or reducing the learning rates by a large factor does not eliminate the problem. This makes it seem unlikely that the problem is merely slow convergence due to the likelihood being nearly flat in this vicinity, although, on the other hand, when EM is started at the discrepant wake-sleep estimates, its movement toward the maximum likelihood estimates is rather slow (after 200 iterations, the estimate for the generative variance for variable 4 had moved only about a third of the way from the wake-sleep estimate to the maximum likelihood estimate). It seems most likely therefore that the runs with discrepancies result from a local "basin of attraction" for wake-sleep learning that does not lead to a local maximum of the likelihood.

The discrepancies can be eliminated for the model with a correlation-inducing connection in either of two ways. One way is to reduce the learning rate for the generative parameters (in the wake phase), while leaving the recognition learning rate unchanged. As discussed in section 3.2, there is a theoretical reason to think that this method will lead to the maximum likelihood estimates, since the recognition model will then have time to learn how to invert the generative model perfectly. When the generative learning rates are reduced by setting η = 0.00005 and α = 0.99975, the maximum likelihood estimates are indeed found in eight of eight test runs. The second solution is to impose a constraint that prevents the generative variances from falling below 0.01. This also worked in eight of eight runs. However, these two methods produce little or no improvement when the correlation-inducing connection is omitted from the recognition model (no successes in eight runs with smaller generative learning rates; three successes in eight runs with a constraint on the generative variances). Thus, although models without correlation-inducing connections often work well, it appears that learning is sometimes easier and more reliable when they are present.

Finally, we tried learning a model for this data without first normalizing the visible variables to have mean zero and variance one, but with biases included in the generative and recognition models to handle the nonzero means. This is not a very sensible thing to do; the means of the variables in this data set are far from zero, so learning would at best take a long time, while the biases slowly adjusted. In fact, however, wake-sleep learning fails
Figure 4: Wake-sleep learning of two-factor models for the Everitt crime data. The horizontal axis shows the number of presentations in millions. The vertical axis shows the generative variances, with the maximum likelihood values indicated by horizontal lines. (a) One of the six runs in which a model with a correlation-inducing recognition connection did not converge to the maximum likelihood estimates. (b) One of the nine such runs that did find the maximum likelihood estimates. (c) The only run in which the maximum likelihood estimates were found using a model without a correlation-inducing connection.
rather spectacularly. The generative variances immediately become quite large. As soon as the recognition weights depart significantly from zero, the recognition variances also become quite large, at which point positive feedback ensues, and all the weights and variances diverge. The instability that can occur once the generative and recognition weights and variances are too large appears to be a fundamental aspect of wake-sleep dynamics. Interestingly, however, it seems that the dynamics may operate to avoid this unstable region of parameter space, since the instability does not occur with the Everitt data if the learning rate is set very low. With a larger learning rate, however, the stochastic aspect of the learning can produce a jump into the unstable region.

5 Discussion

We have shown empirically that wake-sleep learning, which involves nothing more than two simple applications of the local delta rule, can be used to implement the statistical technique of maximum likelihood factor analysis. However, just as it is usually computationally more efficient to implement principal component analysis using a standard matrix technique such as singular-value decomposition rather than by using Hebbian learning, factor analysis is probably better implemented on a computer using either EM (Rubin & Thayer, 1982) or the second-order Newton methods of Jöreskog (1967, 1969, 1977) than by the wake-sleep algorithm. In our view, wake-sleep factor analysis is interesting as a simple and successful example of wake-sleep learning and as a possible model of activity-dependent plasticity in the cortex.

5.1 Implications for Wake-Sleep Learning. The experiments in section 4 show that, in most situations, wake-sleep learning applied to factor analysis leads to estimates that (locally) maximize the likelihood, when the learning rate is set to be small enough. In a few situations, the parameter estimates found using wake-sleep deviated slightly from the maximum likelihood values. More work is needed to determine exactly when and why this occurs, but even in these situations, the estimates found by wake-sleep are reasonably good and the likelihoods are close to their maxima. The sensitivity of learning to settings of parameters such as learning rates appears no greater than is typical for stochastic gradient descent methods. In particular, setting the learning rate for the generative model to be equal to or greater than that for the recognition model did not lead to instability, even though this is the situation in which one might worry that the algorithm would no longer mimic EM (as discussed in section 3.2). However, we have been unable to prove in general that the wake-sleep algorithm for factor analysis is guaranteed to converge to the maximum likelihood solution. Indeed, the empirical results show that any general theoretical proof of correctness would have to contain caveats regarding unstable regions of the parameter
space. Such a proof may be impossible if the discrepancies seen in the experiments are indeed due to a false basin of attraction (rather than being an effect of finite learning rates). Despite the lack so far of strong theoretical results, the relatively simple factor analysis model may yet provide a good starting point for a better theoretical understanding of wake-sleep learning for more complex models. One empirical finding was that it is possible to learn multiple-factor models even when correlation-inducing recognition connections are absent, although the lack of such connections did cause difficulties in some cases. A better understanding of what is going on in this respect would provide insight into wake-sleep learning for more complex models, in which the recognition model will seldom be able to invert the generative model perfectly. Such a complex generative model may allow the data to be modeled in many nearly equivalent ways, for some of which the generative model is harder to invert than for others. Good performance may sometimes be possible only if wake-sleep learning favors the more easily inverted generative models. We have seen that this does usually happen for factor analysis.

5.2 Implications for Activity-Dependent Plasticity. Much progress has been made using Hebbian learning to model the development of structures in early cortical areas, such as topographic maps (Willshaw & von der Malsburg, 1976, 1979; von der Malsburg & Willshaw, 1977), ocular dominance stripes (Miller et al., 1989), and orientation domains (Linsker, 1988; Miller, 1994). However, Hebbian learning suffers from three problems that the equally local wake-sleep algorithm avoids. First, because of the positive feedback inherent in Hebbian learning, some form of synaptic normalization is required to make it work (Miller & MacKay, 1994). There is no evidence for synaptic normalization in the cortex. This is not an issue for Helmholtz machines because building a statistical model of the inputs involves negative feedback instead, as in the delta rule. Second, in order for Hebbian learning to produce several outputs that represent more than just the first principal component of a collection of inputs, there must be connections of some sort between the output units, which force them to be decorrelated. Typically, some form of anti-Hebbian learning is required for these connections (see Sanger, 1989; Földiák, 1989; Plumbley, 1993). The common alternative of using fixed lateral connections between the output units is not informationally efficient. In the Helmholtz machine, the goal of modeling the distribution of the inputs forces output units to differentiate rather than perform the same task. This goal also supplies a clear interpretation in terms of finding the hidden causes of the input. Third, Hebbian learning does not accord any role to the prominent cortical feature that top-down connections always accompany bottom-up connections. In the Helmholtz machine, these top-down connections play a
crucial role in wake-sleep learning. Furthermore, the top-down connections come to embody a hierarchical generative model of the inputs, some form of which appears necessary in any case to combine top-down and bottom-up processing during inference. Of course, the Helmholtz machine suffers from various problems itself. Although learning rules equivalent to the delta rule are conventional in classical conditioning (Rescorla & Wagner, 1972; Sutton & Barto, 1981) and have also been suggested as underlying cortical plasticity (Montague & Sejnowski, 1994), it is not clear how cortical microcircuitry might construct the required predictions (such as g_j y^(c) of equation 3.4) or prediction errors (such as x_j^(c) − g_j y^(c) of the same equation). The wake-sleep algorithm also requires two phases of activation, with different connections being primarily responsible for driving the cells in each phase. Although there is some suggestive evidence of this (Hasselmo & Bower, 1993), it has not yet been demonstrated. There are also alternative developmental methods that construct top-down statistical models of their inputs using only one, slightly more complicated, phase of activation (Olshausen & Field, 1996; Rao & Ballard, 1995). In Hebbian models of activity-dependent development, a key role is played by lateral local intracortical connections (longer-range excitatory connections are also present but develop later). Lateral connections are not present in the linear Helmholtz machines we have so far described, but they could play a role in inducing correlations between the hidden factors in the generative model, in the recognition model, or in both, perhaps replacing the correlation-inducing connections shown in Figure 2. Purely linear models, such as those presented here, will not suffice to explain fully the intricacies of information processing in the cortical hierarchy. However, understanding how a factor analysis model can be learned using simple and local operations is a first step to understanding how more complex statistical models can be learned using more complex architectures involving nonlinear elements and multiple layers.

Acknowledgments

We thank Geoffrey Hinton for many useful discussions. This research was supported by the Natural Sciences and Engineering Research Council of Canada and by grant R29 MH55541-01 from the National Institute of Mental Health of the United States. All opinions expressed are those of the authors.

References

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9, 1385–1403.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC (Vol. I, pp. 401–405).
Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Hasselmo, M. E., & Bower, J. M. (1993). Acetylcholine and memory. Trends in Neurosciences, 16, 218–222.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hinton, G. E., Dayan, P., Frey, B., & Neal, R. M. (1995). The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158–1160.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443–482.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.
Jöreskog, K. G. (1977). Factor analysis by least-squares and maximum-likelihood methods. In K. Enslein, A. Ralston, & H. S. Wilf (Eds.), Statistical methods for digital computers. New York: Wiley.
Linsker, R. (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, 83, 7508–7512, 8390–8394, 8779–8783.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–128.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. Journal of Neuroscience, 14, 409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learning and Memory, 1, 1–33.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. Davis (Eds.), Large-scale theories of the cortex (pp. 125–152). Cambridge, MA: MIT Press.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Plumbley, M. D. (1993). Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6, 823–833.
Rao, P. N. R., & Ballard, D. H. (1995). Dynamic model of visual memory predicts neural response properties in the visual cortex (Technical Rep. No. 95.4). Rochester, NY: Department of Computer Science, University of Rochester.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–69). New York: Appleton-Century-Crofts.
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
von der Malsburg, C., & Willshaw, D. J. (1977). How to label nerve cells so that they can interconnect in an ordered fashion. Proceedings of the National Academy of Sciences, 74, 5176–5178.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organisation. Proceedings of the Royal Society of London B, 194, 431–445.
Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society B, 287, 203–243.

Received August 7, 1996; accepted April 3, 1997.
Communicated by Joachim Buhmann
Data Clustering Using a Model Granular Magnet

Marcelo Blatt, Shai Wiseman, and Eytan Domany
Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures, it is completely ordered; all spins are aligned. At very high temperatures, the system does not exhibit any ordering, and in an intermediate regime, clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins and the corresponding data points into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method.

1 Introduction

In recent years there has been significant interest in adapting numerical (Kirkpatrick, Gelatt, & Vecchi, 1983) and analytic (Fu & Anderson, 1986; Mézard & Parisi, 1986) techniques from statistical physics to provide algorithms and estimates for good approximate solutions to hard optimization problems (Yuille & Kosowsky, 1994). In this article we formulate the problem of data clustering as that of measuring equilibrium properties of an inhomogeneous Potts model. We are able to give good clustering solutions by solving the physics of this model. Cluster analysis is an important technique in exploratory data analysis, where a priori knowledge of the distribution of the observed data is not available (Duda & Hart, 1973; Jain & Dubes, 1988). Partitional clustering methods, which divide the data according to natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications, among them pattern recognition (Duda & Hart, 1973), learning theory (Moody & Darken, 1989), astrophysics (Dekel & West, 1985), medical
imaging (Suzuki, Shibata, & Suto, 1995) and data processing (Phillips et al., 1995), machine translation of text (Cranias, Papageorgiou, & Piperdis, 1994), image compression (Karayiannis, 1994), satellite data analysis (Baraldi & Parmiggiani, 1995), automatic target recognition (Iokibe, 1994), and speech recognition (Kosaka & Sagayama, 1994) and analysis (Foote & Silverman, 1994). The goal is to find a partition of a given data set into several compact groups. Each group indicates the presence of a distinct category in the measurements. The problem of partitional clustering can be formally stated as follows. Determine the partition of N given patterns {v_i}, i = 1, . . . , N, into groups, called clusters, such that the patterns of a cluster are more similar to each other than to patterns in different clusters. It is assumed that either d_ij, the measure of dissimilarity between patterns v_i and v_j, is provided or that each pattern v_i is represented by a point x_i in a D-dimensional metric space, in which case d_ij = ‖x_i − x_j‖. The two main approaches to partitional clustering are called parametric and nonparametric. In parametric approaches some knowledge of the clusters' structure is assumed, and in most cases patterns can be represented by points in a D-dimensional metric space. For instance, each cluster can be parameterized by a center around which the points that belong to it are spread with a locally gaussian distribution. In many cases the assumptions are incorporated in a global criterion whose minimization yields the "optimal" partition of the data. The goal is to assign the data points so that the criterion is minimized. Classical approaches are variance minimization, maximal likelihood, and fitting gaussian mixtures. A nice example of variance minimization is the method proposed by Rose, Gurewitz, and Fox (1990) based on principles of statistical physics, which ensures an optimal solution under certain conditions. This work gave rise to other mean field methods for clustering data (Buhmann & Kühnel, 1993; Wong, 1993; Miller & Rose, 1996). Classical examples of fitting gaussian mixtures are the Isodata algorithm (Ball & Hall, 1967) or its sequential relative, the K-means algorithm (MacQueen, 1967) in statistics, and soft competition in neural networks (Nowlan & Hinton, 1991). In many cases of interest, however, there is no a priori knowledge about the data structure. Then it is more natural to adopt nonparametric approaches, which make fewer assumptions about the model and therefore are suitable to handle a wider variety of clustering problems. Usually these methods employ a local criterion, against which some attribute of the local structure of the data is tested, to construct the clusters. Typical examples are hierarchical techniques such as the agglomerative and divisive methods (see Jain & Dubes, 1988). These algorithms suffer, however, from at least one of the following limitations: high sensitivity to initialization, poor performance when the data contain overlapping clusters, or an inability to handle variabilities in cluster shapes, cluster densities, and cluster sizes. The most serious problem is the lack of cluster validity criteria; in particular, none of
these methods provides an index that could be used to determine the most significant partitions among those obtained in the entire hierarchy. All of these algorithms tend to create clusters even when no natural clusters exist in the data. We recently introduced a new approach to clustering, based on the physical properties of a magnetic system (Blatt, Wiseman, & Domany, 1996a, 1996b, 1996c). This method has a number of rather unique advantages: it provides information about the different self-organizing regimes of the data; the number of "macroscopic" clusters is an output of the algorithm; and hierarchical organization of the data is reflected in the manner the clusters merge or split when a control parameter (the physical temperature) is varied. Moreover, the results are completely insensitive to the initial conditions, and the algorithm is robust against the presence of noise. The algorithm is computationally efficient; equilibration time of the spin system scales with N, the number of data points, and is independent of the embedding dimension D. In this article we extend our work by demonstrating the efficiency and performance of the algorithm on various real-life problems. Detailed comparisons with other nonparametric techniques are also presented. The outline of the article is as follows. The magnetic model and thermodynamic definitions are introduced in section 2. A very efficient Monte Carlo method used for calculating the thermodynamic quantities is presented in section 3. The clustering algorithm is described in section 4. In section 5 we analyze synthetic and real data to demonstrate the main features of the method and compare its performance with other techniques.

2 The Potts Model

Ferromagnetic Potts models have been studied extensively for many years (see Wu, 1982, for a review). The basic spin variable s can take one of q integer values: s = 1, 2, . . . , q. In a magnetic model the Potts spins are located at points v_i that reside on (or off) the sites of some lattice. Pairs of spins associated with points i and j are coupled by an interaction of strength J_ij > 0. Denote by S a configuration of the system, S = {s_i}, i = 1, . . . , N. The energy of such a configuration is given by the Hamiltonian

H(S) = \sum_{\langle i,j \rangle} J_{ij} \left( 1 - \delta_{s_i, s_j} \right), \qquad s_i = 1, \ldots, q,    (2.1)

where the notation ⟨i, j⟩ stands for neighboring sites v_i and v_j. The contribution of a pair ⟨i, j⟩ to H is 0 when s_i = s_j, that is, when the two spins are aligned, and is J_ij > 0 otherwise. If one chooses interactions that are a decreasing function of the distance d_ij ≡ d(v_i, v_j), then the closer two points are to each other, the more they "like" to be in the same state. The Hamiltonian (see equation 2.1) is very similar to other energy functions used in neural
systems, where each spin variable represents a q-state neuron with an excitatory coupling to its neighbors. In fact, magnetic models have inspired many neural models (see, for example, Hertz, Krogh, & Palmer, 1991). In order to calculate the thermodynamic average of a physical quantity A at a fixed temperature T, one has to calculate the sum

\langle A \rangle = \sum_{S} A(S) \, P(S),    (2.2)

where the Boltzmann factor,

P(S) = \frac{1}{Z} \exp\left( -\frac{H(S)}{T} \right),    (2.3)
plays the role of the probability density, which gives the statistical weight of each spin configuration S = {s_i}, i = 1, . . . , N, in thermal equilibrium, and Z is a normalization constant, Z = \sum_S \exp(-H(S)/T). Some of the most important physical quantities A for this magnetic system are the order parameter or magnetization and the set of δ_{s_i,s_j} functions, because their thermal averages reflect the ordering properties of the model. The order parameter of the system is ⟨m⟩, where the magnetization, m(S), associated with a spin configuration S is defined (Chen, Ferrenberg, & Landau, 1992) as

m(S) = \frac{q \, N_{\max}(S) - N}{(q - 1) \, N}    (2.4)

with

N_{\max}(S) = \max \{ N_1(S), N_2(S), \ldots, N_q(S) \},

where N_µ(S) is the number of spins with the value µ; N_µ(S) = \sum_i \delta_{s_i, µ}. The thermal average of δ_{s_i,s_j} is called the spin-spin correlation function,

G_{ij} = \langle \delta_{s_i, s_j} \rangle,    (2.5)

which is the probability of the two spins s_i and s_j being aligned. When the spins are on a lattice and all nearest-neighbor couplings are equal, J_ij = J, the Potts system is homogeneous. Such a model exhibits two phases. At high temperatures the system is paramagnetic or disordered, ⟨m⟩ = 0, indicating that N_max(S) ≈ N/q for all statistically significant configurations. In this phase the correlation function G_ij decays to 1/q when the distance between points v_i and v_j is large; this is the probability of finding two completely independent Potts spins in the same state. At very high temperatures even neighboring sites have G_ij ≈ 1/q.
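To make these definitions concrete, the following minimal sketch (our own illustration, not part of the original article; all names and the toy data are ours) evaluates the Hamiltonian of equation 2.1 and the magnetization of equation 2.4 for a single spin configuration:

import numpy as np

# Illustrative sketch: evaluate equations 2.1 and 2.4 for one configuration.
def hamiltonian(s, pairs, J):
    # H(S) = sum over neighbor pairs <i,j> of J_ij * (1 - delta(s_i, s_j))
    i, j = pairs[:, 0], pairs[:, 1]
    return float(np.sum(J * (s[i] != s[j])))

def magnetization(s, q):
    # m(S) = (q * N_max(S) - N) / ((q - 1) * N), equation 2.4
    counts = np.bincount(s, minlength=q)
    n = s.size
    return (q * counts.max() - n) / ((q - 1) * n)

# Toy usage: 5 spins, q = 20 states, two coupled neighbor pairs.
q = 20
s = np.array([0, 0, 3, 3, 7])
pairs = np.array([[0, 1], [2, 3]])
J = np.array([1.0, 0.5])
print(hamiltonian(s, pairs, J), magnetization(s, q))

Averaging m(S) and δ_{s_i,s_j} over configurations drawn from the Boltzmann distribution (equation 2.3) then yields ⟨m⟩ and G_ij.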
As the temperature is lowered, the system undergoes a sharp transition to an ordered, ferromagnetic phase; the magnetization jumps to ⟨m⟩ ≠ 0. This means that in the physically relevant configurations (at low temperatures), one Potts state "dominates" and N_max(S) exceeds N/q by a macroscopic number of sites. At very low temperatures ⟨m⟩ ≈ 1 and G_ij ≈ 1 for all pairs {v_i, v_j}. The variance of the magnetization is related to a relevant thermal quantity, the susceptibility,

\chi = \frac{N}{T} \left( \langle m^2 \rangle - \langle m \rangle^2 \right),    (2.6)
which also reflects the thermodynamic phases of the system. At low temperatures, fluctuations of the magnetization are negligible, so the susceptibility χ is small in the ferromagnetic phase. The connection between Potts spins and clusters of aligned spins was established by Fortuin and Kasteleyn (1972). In the article appendix we present such a relation and the probability distribution of such clusters. We turn now to strongly inhomogeneous Potts models. This is the situation when the spins form magnetic "grains," with very strong couplings between neighbors that belong to the same grain and very weak interactions between all other pairs. At low temperatures, such a system is also ferromagnetic, but as the temperature is raised, the system may exhibit an intermediate, superparamagnetic phase. In this phase strongly coupled grains are aligned (that is, are in their respective ferromagnetic phases), while there is no relative ordering of different grains. At the transition temperature from the ferromagnetic to superparamagnetic phase a pronounced peak of χ is observed (Blatt et al., 1996a). In the superparamagnetic phase fluctuations of the state taken by grains acting as a whole (that is, as giant superspins) produce large fluctuations in the magnetization. As the temperature is raised further, the superparamagnetic-to-paramagnetic transition is reached; each grain disorders, and χ abruptly diminishes by a factor that is roughly the size of the largest cluster. Thus, the temperatures where a peak of the susceptibility occurs and the temperatures at which χ decreases abruptly indicate the range of temperatures in which the system is in its superparamagnetic phase. In principle one can have a sequence of several transitions in the superparamagnetic phase. As the temperature is raised, the system may break first into two clusters, each of which breaks into more (macroscopic) subclusters, and so on. Such a hierarchical structure of the magnetic clusters reflects a hierarchical organization of the data into categories and subcategories. To gain some analytic insight into the behavior of inhomogeneous Potts ferromagnets, we calculated the properties of such a "granular" system with a macroscopic number of bonds for each spin. For such "infinite-range" models, mean field is exact, and we have shown (Wiseman, Blatt, & Domany,
1996; Blatt et al., 1996b) that in the paramagnetic phase, the spin state at each site is independent of any other spin, that is, G_ij = 1/q. At the paramagnetic-superparamagnetic transition the correlation between spins belonging to the same group jumps abruptly to

\frac{q-1}{q} \left( \frac{q-2}{q-1} \right)^2 + \frac{1}{q} \simeq 1 - \frac{2}{q} + O\left( \frac{1}{q^2} \right),

while the correlation between spins belonging to different groups is unchanged. The ferromagnetic phase is characterized by strong correlations between all spins of the system:

G_{ij} > \frac{q-1}{q} \left( \frac{q-2}{q-1} \right)^2 + \frac{1}{q}.

There is an important lesson to remember from this: in mean field we see that in the superparamagnetic phase, two spins that belong to the same grain are strongly correlated, whereas for pairs that do not belong to the same grain, G_ij is small. As it turns out, this double-peaked distribution of the correlations is not an artifact of mean field and will be used in our solution of the problem of data clustering. As we will show below, we use the data points of our clustering problem as sites of an inhomogeneous Potts ferromagnet. Presence of clusters in the data gives rise to magnetic grains of the kind described above in the corresponding Potts model. Working in the superparamagnetic phase of the model, we use the values of the pair correlation function of the Potts spins to decide whether a pair of spins does or does not belong to the same grain, and we identify these grains as the clusters of our data. This is the essence of our method.

3 Monte Carlo Simulation of Potts Models: The Swendsen-Wang Method

The aim of equilibrium statistical mechanics is to evaluate sums such as equation 2.2 for models with N ≫ 1 spins (actually one is usually interested in the thermodynamic limit, for example, when the number of spins N → ∞). This can be done analytically only for very limited cases. One resorts therefore to various approximations (such as mean field) or to computer simulations that aim at evaluating thermal averages numerically. Direct evaluation of sums like equation 2.2 is impractical, since the number of configurations S increases exponentially with the system size N. Monte Carlo simulation methods (see Binder & Heermann, 1988, for an introduction) overcome this problem by generating a characteristic subset of configurations, which are used as a statistical sample. They are based
on the notion of importance sampling, in which a set of spin configurations {S_1, S_2, . . . , S_M} is generated according to the Boltzmann probability distribution (see equation 2.3). Then, expression 2.2 is reduced to a simple arithmetic average,

\langle A \rangle \approx \frac{1}{M} \sum_{i=1}^{M} A(S_i),    (3.1)

where the number of configurations in the sample, M, is much smaller than q^N, the total number of configurations. The set of M states necessary for the implementation of equation 3.1 is constructed by means of a Markov process in the configuration space of the system. There are many ways to generate such a Markov chain; in this work it turned out to be essential to use the Swendsen-Wang (Wang & Swendsen, 1990; Swendsen, Wang, & Ferrenberg, 1992) Monte Carlo algorithm (SW). The main reason for this choice is that it is perfectly suitable for working in the superparamagnetic phase: it overturns an aligned cluster in one Monte Carlo step, whereas algorithms that use standard local moves will take forever to do this. The first configuration can be chosen at random (or by setting all s_i = 1). Say we already generated n configurations of the system, {S_i}, i = 1, . . . , n, and we start to generate configuration n + 1. This is the way it is done. First, visit all pairs of spins ⟨i, j⟩ that interact, that is, have J_ij > 0; the two spins are frozen together with probability

p^f_{i,j} = \left( 1 - \exp\left( -\frac{J_{ij}}{T} \right) \right) \delta_{s_i, s_j}.    (3.2)
That is, if in our current configuration S_n the two spins are in the same state, s_i = s_j, then sites i and j are frozen with probability p^f = 1 − exp(−J_ij/T). Having gone over all the interacting pairs, the next step of the algorithm is to identify the SW clusters of spins. An SW cluster contains all spins that have a path of frozen bonds connecting them. Note that according to equation 3.2, only spins of the same value can be frozen in the same SW cluster. After this step, our N sites are assigned to some number of distinct SW clusters. If we think of the N sites as vertices of a graph whose edges are the interactions between neighbors J_ij > 0, each SW cluster is a subgraph of vertices connected by frozen bonds. The final step of the procedure is to generate the new spin configuration S_{n+1}. This is done by drawing at random, independently for each SW cluster, a value s = 1, . . . , q, which is assigned to all its spins. This defines one Monte Carlo step S_n → S_{n+1}. By iterating this procedure M times while calculating the physical quantity A(S_i) at each Monte Carlo step, the thermodynamic average (see equation 3.1) is obtained. The physical quantities that we are interested in are the magnetization (see equation 2.4) and its square
value for the calculation of the susceptibility χ, and the spin-spin correlation function (see equation 2.5). Actually, in most simulations a number of the early configurations are discarded, to allow the system to "forget" its initial state. This is not necessary if the number of configurations M is not too small (increasing M improves the statistical accuracy of the Monte Carlo measurement). Measuring autocorrelation times (Gould & Tobochnik, 1989) provides a way of both deciding on the number of discarded configurations and checking that the number of configurations M generated is sufficiently large. A less rigorous way is simply plotting the energy as a function of the number of SW steps and verifying that the energy reached a stable regime. At temperatures where large regions of correlated spins occur, local methods (such as Metropolis), which flip one spin at a time, become very slow. The SW procedure overcomes this difficulty by flipping large clusters of aligned spins simultaneously. Hence the SW method exhibits much smaller autocorrelation times than local methods. The efficiency of the SW method, which is widely used in numerous applications, has been tested in various Potts (Billoire et al., 1991) and Ising (Hennecke & Heyken, 1993) models.

4 Clustering of Data: Detailed Description of the Algorithm

So far we have defined the Potts model, the various thermodynamic functions that one measures for it, and the (numerical) method used to measure these quantities. We can now turn to the problem for which these concepts will be utilized: clustering of data. For the sake of concreteness, assume that our data consist of N patterns or measurements v_i, specified by N corresponding vectors x_i, embedded in a D-dimensional metric space. Our method consists of three stages. The starting point is the specification of the Hamiltonian (see equation 2.1), which governs the system. Next, by measuring the susceptibility χ and magnetization as a function of temperature, the different phases of the model are identified. Finally, the correlation of neighboring pairs of spins, G_ij, is measured. This correlation function is then used to partition the spins and the corresponding data points into clusters. The outline of the three stages and the subtasks contained in each can be summarized as follows:

1. Construct the physical analog Potts spin problem:
   (a) Associate a Potts spin variable s_i = 1, 2, . . . , q to each point v_i.
   (b) Identify the neighbors of each point v_i according to a selected criterion.
   (c) Calculate the interaction J_ij between neighboring points v_i and v_j.

2. Locate the superparamagnetic phase:
   (a) Estimate the (thermal) average magnetization, ⟨m⟩, for different temperatures.
   (b) Use the susceptibility χ to identify the superparamagnetic phase.

3. In the superparamagnetic regime:
   (a) Measure the spin-spin correlation, G_ij, for all neighboring points v_i, v_j.
   (b) Construct the data clusters.

In the following subsections we provide detailed descriptions of the manner in which each of the three stages is to be implemented.

4.1 The Physical Analog Potts Spin Problem. The goal is to specify the Hamiltonian of the form in equation 2.1 that serves as the physical analog of the data points to be clustered. One has to assign a Potts spin to each data point and introduce short-range interactions between spins that reside on neighboring points. Therefore we have to choose the value of q, the number of possible states a Potts spin can take, define what is meant by neighbor points, and provide the functional dependence of the interaction strength J_ij on the distance between neighboring spins. We discuss now the possible choices for these attributes of the Hamiltonian and their influence on the algorithm's performance. The most important observation is that none of them needs fine tuning; the algorithm performs well provided a reasonable choice is made, and the range of reasonable choices is very wide.

4.1.1 The Potts Spin Variables. The number of Potts states, q, determines mainly the sharpness of the transitions and the temperatures at which they occur. The higher the q, the sharper the transition. (For a two-dimensional regular lattice, one must have q > 4 to ensure that the transition is of first order, in which case the order parameter exhibits a discontinuity; Baxter, 1973; Wu, 1982.) On the other hand, in order to maintain a given statistical accuracy, it is necessary to perform longer simulations as the value of q increases. From our simulations we conclude that the influence of q on the resulting classification is weak. We used q = 20 in all the examples presented in this work. Note that the value of q does not imply any assumption about the number of clusters present in the data.

4.1.2 Identifying Neighbors. The need for identification of the neighbors of a point x_i could be eliminated by letting all pairs i, j of Potts spins interact with each other via a short-range interaction J_ij = f(d_ij), which decays sufficiently fast (say, exponentially or faster) with the distance between the
two data points. The phases and clustering properties of the model will not be affected strongly by the choice of f. Such a model has O(N²) interactions, which makes its simulation rather expensive for large N. For the sake of computational convenience, we decided to keep only the interactions of a spin with a limited number of neighbors, and to set all other J_ij to zero. Since the data do not form a regular lattice, one has to supply some reasonable definition for "neighbors." As it turns out, our results are quite insensitive to the particular definition used. Ahuja (1982) argues for intuitively appealing characteristics of Delaunay triangulation over other graph structures in data clustering. We use this definition when the patterns are embedded in a low-dimensional (D ≤ 3) space. For higher dimensions, we use the mutual neighborhood value; we say that v_i and v_j have a mutual neighborhood value K, if and only if v_i is one of the K-nearest neighbors of v_j and v_j is one of the K-nearest neighbors of v_i. We chose K such that the interactions connect all data points to one connected graph. Clearly K grows with the dimensionality. We found it convenient, in cases of very high dimensionality (D > 100), to fix K = 10 and to superimpose on the edges obtained with this criterion the edges corresponding to the minimal spanning tree associated with the data. We use this variant only in the examples presented in sections 5.2 and 5.3.

4.1.3 Local Interaction. In order to have a model with the physical properties of a strongly inhomogeneous granular magnet, we want strong interaction between spins that correspond to data from a high-density region and weak interactions between neighbors that are in low-density regions. To this end and in common with other local methods, we assume that there is a local length scale ∼ a, which is defined by the high-density regions and is smaller than the typical distance between points in the low-density regions. This a is the characteristic scale over which our short-range interactions decay. We tested various choices but report here only results that were obtained using

J_{ij} = \begin{cases} \dfrac{1}{\hat{K}} \exp\left( -\dfrac{d_{ij}^2}{2a^2} \right) & \text{if } v_i \text{ and } v_j \text{ are neighbors} \\[1ex] 0 & \text{otherwise.} \end{cases}    (4.1)
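As a concrete illustration of equation 4.1, here is a hedged sketch (ours, not from the paper), with a and K̂ computed as described in the next paragraph. For brevity the neighbor graph is plain K-nearest-neighbor, a simplification of the mutual-neighborhood and Delaunay criteria of section 4.1.2:

import numpy as np

# Sketch only: build the couplings of equation 4.1 from a data matrix X (N x D).
def build_couplings(X, K=10):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    pairs = set()
    for i in range(n):
        for j in np.argsort(d[i])[1:K + 1]:                    # skip self at position 0
            pairs.add((min(i, j), max(i, j)))
    pairs = np.array(sorted(pairs))
    dij = d[pairs[:, 0], pairs[:, 1]]
    a = dij.mean()                 # local length scale: mean neighbor distance
    K_hat = 2 * len(pairs) / n     # average number of neighbors
    J = np.exp(-dij**2 / (2 * a**2)) / K_hat
    return pairs, J, a, K_hat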
We chose the local length scale, a, to be the average of all distances d_ij between neighboring pairs v_i and v_j. K̂ is the average number of neighbors; it is twice the number of nonvanishing interactions divided by the number of points N. This careful normalization of the interaction strength enables us to estimate the temperature corresponding to the highest superparamagnetic transition (see section 4.2). Everything done so far can be easily implemented in the case when instead of providing the x_i for all the data we have an N × N matrix of dissim-
ilarities d_ij. This was tested in experiments for clustering of images where only a measure of the dissimilarity between them was available (Gdalyahu & Weinshall, 1997). Application of other clustering methods would have necessitated embedding these data in a metric space; the need for this was eliminated by using superparamagnetic clustering. The results obtained by applying the method on the matrix of dissimilarities of these images were excellent; all points were classified with no error. (Interestingly, the triangle inequality was violated in about 5 percent of the cases.)

4.2 Locating the Superparamagnetic Regions. The various temperature intervals in which the system self-organizes into different partitions into clusters are identified by measuring the susceptibility χ as a function of temperature. We start by summarizing the Monte Carlo procedure and conclude by providing an estimate of the highest transition temperature to the superparamagnetic regime. Starting from this estimate, one can take increasingly refined temperature scans and calculate the function χ(T) by Monte Carlo simulation. We used the SW method described in section 3, with the following procedure (a code sketch follows the list):

1. Choose the number of iterations M to be performed.
2. Generate the initial configuration by assigning a random value to each spin.
3. Assign a frozen bond between nearest neighbor points v_i and v_j with probability p^f_{i,j} (see equation 3.2).
4. Find the connected subgraphs, the SW clusters.
5. Assign new random values to the spins (spins that belong to the same SW cluster are assigned the same value). This is the new configuration of the system.
6. Calculate the value assumed by the physical quantities of interest in the new spin configuration.
7. Go to step 3 unless the maximal number of iterations, M, was reached.
8. Calculate the averages (see equation 3.1).

The superparamagnetic phase can contain many different subphases with different ordering properties. A typical example can be generated by data with a hierarchical structure, giving rise to different acceptable partitions of the data. We measure the susceptibility χ at different temperatures in order to locate these different regimes. The aim is to identify the temperatures at which the system changes its structure.
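The procedure above can be sketched as follows. This is our own illustrative code, not the authors' implementation; the union-find bookkeeping and all names are ours, and pairs and J are the neighbor list and couplings of equation 4.1:

import numpy as np

# One Swendsen-Wang sweep (steps 3-5) plus a temperature scan for chi(T).
def sw_sweep(s, pairs, J, T, q, rng):
    n = s.size
    parent = np.arange(n)
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    # Step 3: freeze aligned neighbors with probability 1 - exp(-J_ij / T).
    freeze = (s[pairs[:, 0]] == s[pairs[:, 1]]) & \
             (rng.random(len(pairs)) < 1.0 - np.exp(-J / T))
    for i, j in pairs[freeze]:              # step 4: SW clusters via union-find
        parent[find(i)] = find(j)
    roots = np.array([find(a) for a in range(n)])
    return rng.integers(0, q, size=n)[roots]  # step 5: one new value per cluster

def susceptibility_scan(pairs, J, n, temps, q=20, sweeps=500, seed=0):
    rng = np.random.default_rng(seed)
    chi = []
    for T in temps:
        s = rng.integers(0, q, size=n)      # step 2: random initial configuration
        m = []
        for _ in range(sweeps):             # steps 3-7
            s = sw_sweep(s, pairs, J, T, q, rng)
            counts = np.bincount(s, minlength=q)
            m.append((q * counts.max() - n) / ((q - 1) * n))
        m = np.asarray(m)                   # step 8: averages, then equation 2.6
        chi.append(n / T * (np.mean(m**2) - np.mean(m)**2))
    return np.array(chi)

In practice one would also discard early sweeps for equilibration, as discussed in section 3; the sketch omits this for brevity.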
Interestingly, the triangle inequality was violated in about 5 percent of the cases.
The superparamagnetic phase is characterized by a nonvanishing susceptibility. Moreover, there are two basic features of χ in which we are interested. The first is a peak in the susceptibility, which signals a ferromagnetic-to-superparamagnetic transition, at which a large cluster breaks into a few smaller (but still macroscopic) clusters. The second feature is an abrupt decrease of the susceptibility, corresponding to a superparamagnetic-to-paramagnetic transition, in which one or more large clusters have melted (i.e., they broke up into many small clusters). The location of the superparamagnetic-to-paramagnetic transition, which occurs at the highest temperature, can be roughly estimated by the following considerations. First, we approximate the clusters by an ordered lattice of coordination number K̂ and a constant interaction

J \approx \langle\langle J_{ij} \rangle\rangle = \left\langle\!\left\langle \frac{1}{\hat{K}} \exp\left( -\frac{d_{ij}^2}{2a^2} \right) \right\rangle\!\right\rangle \approx \frac{1}{\hat{K}} \exp\left( -\frac{\langle\langle d_{ij}^2 \rangle\rangle}{2a^2} \right),

where ⟨⟨· · ·⟩⟩ denotes the average over all neighbors. Second, from the Potts model on a square lattice (Wu, 1982), we get that this transition should occur at roughly

T \approx \frac{1}{4 \log(1 + \sqrt{q})} \exp\left( -\frac{\langle\langle d_{ij}^2 \rangle\rangle}{2a^2} \right).    (4.2)
An estimate based on the mean field model yields a very similar value.

4.3 Identifying the Data Clusters. Once the superparamagnetic phase and its different subphases have been identified, we select one temperature in each region of interest. The rationale is that each subphase characterizes a particular type of partition of the data, with new clusters merging or breaking. On the other hand, as the temperature is varied within a phase, one expects only shrinking or expansion of the existing clusters, changing only the classification of the points on the boundaries of the clusters.

4.3.1 The Spin-Spin Correlation. We use the spin-spin correlation function G_ij, between neighboring sites v_i and v_j, to build the data clusters. In principle we have to calculate the thermal average (see equation 3.1) of δ_{s_i,s_j} in order to obtain G_ij. However, the SW method provides an improved estimator (Niedermayer, 1990) of the spin-spin correlation function. One calculates the two-point connectedness C_ij, the probability that sites v_i and v_j belong to the same SW cluster, which is estimated by the average (see equation 3.1) of the following indicator function:

c_{ij} = \begin{cases} 1 & \text{if } v_i \text{ and } v_j \text{ belong to the same SW cluster} \\ 0 & \text{otherwise.} \end{cases}
C_ij = ⟨c_ij⟩ is the probability of finding sites v_i and v_j in the same SW cluster. Then the relation (Fortuin & Kasteleyn, 1972)

G_{ij} = \frac{(q - 1) \, C_{ij} + 1}{q}    (4.3)
is used to obtain the correlation function G_ij.

4.3.2 The Data Clusters.
Clusters are identified in three steps:
1. Build the clusters' core using a thresholding procedure; if G_ij > 0.5, a link is set between the neighbor data points v_i and v_j. The resulting connected graph depends weakly on the value (0.5) used in this thresholding, as long as it is bigger than 1/q and less than 1 − 2/q. The reason is that the distribution of the correlations between two neighboring spins peaks strongly at these two values and is very small between them (see Figure 3b).

2. Capture points lying on the periphery of the clusters by linking each point v_i to its neighbor v_j of maximal correlation G_ij. It may happen, of course, that points v_i and v_j were already linked in the previous step.

3. Data clusters are identified as the linked components of the graphs obtained in steps 1 and 2.

Although it would be completely equivalent to use in steps 1 and 2 the two-point connectedness, C_ij, instead of the spin-spin correlation, G_ij, we considered the latter to stress the relation of our method with the physical analogy we are using. (A code sketch of these three steps is given after Figure 3.)

5 Applications

The approach presented in this article has been successfully tested on a variety of data sets. The six examples we discuss were chosen with the intention of demonstrating the main features and utility of our algorithm, to which we refer as the superparamagnetic clustering (SPC) method. We use both artificial and real data. Comparisons with the performance of other classical (nonparametric) methods are also presented. We refer to different clustering methods by the nomenclature used by Jain and Dubes (1988) and Fukunaga (1990). The nonparametric algorithms we have chosen belong to four families: (1) hierarchical methods: single link and complete link; (2) graph theory–based methods: Zhan's minimal spanning tree and Fukunaga's directed graph method; (3) nearest-neighbor clustering type, based on different proximity measures: the mutual neighborhood clustering algorithm and k-shared neighbors; and (4) density estimation: Fukunaga's valley-seeking method. These algorithms are of the same kind as the superparamagnetic method
in the sense that only weak assumptions are required about the underlying data structure. The results from all these methods depend on various parameters in an uncontrolled way; we always used the best result that was obtained. A unifying view of some of these methods in the framework of the work discussed here is presented in the article appendix.

5.1 A Pedagogical 2-Dimensional Example. The main purpose of this simple example is to illustrate the features of the method discussed, in particular the behavior of the susceptibility and its use for the identification of the two kinds of phase transitions. The influence of the number of Potts states, q, and the partition of the data as a function of the temperature are also discussed. The toy problem of Figure 1 consists of 4800 points in D = 2 dimensions whose angular distribution is uniform and whose radial distribution is normal with variance 0.25; θ ∼ U[0, 2π], r ∼ N[R, 0.25]; we generated half the points with R = 3, one-third with R = 2, and one-sixth with R = 1. Since there is a small overlap between the clusters, we consider the Bayes solution as the optimal result; that is, points whose distance to the origin is bigger than 2.5 are considered a cluster, points whose radial coordinate lies between 1.5 and 2.5 are assigned to a second cluster, and the remaining points define the third cluster. These optimal clusters consist of 2393, 1602, and 805 points, respectively. By applying our procedure, and choosing the neighbors according to the mutual neighborhood criterion with K = 10, we obtain the susceptibility as a function of the temperature as presented in Figure 2a. The temperature estimated from equation 4.2 for the superparamagnetic-to-paramagnetic transition is 0.075, which is in good agreement with the one inferred from Figure 2a. Figure 1 presents the clusters obtained at T = 0.05. The sizes of the three largest clusters are 2358, 1573, and 779, including 98 percent of the data; the classification of all these points coincides with that of the optimal Bayes classifier. The remaining 90 points are distributed among 43 clusters of size smaller than 4. As can be noted in Figure 1, the small clusters (fewer than 4 points) are located at the boundaries between the main clusters. One of the most salient features of the SPC method is that the spin-spin correlation function, G_ij, reflects the existence of two categories of neighboring points: neighboring points that belong to the same cluster and those that do not. This can be observed from Figure 3b, the two-peaked frequency distribution of the correlation function G_ij between neighboring points of
Figure 1: Data distribution: The angular coordinate is uniformly distributed, that is, U[0, 2π], while the radial one is normal N[R, 0.25] distributed around three different radii R. The outer cluster (R = 3.0) consists of 2400 points, the central one (R = 2.0) of 1600, and the inner one (R = 1.0) of 800. The classified data set: Points classified at T = 0.05 as belonging to the three largest clusters are marked by circles (outer cluster, 2358 points), squares (central cluster, 1573 points), and triangles (inner cluster, 779 points). The x's denote the 90 remaining points, which are distributed in 43 clusters, the biggest of size 4.
Figure 1. In contrast, the frequency distribution (see Figure 3a) of the normalized distances d_ij/a between neighboring points of Figure 1 contains no hint of the existence of a natural cutoff distance that separates neighboring points into two categories. It is instructive to observe the behavior of the size of the clusters as a function of the temperature, presented in Figure 2b. At low temperatures, as expected, all data points form only one cluster. At the ferromagnetic-to-superparamagnetic transition temperature, indicated by a peak in the susceptibility, this cluster splits into three. These essentially remain stable in
Figure 2: (a) The susceptibility density χ T/N of the data set of Figure 1 as a function of the temperature. (b) Size of the four biggest clusters obtained at each temperature.
their composition until the superparamagnetic-to-paramagnetic transition temperature is reached, expressed as a sudden decrease of the susceptibility χ, where the clusters melt. Turning now to the effect of the parameters on the procedure, we found (Wiseman et al., 1996) that the number of Potts states q affects the sharpness of the transition, but the obtained classification is almost the same. For instance, choosing q = 5 we found that the three largest clusters contained 2349, 1569, and 774 data points, while taking q = 200 yielded 2354, 1578, and 782. Of all the algorithms listed at the beginning of this section, only the single-link and minimal spanning tree methods were able to give (at the optimal values of their clustering parameter) a partition that reflects the underlying distribution of the data. The best results are summarized in Table 1, together
Figure 3: Frequency distribution of (a) distances between neighboring points of Figure 1 (scaled by the average distance a), and (b) spin-spin correlation of neighboring points.
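The double-peaked distribution in Figure 3b is what makes the 0.5 threshold of section 4.3.2 robust. The following minimal sketch (ours; all names are illustrative) implements the three-step cluster construction from the estimated pair correlations:

import numpy as np

# Sketch (ours) of section 4.3.2: build clusters from pair correlations G_ij.
def build_clusters(pairs, G, n, theta=0.5):
    links = [tuple(p) for p, g in zip(pairs, G) if g > theta]   # step 1: cores
    best = {}                                                   # step 2: periphery
    for (i, j), g in zip(pairs, G):
        for a, b in ((i, j), (j, i)):
            if g > best.get(a, (-1.0, None))[0]:
                best[a] = (g, b)                                # max-correlation neighbor
    links += [(a, b) for a, (_, b) in best.items()]
    parent = list(range(n))                                     # step 3: components
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in links:
        parent[find(a)] = find(b)
    return [find(a) for a in range(n)]                          # cluster label per point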
Table 1: Clusters Obtained with the Methods That Succeeded in Recovering the Structure of the Data.

Method                        Outer Cluster   Central Cluster   Inner Cluster   Unclassified Points
Bayes                         2393            1602              805             —
Superparamagnetic (q = 200)   2354            1578              782             86
Superparamagnetic (q = 20)    2358            1573              779             90
Superparamagnetic (q = 5)     2349            1569              774             108
Single link                   2255            1513              758             274
Minimal spanning tree         2262            1487              756             295
Notes: Points belonging to clusters of fewer than 50 points are counted as unclassified. The Bayes method is used as the benchmark because it is the one that minimizes the expected number of mistakes, provided that the distribution that generated the set of points is known.
with those of the SPC method. Clearly, the standard parametric methods (such as K-means or Ward’s method) would not be able to give a reasonable answer because they assume that different clusters are parameterized by different centers and a spread around them. In Figure 4 we present, for the methods that depend on only a single parameter, the sizes of the four biggest clusters that were obtained as a function of the clustering parameter. The best solution obtained with the single-link method (for a narrow range of the parameter) corresponds also
Figure 4: Size of the three biggest clusters as a function of the clustering parameter obtained with (a) single-link, (b) directed graph, (c) complete-link, (d) mutual neighborhood, (e) valley-seeking, and (f) shared-neighborhood algorithm. The arrow in (a) indicates the region corresponding to the optimal partition for the single-link method. The other algorithms were unable to recover the data structure.
to three big clusters of 2255, 1513, and 758 points, respectively, while the remaining clusters are of size smaller than 14. For a larger threshold distance, the second and third clusters are linked. This classification is slightly worse than the one obtained by the superparamagnetic method. When comparing SPC with single link, one should note that if the "correct" answer is not known, one has to rely on measurements such as the stability of the largest clusters (existence of a plateau) to indicate the quality of the partition. As can be observed from Figure 4a, there is no clear signal indicating which plateau corresponds to the optimal partition among the whole hierarchy yielded by single link. The best result obtained with the minimal spanning tree method is very similar to the one obtained with
the single link, but this solution corresponds to a very small fraction of its parameter space. In comparison, SPC allows clear identification of the relevant superparamagnetic phase; the entire temperature range of this regime yields excellent clustering results.

5.2 Only One Cluster. Most existing algorithms impose a partition on the data even when there are no natural classes present. The aim of this example is to show how the SPC algorithm signals this situation. Two different 100-dimensional data sets of 1000 samples are used. The first data set is taken from a gaussian distribution centered at the origin, with covariance matrix equal to the identity. The second data set consists of points generated randomly from a uniform distribution in a hypercube of side 2. The susceptibility curves obtained by applying the SPC method to these data sets are shown in Figures 5a and 5b. The narrow peak and the absence of a plateau indicate that there is only a single phase transition (ferromagnetic to paramagnetic), with no superparamagnetic phase. This single transition is also evident from Figures 5c and 5d, where only one cluster of almost 1000 points appears below the transition. This single "macroscopic" cluster "melts" at the transition into many "microscopic" clusters of 1 to 3 points each. Clearly, all existing methods are able to give the correct answer here, since it is always possible to set the parameters such that this trivial solution is obtained. Again, however, there is no clear indicator for the correct value of the control parameters of the different methods.

5.3 Performance: Scaling with Data Dimension and Influence of Irrelevant Features. The aim of this example is to show the robustness of the SPC method and to give an idea of the influence of the dimension of the data on its performance. To this end, we generated N D-dimensional points whose density distribution is a mixture of two isotropic gaussians, that is,
$$P(\vec{x}) = \left(\sqrt{2\pi\sigma^2}\right)^{-D} \left[ \exp\left(-\frac{\|\vec{x}-\vec{y}_1\|^2}{2\sigma^2}\right) + \exp\left(-\frac{\|\vec{x}-\vec{y}_2\|^2}{2\sigma^2}\right) \right], \tag{5.1}$$
where $\vec{y}_1$ and $\vec{y}_2$ are the centers of the gaussians and $\sigma$ determines their width. Since the two characteristic lengths involved are $\|\vec{y}_1 - \vec{y}_2\|$ and $\sigma$, the relevant parameter of this example is the normalized distance,

$$L = \frac{\|\vec{y}_1 - \vec{y}_2\|}{\sigma}.$$
The manner in which these data points were generated satisfies precisely the hypothesis about the data distribution that is assumed by the K-means algorithm. Therefore, it is clear that this algorithm (with K = 2) will achieve the Bayes optimal result; the same holds for other parametric methods, such as maximum likelihood (once a two-gaussian distribution for the data is assumed).
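For concreteness, data sets of this kind can be drawn directly from equation 5.1. The following sketch (ours, not the authors'; all names are assumptions) generates N points in D dimensions at a prescribed normalized distance L, placing the centers symmetrically about the origin:

```python
import numpy as np

def two_gaussian_mixture(n_points, dim, L, sigma=1.0, seed=0):
    """Draw n_points from the mixture of equation 5.1: two isotropic
    gaussians of width sigma whose centers are a distance L * sigma apart."""
    rng = np.random.default_rng(seed)
    # Centers on the first coordinate axis, so that ||y1 - y2|| = L * sigma.
    y1 = np.zeros(dim); y1[0] = +0.5 * L * sigma
    y2 = np.zeros(dim); y2[0] = -0.5 * L * sigma
    # Each point comes from either gaussian with probability 1/2.
    which = rng.integers(0, 2, size=n_points)
    centers = np.where(which[:, None] == 0, y1, y2)
    return centers + sigma * rng.standard_normal((n_points, dim)), which

# The experiment quoted below: 4000 points in 200 dimensions with L = 13.
X, true_labels = two_gaussian_mixture(n_points=4000, dim=200, L=13.0)
```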
Figure 5: Susceptibility density χT/N as a function of the temperature T for data points (a) uniformly distributed in a hypercube of side 2 and (b) multinormally distributed with a covariance matrix equal to the identity in a 100-dimensional space. The sizes of the two biggest clusters obtained at each temperature are presented in (c) and (d), respectively.
Although such algorithms have an obvious advantage over SPC for these kinds of data, it is interesting to get a feeling for the loss in the quality of the results caused by using our method, which relies on fewer assumptions. To this end we considered the case of 4000 points generated in a 200-dimensional space from the distribution in equation 5.1, setting the parameter L = 13.0. The two biggest clusters we obtained were of sizes 1853 and 1816; the smaller ones contained fewer than 4 points each. About 8.0 percent of the points were left unclassified, but all those points that the method did assign to one of the two large clusters were classified in agreement with a Bayes classifier. For comparison we applied the single-linkage algorithm to the same data; at the best classification point, 74 percent of the points were unclassified. Next we studied the minimal distance, $L_c$, at which the method is able to recognize that two clusters are present in the data, and the dependence of $L_c$ on the dimension D and the number of samples N. Note that the
lower bound for the minimal discriminant distance for any nonparametric algorithm is 2 (for any dimension D). Below this distance, the distribution is no longer bimodal; rather, the maximal density of points is located at the midpoint between the gaussian centers. Sets of N = 1000, 2000, 4000, and 8000 samples and space dimensions D = 2, 10, 100, and 1000 were tested. We set the number of neighbors to K = 10 and superimposed the minimal spanning tree to ensure that at T = 0, all points belong to the same cluster. To our surprise, we observed that in the range 1000 ≤ N ≤ 8000, the critical distance depends only weakly on the number of samples, N. The second remarkable result is that the critical discriminant distance $L_c$ grows very slowly with the dimensionality of the data points, D. Apparently the minimal discriminant distance $L_c$ increases like the logarithm of the number of dimensions D,

$$L_c \approx \alpha + \beta \log D, \tag{5.2}$$
where $\alpha$ and $\beta$ do not depend on D. The best fit in the range 2 ≤ D ≤ 1000 yields $\alpha = 2.3 \pm 0.3$ and $\beta = 1.3 \pm 0.2$. Thus, this example suggests that the dimensionality of the points does not affect the performance of the method significantly. A more careful interpretation is that the method is robust against irrelevant features present in the characterization of the data. Clearly there is only one relevant feature in this problem, which is given by the projection

$$x' = \frac{\vec{y}_1 - \vec{y}_2}{\|\vec{y}_1 - \vec{y}_2\|} \cdot \vec{x}.$$
The Bayes classifier, which has the lowest expected error, is implemented by assigning $\vec{x}_i$ to cluster 1 if $x'_i < 0$ and to cluster 2 otherwise. Therefore we can consider the other D − 1 dimensions as irrelevant features, because they do not carry any relevant information. Thus, equation 5.2 tells us how noise, expressed as the number of irrelevant features present, affects the performance of the method. Adding pure noise variables to the true signal can lead to considerable confusion when classical methods are used (Fowlkes, Gnanadesikan, & Kettering, 1988).
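A minimal sketch of this Bayes rule (ours; it assumes the centers are known, and subtracts their midpoint so that the sign test of the text, which presumes centers symmetric about the origin, applies in general):

```python
import numpy as np

def bayes_classify(X, y1, y2):
    """Bayes rule for the equal-weight mixture of equation 5.1: project
    each point on the direction y1 - y2 (measured from the midpoint
    between the centers) and assign it by the sign of the projection."""
    direction = (y1 - y2) / np.linalg.norm(y1 - y2)
    midpoint = 0.5 * (y1 + y2)
    x_prime = (X - midpoint) @ direction
    # x' < 0 -> cluster 1, x' >= 0 -> cluster 2, as in the text.
    return np.where(x_prime < 0, 1, 2)
```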
5.4 The Iris Data. The first "real" example we present is the time-honored Anderson-Fisher Iris data, a popular benchmark problem for clustering procedures. It consists of measurements of four quantities performed on each of 150 flowers. The specimens were chosen from three species of Iris, so the data constitute 150 points in four-dimensional space. The purpose of this experiment is to present a slightly more complicated scenario than that of Figure 1. From the projection on the plane spanned by the first two principal components, presented in Figure 6, we observe that there is a well-separated cluster (corresponding to the Iris setosa species)
Figure 6: Projection of the iris data on the plane spanned by its two principal components.
while the clusters corresponding to Iris virginica and Iris versicolor overlap. We determined neighbors in the D = 4 dimensional space according to the mutual K nearest neighbors definition (with K = 5), applied the SPC method, and obtained the susceptibility curve of Figure 7a; it clearly shows two peaks. When heated, the system first breaks into two clusters at T ≈ 0.1. At $T_{clus}$ = 0.2 we obtain two clusters, of sizes 80 and 40; the points of the smaller cluster correspond to the species Iris setosa. At T ≈ 0.6 another transition occurs; the larger cluster splits in two. At $T_{clus}$ = 0.7 we identified clusters of sizes 45, 40, and 38, corresponding to the species Iris versicolor, virginica, and setosa, respectively. As opposed to the toy problems, the Iris data break into clusters in two stages. This reflects the fact that two of the three species are "closer" to each other than to the third one; the SPC method clearly handles such hierarchical organization of the data very well. Among the samples, 125 were classified correctly (as compared with manual classification), and 25 were left unclassified. No further breaking of clusters was observed; all three disorder at $T_{ps}$ ≈ 0.8 (since all three are of about the same density). The best results of all the clustering algorithms used in this work, together with those of the SPC method, are summarized in Table 2.
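The mutual K-nearest-neighbor definition used above declares $v_i$ and $v_j$ neighbors only if each is among the K nearest points of the other. A brute-force sketch (ours; fine for 150 points, though the O(N²) distance matrix would not scale to large data sets):

```python
import numpy as np

def mutual_knn_edges(X, K=5):
    """Edges of the mutual K-nearest-neighbor graph: vi and vj are
    neighbors iff vi is among the K nearest points of vj and vice versa."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :K]   # indices of the K nearest points
    is_knn = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), K)
    is_knn[rows, knn.ravel()] = True
    mutual = is_knn & is_knn.T           # keep only the symmetric pairs
    return [(i, j) for i, j in zip(*np.where(mutual)) if i < j]
```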
Figure 7: (a) The susceptibility density χT/N as a function of the temperature and (b) the size of the four biggest clusters obtained at each temperature for the Iris data.
Among these, the minimal spanning tree procedure obtained the most accurate result, followed by our method, while the remaining clustering techniques failed to provide a satisfactory result.

5.5 LANDSAT Data. Clustering techniques have been very popular in remote sensing applications (Faber, Hochberg, Kelly, Thomas, & White, 1994; Kelly & White, 1993; Kamata, Kawaguchi, & Niimi, 1995; Larch, 1994; Kamata, Eason, & Kawaguchi, 1991). Multispectral scanners on LANDSAT satellites sense the electromagnetic energy of the light reflected by the earth's surface in several bands (or wavelengths) of the spectrum. A pixel represents the smallest area on the earth's surface that can be separated from the neighboring areas. The pixel size and the number of bands vary, depending on the scanner; in this case, four bands are utilized, with a pixel resolution of 80 × 80 meters. Two of the wavelengths are in the visible region, corresponding approximately to green (0.52–0.60 µm) and red (0.63–0.69 µm), and the other two are in the near-infrared (0.76–0.90 µm) and mid-infrared (1.55–1.75 µm) regions. The wavelength interval associated with each band is tuned to a particular cover category. For example, the green band is useful
Table 2: Best Partition Obtained with the Clustering Methods.

Method                      Biggest Cluster   Middle Cluster   Smallest Cluster
Minimal spanning tree              50                50                50
Superparamagnetic                  45                40                38
Valley seeking                     67                42                37
Complete link                      81                39                30
Directed graph                     90                30                30
K-shared neighbors                 90                30                30
Single link                       101                30                19
Mutual neighborhood value         101                30                19

Note: Only the minimal spanning tree and the superparamagnetic method returned clusters where points belonging to different Iris species were not mixed.
for identifying areas of shallow water, such as shoals and reefs, whereas the red band emphasizes urban areas. The data consist of 6437 samples that are contained in a rectangle of 82 × 100 pixels. Each "data point" is described by thirty-six features that correspond to a 3 × 3 square of pixels. A classification label (ground truth) of the central pixel is also provided. The data are given in random order, and certain samples have been removed, so that one cannot reconstruct the original image. The data were provided by Srinivasan (1994) and are available at the University of California at Irvine (UCI) Machine Learning Repository (Murphy & Aha, 1994). The goal is to find the "natural classes" present in the data (without using the labels, of course). The quality of our results is determined by the extent to which the clustering reflects the six terrain classes present in the data: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil. This exercise is close to a real problem of remote sensing, where the true labels (ground truth) on the pixels are not available, and therefore clustering techniques are needed to group pixels on the basis of the sensed observations. We used the projection pursuit method (Friedman, 1987), a dimension-reducing transformation, in order to gain some knowledge about the organization of the data. Among the first six two-dimensional projections that were produced, we present in Figure 8 the one that best reflects the (known) structure of the data. We observe that the clusters differ in their density, there is unequal coupling between clusters, and the density of the points within a cluster is not uniform but rather decreases toward the perimeter of the cluster. The susceptibility curve in Figure 9a reveals four transitions that reflect
Figure 8: Best two-dimensional projection pursuit among the first six solutions for the LANDSAT data.
the presence of the following hierarchy of clusters (see Figure 10). At the lowest temperature, two clusters, A and B, appear. Cluster A splits at the second transition into $A_1$ and $A_2$. At the next transition, cluster $A_1$ splits into $A_1^1$ and $A_1^2$. At the last transition, cluster $A_2$ splits into four clusters $A_2^i$, i = 1, ..., 4. At this temperature the clusters $A_1$ and B are no longer identifiable; their spins are in a disordered state, since the density of points in $A_1$ and B is significantly smaller than within the $A_2^i$ clusters. Thus, the superparamagnetic method overcomes the difficulty of dealing with clusters of different densities by analyzing the data at several temperatures. This hierarchy indeed reflects the structure of the data. The clusters obtained in the temperature range 0.08 to 0.12 coincide with the picture obtained by projection pursuit: cluster B corresponds to the cotton crop terrain class, $A_1$ to red soil, and the remaining four terrain classes are grouped in cluster $A_2$. The clusters $A_1^1$ and $A_1^2$ are a partition of the red soil, while $A_2^1$, $A_2^2$, $A_2^3$, and $A_2^4$ correspond, respectively, to the classes grey soil, very damp grey
Figure 9: (a) Susceptibility density χT/N of the LANDSAT data as a function of the temperature T. The numbers in parentheses indicate the phase transitions. (b) The sizes of the four biggest clusters at each temperature. The jumps indicate that a cluster has been split. The symbols A, B, $A_i$, and $A_i^j$ correspond to the hierarchy depicted in Figure 10.
soil, damp grey soil, and soil with vegetation stubble.4 Ninety-seven percent purity was obtained, meaning that points belonging to different categories were almost never assigned to the same cluster. Only the optimal answer of Fukunaga's valley-seeking method and our SPC method succeeded in recovering the structure of the LANDSAT data. Fukunaga's method, however, yielded grossly different answers for different (random) initial conditions; our answer was stable.
4 This partition of the red soil is not reflected in the “true” labels. It would be of interest to reevaluate the labeling and try to identify the features that differentiate the two categories of red soil that were discovered by our method.
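The purity figure quoted above admits several conventions; the sketch below (ours) uses one common definition, the fraction of clustered points whose ground-truth label is the majority label of their cluster, which may differ in detail from the authors' exact measure:

```python
import numpy as np

def purity(cluster_ids, true_labels):
    """Fraction of clustered points whose true label is the majority label
    of their cluster. Convention assumed here: cluster_ids < 0 marks
    unclassified points, and true_labels are nonnegative integers."""
    clustered = cluster_ids >= 0
    correct = 0
    for c in np.unique(cluster_ids[clustered]):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # size of the majority label
    return correct / clustered.sum()
```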
Figure 10: The LANDSAT data reveal a hierarchical structure. The numbers in parentheses correspond to the phase transitions indicated by a peak in the susceptibility (see Figure 9).
5.6 Isolated-Letter Speech Recognition. In the isolated-letter speech-recognition task, the "name" of a single letter is pronounced by a speaker. The resulting audio signal is recorded for all letters of the English alphabet for many speakers. The task is to find the structure of the data, which is expected to be a hierarchy reflecting the similarity that exists between different groups of letters, such as {B, D} or {M, N}, which differ only in a single articulatory feature. This analysis could be useful, for instance, to determine to what extent the chosen features succeed in differentiating the spoken letters. We used the ISOLET database of 7797 examples created by Ron Cole (Fanty & Cole, 1991), which is available at the UCI Machine Learning Repository (Murphy & Aha, 1994). The data were recorded from 150 speakers balanced for sex and representing many different accents and English dialects. Each speaker pronounced each of the twenty-six letters twice (there are three examples missing). Cole's group has developed a set of 617 features describing each example. All attributes are continuous and scaled into the range −1 to 1. The features include spectral coefficients, contour features,
and sonorant, presonorant, and postsonorant features. The order of appearance of the features is not known. We applied the SPC method and obtained the susceptibility curve shown in Figure 11a and the cluster size versus temperature curve presented in Figure 11b. The resulting partitioning obtained at different temperatures can be cast in hierarchical form, as presented in Figure 12a. We also tried the projection pursuit method, but none of the first six two-dimensional projections succeeded in revealing any relevant characteristic of the structure of the data. In assessing the extent to which the SPC method succeeded in recovering the structure of the data, we built a "true" hierarchy by using the known labels of the examples. To do this, we first calculate the center of each class (letter) by averaging over all the examples belonging to it. Then a 26 × 26 matrix of the distances between these centers is constructed. Finally, we apply the single-link method to construct a hierarchy, using this proximity matrix. The result is presented in Figure 12b. The purity of the clustering was again very high (93 percent), and 35 percent of the samples were left as unclassified points. The cophenetic correlation coefficient validation index (Jain & Dubes, 1988) is equal to 0.98 for this graph, which indicates that this hierarchy fits the data very well. Since our method does not have a natural length scale defined at each resolution, we cannot use this index for our tree. Nevertheless, the good quality of our tree, presented in Figure 12a, is indicated by the good agreement between it and the tree of Figure 12b. Note, however, that in order to construct the reference tree depicted in Figure 12b, the correct label of each point must be known.
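The construction of the reference tree just described can be sketched as follows (a hypothetical re-implementation: SciPy's hierarchical-clustering routines stand in for whatever code the authors used, and X, labels denote the 7797 × 617 feature matrix and the letter labels):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def reference_tree(X, labels):
    """The 'true' hierarchy of Figure 12b: average the examples of each
    letter into a center, then single-link the 26 x 26 center distances."""
    classes = np.unique(labels)
    centers = np.vstack([X[labels == c].mean(axis=0) for c in classes])
    condensed = pdist(centers)               # condensed form of the 26 x 26 proximity matrix
    Z = linkage(condensed, method='single')  # single-link dendrogram
    coph_corr, _ = cophenet(Z, condensed)    # the validation index cited above
    return Z, classes, coph_corr
```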
6 Complexity and Computational Overhead

Nonparametric clustering is performed in two main stages:

Stage 1: Determination of the geometrical structure of the problem. Basically, a number of nearest neighbors of each point has to be found, using any reasonable algorithm, such as identifying the points lying inside a sphere of a given radius or taking a given number of closest neighbors (as in the SPC algorithm).

Stage 2: Manipulation of the data. Each method is characterized by a specific processing of the data.

For almost all methods, including SPC, the complexity is determined by the first stage, because it requires more computational effort than the data manipulation itself. Finding the nearest neighbors is an expensive task; the complexity of branch-and-bound algorithms (Kamgar-Parsi & Kanal, 1985) is of order $O(N^{\nu}\log N)$ with $1 < \nu < 2$. Since this operation is common to all nonparametric clustering methods, any extra computational overhead our algorithm may have over some other nonparametric method must be due to the difference between the costs of the manipulations performed beyond this stage.
Figure 11: (a) Susceptibility density as a function of the temperature for the isolated-letter speech-recognition data. (b) Size of the four biggest clusters returned by the algorithm for each temperature.
Figure 12: Isolated-letter speech-recognition hierarchy obtained by (a) the superparamagnetic method and (b) using the labels of the data and assuming each letter is well represented by a center.
The second stage of the SPC method consists of equilibrating a system at each temperature. In general, the complexity is of order N (Binder & Heermann, 1988; Gould & Tobochnik, 1988).

Scaling with N. The main reason for choosing an unfrustrated ferromagnetic system, rather than a spin glass (where negative interactions are allowed), is that ferromagnets reach thermal equilibrium very fast.
Very efficient Monte Carlo algorithms (Wang & Swendsen, 1990; Wolff, 1989; Kandel & Domany, 1991) were developed for these systems, in which the number of sweeps needed for thermal equilibration is small at the temperatures of interest. The number of operations required for each SW Monte Carlo sweep scales linearly with the number of edges; it is of order K × N (Hoshen & Kopelman, 1976). In all the examples of this article we used a fixed number of sweeps (M = 1000). Therefore, the fact that the SPC method relies on a stochastic algorithm does not prevent it from being efficient.

Scaling with D. The equilibration stage does not depend on the dimension of the data, D. In fact, it is not necessary to know the dimensionality of the data at all, as long as the distances between neighboring points are known. Since the complexity of the equilibration stage is of order N and does not scale with D, the complexity of the method is determined by the search for the nearest neighbors. Therefore, we conclude that the complexity of our method does not exceed that of the most efficient deterministic nonparametric algorithms.

For the sake of concreteness, we present the running times of the second stage on an HP-9000 (series K200) machine for two problems: the LANDSAT and ISOLET data. The corresponding running times were 1.97 and 2.53 minutes per temperature, respectively (0.12 and 0.15 sec per sweep per temperature). Note that there is good agreement with the discussion presented above: the ratio of the CPU times is close to the ratio of the corresponding total numbers of edges (18,388 in the LANDSAT and 22,471 in the ISOLET data set), and there is no dependence on the dimensionality. Typical runs involve about twenty temperatures, which leads to 40 and 50 minutes of CPU, respectively. This number of temperatures can be significantly reduced by using the Monte Carlo histogram method (Swendsen, 1993), where a set of simulations at a small number of temperatures suffices to calculate thermodynamic averages for the complete temperature range of interest. Of all the deterministic methods we used, the most efficient is the minimal spanning tree. Once the tree is built, it requires only 19 and 23 seconds of CPU, respectively, for each set of clustering parameters. However, the actual running time is determined by how long one spends searching for the optimal parameters in the (three-dimensional) parameter space of the method. The other nonparametric methods presented in this article were not optimized, and therefore comparison of their running times could be misleading. For instance, we used Johnson's algorithm for implementing the single and complete linkage, which requires $O(N^3)$ operations for recovering the whole hierarchy, but faster versions, based on minimal spanning trees, require fewer operations. Running Friedman's projection pursuit algorithm,5 whose results are presented in Figure 8, required 55 CPU minutes for LANDSAT.
5 We thank Jerome Friedman for allowing public use of his program.
For the ISOLET data (where D = 617), the difference was dramatic: projection pursuit required more than a week of CPU time, while SPC required about 1 hour. The reason is that our algorithm does not scale with the dimension of the data D, whereas the complexity of projection pursuit increases very fast with D.

7 Discussion

This article proposes a new approach to nonparametric clustering, based on a physical, magnetic analogy. The mapping onto the magnetic problem is very simple. A Potts spin is assigned to each data point, and short-range ferromagnetic interactions between spins are introduced, whose strength decreases with distance. The thermodynamic system defined in this way presents different self-organizing regimes, and the parameter that determines the behavior of the system is the temperature. As the temperature is varied, the system undergoes many phase transitions. The idea is that each phase reflects a particular data structure related to a particular length scale of the problem. Basically, the clustering obtained at one temperature belonging to a specific phase should not differ substantially from the partition obtained at another temperature in the same phase. On the other hand, the clusterings obtained at two temperatures corresponding to different phases must be significantly different, reflecting different organizations of the data. These ordering properties are reflected in the susceptibility χ and the spin-spin correlation function $G_{ij}$. The susceptibility turns out to be very useful for signaling the transitions between the different phases of the system. The correlation function $G_{ij}$ is used as a similarity index, whose value is determined both by the distance between sites $v_i$ and $v_j$ and by the density of points near and between these sites. The separation of the spin-spin correlations $G_{ij}$ into strong and weak, as evident in Figure 3b, reflects the existence of two categories of collective behavior. In contrast, as shown in Figure 3a, the frequency distribution of distances $d_{ij}$ between neighboring points of Figure 1 does not even hint that a natural cut-off distance, separating neighboring points into two categories, exists. Since the double-peaked shape of the correlations' distribution persists at all relevant temperatures, the separation into strong and weak correlations is a robust property of the proposed Potts model. The procedure is stochastic, since we use a Monte Carlo procedure to measure the different properties of the system, but it is completely insensitive to initial conditions. Moreover, the cluster distribution as a function of the temperature is known. Basically, there is a competition between the positive interaction, which encourages the spins to be aligned (the energy, which appears in the exponential of the Boltzmann weight, is minimal when all points belong to a single cluster), and the thermal disorder, which assigns a "bonus" that grows exponentially with the number of uncorrelated spins and, hence, with the number of clusters. This method is robust in the presence of noise and is able to recover
the hierarchical structure of the data without enforcing the presence of clusters. The superparamagnetic method is also successful in real-life problems, where existing methods fail to overcome the difficulties posed by the existence of different density distributions and many characteristic lengths in the data. Finally, we wish to reemphasize the aspect we view as the main advantage of our method: its generic applicability. It is likely and natural to expect that for just about any underlying distribution of data, one will be able to find a particular method, tailor-made to handle that particular distribution, whose performance will be better than that of SPC. If, however, there is no advance knowledge of this distribution, one cannot know which of the existing methods fits best and should be trusted. SPC, on the other hand, will find any lumpiness of the underlying data (if it exists), without any fine-tuning of its parameters.

Appendix: Clusters and the Potts Model

The Potts model can be mapped onto a random cluster problem (Fortuin & Kasteleyn, 1972; Coniglio & Klein, 1981; Edwards & Sokal, 1988). In this formulation, clusters are defined as connected graph components governed by a specific probability distribution. We present this alternative formulation here in order to give another motivation for the superparamagnetic method, as well as to facilitate its comparison with graph-based clustering techniques. Consider the following graph-based model whose basic entities are bond variables $n_{ij} = 0, 1$ residing on the edges $\langle i,j \rangle$ connecting neighboring sites $v_i$ and $v_j$. When $n_{ij} = 1$, the bond between sites $v_i$ and $v_j$ is "occupied," and when $n_{ij} = 0$, the bond is "vacant." Given a configuration $\mathcal{N} = \{n_{ij}\}$, random clusters are defined as the vertices of the connected components of the occupied bonds (where a vertex connected only to vacant bonds is considered a cluster containing a single point). The random cluster model is defined by the probability distribution

$$W(\mathcal{N}) = \frac{q^{C(\mathcal{N})}}{Z} \prod_{\langle i,j \rangle} p_{ij}^{n_{ij}} (1 - p_{ij})^{1 - n_{ij}}, \tag{A.1}$$
where $C(\mathcal{N})$ is the number of clusters of the given bond configuration, the partition sum Z is a normalization constant, and the parameters satisfy $0 \le p_{ij} \le 1$. The case q = 1 is the percolation model, where the joint probability of equation A.1 factorizes into a product of independent factors for each $n_{ij}$. Thus, the state of each bond is independent of the state of any other bond. This implies, for example, that the most probable state is found simply by setting $n_{ij} = 1$ if $p_{ij} > 0.5$ and $n_{ij} = 0$ otherwise. By choosing q > 1, the weight of a bond configuration $\mathcal{N}$ is no longer the product of local independent factors.
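To make the weight of equation A.1 concrete, here is a sketch (ours, not from the paper) that evaluates the unnormalized weight of a bond configuration, counting $C(\mathcal{N})$ with a union-find pass over the occupied bonds:

```python
def unnormalized_weight(n_sites, edges, n, p, q):
    """Unnormalized weight of equation A.1 for a bond configuration:
    q**C(N) * prod_e p[e]**n[e] * (1 - p[e])**(1 - n[e]).
    edges is a list of site pairs (i, j); n[e] and p[e] refer to edge e."""
    parent = list(range(n_sites))
    def find(i):                        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    w = 1.0
    for e, (i, j) in enumerate(edges):
        w *= p[e] if n[e] else (1.0 - p[e])
        if n[e]:                        # an occupied bond merges two clusters
            parent[find(i)] = find(j)
    C = len({find(i) for i in range(n_sites)})  # number of random clusters
    return q ** C * w
```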
Instead, the weight of a configuration is also influenced by the spatial distribution of the occupied bonds, since configurations with more random clusters are given a higher weight. For instance, it may happen that a bond $n_{ij}$ is likely to be vacant while a bond $n_{kl}$ is likely to be occupied, even though $p_{ij} = p_{kl}$. This can occur if the vacancy of $n_{ij}$ enhances the number of random clusters, while sites $v_k$ and $v_l$ are connected through occupied bonds other than $n_{kl}$. Surprisingly, there is a deep connection between the random cluster model and the seemingly unrelated Potts model. The basis for this connection (Edwards & Sokal, 1988) is a joint probability distribution of Potts spins and bond variables:

$$P(S, \mathcal{N}) = \frac{1}{Z} \prod_{\langle i,j \rangle} \left[ (1 - p_{ij})(1 - n_{ij}) + p_{ij}\, n_{ij}\, \delta_{s_i, s_j} \right]. \tag{A.2}$$
The marginal probability $W(\mathcal{N})$ is obtained by summing $P(S, \mathcal{N})$ over all Potts spin configurations. On the other hand, by setting

$$p_{ij} = 1 - \exp\left(-\frac{J_{ij}}{T}\right) \tag{A.3}$$
and summing $P(S, \mathcal{N})$ over all bond configurations, the marginal probability of equation 2.3 is obtained. The mapping between the Potts spin model and the random cluster model implies that the superparamagnetic clustering method can be formulated in terms of the random cluster model. One way to see this is to realize that the SW clusters are actually the random clusters: the prescription given in Section 3 for generating the SW clusters is precisely the conditional probability $P(\mathcal{N}|S) = P(S, \mathcal{N})/P(S)$. Therefore, by sampling the spin configurations obtained in the Monte Carlo sequence (according to the probability P(S)), the bond configurations obtained are generated with probability $W(\mathcal{N})$. In addition, recall that the Potts spin-spin correlation function $G_{ij}$ is measured by using equation 4.3, relying on the statistics of the SW clusters. Since the clusters are obtained through the spin-spin correlations, they can be determined directly from the random cluster model.
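A minimal sketch of one Swendsen-Wang sweep under this formulation (ours; it draws bonds from $P(\mathcal{N}|S)$ using equation A.3 and labels the resulting random clusters with a union-find pass):

```python
import numpy as np

def sw_sweep(spins, edges, J, T, q, rng):
    """One Swendsen-Wang sweep: an edge <i,j> is occupied with probability
    p_ij = 1 - exp(-J_ij / T), and only if s_i == s_j (the delta function
    in equation A.2); each resulting random cluster is then assigned one
    new spin value drawn uniformly from {0, ..., q-1}."""
    parent = list(range(len(spins)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for e, (i, j) in enumerate(edges):
        if spins[i] == spins[j] and rng.random() < 1.0 - np.exp(-J[e] / T):
            parent[find(i)] = find(j)
    new_state = {}                       # one fresh Potts state per cluster
    for i in range(len(spins)):
        root = find(i)
        if root not in new_state:
            new_state[root] = rng.integers(q)
        spins[i] = new_state[root]
    return spins
```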
One of the most salient features of the superparamagnetic method is its probabilistic approach, as opposed to the deterministic one taken in other methods. Such deterministic schemes can indeed be recovered in the zero-temperature limit of this formulation (see equations A.1 and A.3); at T = 0, only the bond configuration $\mathcal{N}_0$ corresponding to the ground state appears with nonvanishing probability. Some of the existing clustering methods can be formulated as deterministic percolation models (T = 0, q = 1). For instance, the percolation method proposed by Dekel and West (1985) is obtained by choosing the coupling between spins $J_{ij} = \theta(R - d_{ij})$; that is, the interaction between spins $s_i$ and $s_j$ is equal to one if their separation is smaller than the clustering parameter R and zero otherwise. Moreover, the single-link hierarchy (see, for example, Jain & Dubes, 1988) is obtained by varying the clustering parameter R. Clearly, in these processes the reward on the number of clusters is ruled out, and therefore only pairwise information is used. Jardine and Sibson (1971) attempted to list the essential characteristics of useful clustering methods and concluded that the single-link method was the only one that satisfied all the mathematical criteria. In practice, however, it performs poorly, because single-link clusters easily chain together and are often straggly; only a single connecting edge is needed to merge two large clusters. To some extent, the superparamagnetic method overcomes this problem by introducing a bonus on the number of clusters, which is reflected in the fact that the system prefers to break apart clusters that are connected by a small number of bonds. Fukunaga's (1990) valley-seeking method is recovered in the case q > 1 with interaction between spins $J_{ij} = \theta(R - d_{ij})$. In this case, the Hamiltonian (see equation 2.1) is just the class separability measure of this algorithm, and a Metropolis relaxation at T = 0 is used to minimize it. The relaxation process terminates at some local minimum of the energy function, and points with the same spin value are assigned to a cluster. This procedure depends strongly on the initial conditions and is likely to stop at a metastable state that does not correspond to the correct answer.
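The deterministic percolation limit (T = 0, q = 1) described above can be sketched directly (a hypothetical re-implementation, not the authors' code): thresholding the pairwise distances at R and taking connected components reproduces the Dekel-West clustering, and sweeping R upward traces out the single-link hierarchy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def percolation_clusters(X, R):
    """Dekel-West percolation clustering: with J_ij = theta(R - d_ij),
    points closer than R are linked, and the clusters are the connected
    components of the resulting threshold graph."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adjacency = csr_matrix(d < R)        # self-loops on the diagonal are harmless
    _, labels = connected_components(adjacency, directed=False)
    return labels                        # one cluster id per point
```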
Acknowledgments

We thank I. Kanter for many useful discussions. This research has been supported by the Germany-Israel Science Foundation.

References

Ahuja, N. (1982). Dot pattern processing using Voronoi neighborhood. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-4, 336–343.
Ball, G., & Hall, D. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153–155.
Baraldi, A., & Parmiggiani, F. (1995). A neural network for unsupervised categorization of multivalued input patterns: An application to satellite image clustering. IEEE Transactions on Geoscience and Remote Sensing, 33(2), 305–316.
Baxter, R. J. (1973). Potts model at the critical temperature. Journal of Physics, C6, L445–448.
Binder, K., & Heermann, D. W. (1988). Monte Carlo simulations in statistical physics: An introduction. Berlin: Springer-Verlag.
Billoire, A., Lacaze, R., Morel, A., Gupta, S., Irbäck, A., & Petersson, B. (1991). Dynamics near a first-order phase transition with the Metropolis and Swendsen-Wang algorithms. Nuclear Physics, B358, 231–248.
Blatt, M., Wiseman, S., & Domany, E. (1996a). Super-paramagnetic clustering of data. Physical Review Letters, 76, 3251–3255.
Blatt, M., Wiseman, S., & Domany, E. (1996b). Clustering data through an analogy to the Potts model. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, p. 416). Cambridge, MA: MIT Press.
Blatt, M., Wiseman, S., & Domany, E. (1996c). Method and apparatus for clustering data. U.S. patent application.
Buhmann, J. M., & Kühnel, H. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39, 1133.
Chen, S., Ferrenberg, A. M., & Landau, D. P. (1992). Randomness-induced second-order transitions in the two-dimensional eight-state Potts model: A Monte Carlo study. Physical Review Letters, 69(8), 1213–1215.
Coniglio, A., & Klein, W. (1981). Thermal phase transitions at the percolation threshold. Physics Letters, A84, 83–84.
Cranias, L., Papageorgiou, H., & Piperidis, S. (1994). Clustering: A technique for search space reduction in example-based machine translation. In Proceedings of the 1994 IEEE International Conference on Systems, Man, and Cybernetics. Humans, Information and Technology (Vol. 1, pp. 1–6). New York: IEEE.
Dekel, A., & West, M. J. (1985). On percolation as a cosmological test. Astrophysical Journal, 288, 411–417.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience.
Edwards, R. G., & Sokal, A. D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review D, 38, 2009–2012.
Faber, V., Hochberg, J. G., Kelly, P. M., Thomas, T. R., & White, J. M. (1994). Concept extraction, a data-mining technique. Los Alamos Science, 22, 122–149.
Fanty, M., & Cole, R. (1991). Spoken letter recognition. In R. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 220–226). San Mateo, CA: Morgan-Kaufmann.
Foote, J. T., & Silverman, H. F. (1994). A model distance measure for talker clustering and identification. In Proceedings of the 1994 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 317–320). New York: IEEE.
Fortuin, C. M., & Kasteleyn, P. W. (1972). On the random-cluster model. Physica (Utrecht), 57, 536–564.
Fowlkes, E. B., Gnanadesikan, R., & Kettering, J. R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Fu, Y., & Anderson, P. W. (1986). Applications of statistical mechanics to NP-complete problems in combinatorial optimization. Journal of Physics A: Math. Gen., 19, 1605–1620.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego: Academic Press.
Gdalyahu, Y., & Weinshall, D. (1997). Local curve matching for object recognition without prior knowledge. In Proceedings of the DARPA Image Understanding Workshop, New Orleans, May 1997.
Gould, H., & Tobochnik, J. (1988). An introduction to computer simulation methods, part II. Reading, MA: Addison-Wesley.
Gould, H., & Tobochnik, J. (1989). Overcoming critical slowing down. Computers in Physics, 29, 82–86.
Hennecke, M., & Heyken, U. (1993). Critical dynamics of cluster algorithms in the dilute Ising model. Journal of Statistical Physics, 72, 829–844.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hoshen, J., & Kopelman, R. (1976). Percolation and cluster distribution. I. Cluster multiple labeling technique and critical concentration algorithm. Physical Review, B14, 3438–3445.
Iokibe, T. (1994). A method for automatic rule and membership function generation by discretionary fuzzy performance function and its application to a practical system. In R. Hall, H. Ying, I. Langari, & O. Yen (Eds.), Proceedings of the First International Joint Conference of the North American Fuzzy Information Processing Society Biannual Conference, the Industrial Fuzzy Control and Intelligent Systems Conference, and the NASA Joint Technology Workshop on Neural Networks and Fuzzy Logic (pp. 363–364). New York: IEEE.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.
Jardine, N., & Sibson, R. (1971). Mathematical taxonomy. New York: Wiley.
Kamata, S., Eason, R. O., & Kawaguchi, E. (1991). Classification of Landsat image data and its evaluation using a neural network approach. Transactions of the Society of Instrument and Control Engineers, 27(11), 1302–1306.
Kamata, S., Kawaguchi, E., & Niimi, M. (1995). An interactive analysis method for multidimensional images using a Hilbert curve. Systems and Computers in Japan, 27, 83–92.
Kamgar-Parsi, B., & Kanal, L. N. (1985). An improved branch and bound algorithm for computing K-nearest neighbors. Pattern Recognition Letters, 3, 7–12.
Kandel, D., & Domany, E. (1991). General cluster Monte Carlo dynamics. Physical Review, B43, 8539–8548.
Karayiannis, N. B. (1994). Maximum entropy clustering algorithms and their application in image compression. In Proceedings of the 1994 IEEE International Conference on Systems, Man, and Cybernetics. Humans, Information and Technology (Vol. 1, pp. 337–342). New York: IEEE.
Kelly, P. M., & White, J. M. (1993). Preprocessing remotely sensed data for efficient analysis and classification. In SPIE Applications of Artificial Intelligence 1993: Knowledge Based Systems in Aerospace and Industry (pp. 24–30). Washington, DC: International Society for Optical Engineering.
Kirkpatrick, S., Gelatt Jr., C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kosaka, T., & Sagayama, S. (1994). Tree-structured speaker clustering for fast speaker adaptation. In Proceedings of the 1994 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 245–248). New York: IEEE.
Larch, D. (1994). Genetic algorithms for terrain categorization of Landsat. In Proceedings of the SPIE—The International Society for Optical Engineering (pp. 2–6). Washington, DC: International Society for Optical Engineering.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symp. Math. Stat. Prob. (Vol. I, pp. 281–297). Berkeley: University of California Press.
Mézard, M., & Parisi, G. (1986). A replica analysis of the traveling salesman problem. Journal de Physique, 47, 1285–1286.
Miller, D., & Rose, K. (1996). Hierarchical, unsupervised learning with growing via phase transitions. Neural Computation, 8, 425–450.
Moody, J., & Darken, C. J. (1989). Fast learning in neural networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases. http://www.ics.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
Niedermayer, F. (1990). Improving the improved estimators in O(N) spin models. Physics Letters, B237, 473–475.
Nowlan, J. S., & Hinton, G. E. (1991). Evaluation of adaptive mixtures of competing experts. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 774–780). San Mateo, CA: Morgan-Kaufmann.
Phillips, W. E., Velthuizen, R. P., Phuphanich, S., Hall, L. O., Clarke, L. P., & Silbiger, M. L. (1995). Application of fuzzy c-means segmentation technique for tissue differentiation in MR images of a hemorrhagic glioblastoma multiforme. Magnetic Resonance Imaging, 13, 277–290.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65, 945–948.
Srinivasan, A. (1994). UCI repository of machine learning databases (University of California, Irvine), maintained by P. Murphy and D. Aha.
Suzuki, M., Shibata, M., & Suto, Y. (1995). Application to the anatomy of the volume rendering–reconstruction of sequential tissue sections for rat's kidney. Medical Imaging Technology, 13, 195–201.
Swendsen, R. H. (1993). Modern methods of analyzing Monte Carlo computer simulations. Physica, A194, 53–62.
Swendsen, R. H., Wang, S., & Ferrenberg, A. M. (1992). New Monte Carlo methods for improved efficiency of computer simulations in statistical mechanics. In K. Binder (Ed.), The Monte Carlo Method in Condensed Matter Physics (pp. 75–91). Berlin: Springer-Verlag.
Wang, S., & Swendsen, R. H. (1990). Cluster Monte Carlo algorithms. Physica, A167, 565–579.
Wiseman, S., Blatt, M., & Domany, E. (1996). Unpublished manuscript.
Wolff, U. (1989). Comparison between cluster Monte Carlo algorithms in the Ising spin model. Physics Letters, B228, 379–382.
Wong, Y.-F. (1993). Clustering data by melting. Neural Computation, 5, 89–104.
Wu, F. Y. (1982). The Potts model. Reviews of Modern Physics, 54(1), 235–268.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.

Received July 26, 1996; accepted January 15, 1997.